Turning off WiFi doesn't set a route after the modem connects data

Bug #1436427 reported by Michael Zanetti
This bug affects 8 people
Affects | Status | Importance | Assigned to | Milestone
Canonical System Image | Fix Released | High | Canonical Phone Foundations |
network-manager (Ubuntu) | Incomplete | High | Tony Espy |
network-manager (Ubuntu RTM) | Incomplete | High | Tony Espy |
ofono (Ubuntu) | Fix Released | High | Tony Espy |
ofono (Ubuntu RTM) | Fix Released | High | Tony Espy |

Bug Description

I just switched off WiFi in order to test if the device switches successfully to a mobile data connection. After disabling WiFi the indicator showed "H" in no time. So it would seem all is fine. I opened the browser and no data went through.

Here's the output of "ip route" and "list-contexts":

phablet@ubuntu-phablet:~$ sudo ip route
[sudo] password for phablet:
phablet@ubuntu-phablet:~$ /usr/share/ofono/scripts/list-contexts
[ /ril_1 ]
[ /ril_0 ]
    [ /ril_0/context1 ]
        Name = E-Plus Web GPRS
        Settings = { Netmask=255.255.255.0 Address=10.121.30.213 Interface=ccmni0 Method=static DomainNameServers=212.23.103.8,212.23.103.9, Gateway=10.121.30.213 }
        Username = eplus
        IPv6.Settings = { }
        Protocol = ip
        Active = 1
        Password = internet
        Type = internet
        AccessPointName = internet.eplus.de

    [ /ril_0/context2 ]
        IPv6.Settings = { }
        Name = E-Plus MMS
        MessageProxy = 212.23.97.153:5080
        MessageCenter = http://mms/eplus/
        Username = mms
        Settings = { }
        Protocol = ip
        Active = 0
        Password = eplus
        Type = mms
        AccessPointName = mms.eplus.de

    [ /ril_0/context3 ]
        Name = ___ubuntu_custom_apn_internet
        Settings = { }
        Username =
        IPv6.Settings = { }
        Protocol = ip
        Active = 0
        Password =
        Type = internet
        AccessPointName =

phablet@ubuntu-phablet:~$ system-image-cli -i
current build number: 21
device name: krillin
channel: ubuntu-touch/rc/bq-aquaris.en
last update: 2015-03-13 17:11:07
version version: 21
version ubuntu: 20150312
version device: 20150310-3201c0a
version custom: 20150216-561-29-186
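
The mismatch shown above (a context with Active = 1 on ccmni0, but an empty "ip route") can be checked mechanically. The following is only a minimal sketch: the function names are hypothetical, and it assumes the list-contexts output keeps the layout pasted above, which is not guaranteed across ofono versions.

```python
import re

# Matches the "[ /ril_N/contextM ]" header line that starts each context.
CONTEXT_HEADER = re.compile(r"^\s*\[ /ril_\d+/context\d+ \]\s*$", re.M)


def active_context_interfaces(list_contexts_output):
    """Return the Interface= value of every context reporting Active = 1."""
    interfaces = []
    # Everything before the first context header is dropped by the [1:] slice.
    for block in CONTEXT_HEADER.split(list_contexts_output)[1:]:
        active = re.search(r"^\s*Active = 1\s*$", block, re.M)
        iface = re.search(r"Interface=(\S+)", block)
        if active and iface:
            interfaces.append(iface.group(1))
    return interfaces


def unrouted_interfaces(list_contexts_output, ip_route_output):
    """Return active context interfaces that no route in 'ip route' names."""
    routed = set(re.findall(r"\bdev (\S+)", ip_route_output))
    return [i for i in active_context_interfaces(list_contexts_output)
            if i not in routed]
```

Fed the two outputs from this description, unrouted_interfaces() would report ccmni0, which is the symptom of this bug.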

Tags: connectivity
description: updated
description: updated
Tony Espy (awe)
Changed in network-manager (Ubuntu):
assignee: nobody → Tony Espy (awe)
Changed in network-manager (Ubuntu RTM):
assignee: nobody → Tony Espy (awe)
Changed in network-manager (Ubuntu):
importance: Undecided → High
Changed in network-manager (Ubuntu RTM):
importance: Undecided → High
Tony Espy (awe)
tags: added: connectivity
Tony Espy (awe)
Changed in network-manager (Ubuntu RTM):
status: New → Confirmed
Tony Espy (awe) wrote :

This sounds similar to bug #1410113, which is a more generic bug. I'm not going to mark this a duplicate however as it has a reproducible scenario.

Also confirmed this bug for RTM, as I reproduced it on the third try. I verified that there was a mobile data connection and that it worked ( ie. I could ping ubuntu.com ), activated a WiFi connection, verified the routing table, and then disabled WiFi. On the third cycle, I ended up with an empty routing table:

phablet@ubuntu-phablet:~$ netstat -run
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface

...that said, the system did recover at some later point: it appears to have magically healed itself, and I now have a valid routing table and a usable mobile data connection.

I will continue to work on reproducing, and will add logs, and other data if/when I can do so again.

phablet@ubuntu-phablet:~$ system-image-cli -i
current build number: 20
device name: krillin
channel: ubuntu-touch/ubuntu-rtm/14.09
last update: 2015-03-26 15:44:56
version version: 20
version ubuntu: 20150312
version device: 20150310-3201c0a
version custom: 20150216-561-29-186

Changed in network-manager (Ubuntu RTM):
importance: High → Critical
Changed in network-manager (Ubuntu):
importance: High → Critical
Tony Espy (awe) wrote :

This doesn't seem to be easily reproducible, which strengthens the theory that this is a race condition between rild and Network Manager.

When the mobile data connection is active, and WiFi connected, the main routing table looks like this:

# ip route show
default via <address> dev wlanX proto static
<address> dev ccmni0 proto static scope link
<address> dev wlanX proto kernel scope link src <address> metric 9

When WiFi is disabled, it looks like this:

# ip route show
default via <address> dev ccmni0 proto static
<address> dev ccmni0 proto static scope link

Note, rild and Network Manager use a different proto value when adding routes ( static vs. kernel ).
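
For anyone scripting a check of the two states above, here is a minimal sketch that parses "ip route show" output and reports which device carries the default route. The function names are hypothetical, and the parser assumes every field after the destination is a "key value" pair, as in the lines above; real output can contain single-word flags (for example "onlink") that this sketch does not handle.

```python
def parse_route(line):
    """Split one 'ip route show' line into a dict of its fields."""
    tokens = line.split()
    route = {"dest": tokens[0]}
    # Assumption: the remaining tokens are "key value" pairs (via, dev, proto...).
    for key, value in zip(tokens[1::2], tokens[2::2]):
        route[key] = value
    return route


def default_device(ip_route_output):
    """Return the device of the default route, or None if there is none."""
    for line in ip_route_output.splitlines():
        if not line.strip():
            continue
        route = parse_route(line)
        if route["dest"] == "default":
            return route.get("dev")
    return None
```

With WiFi up this should return the wlanX device, with WiFi down ccmni0, and None for the empty table seen in this bug.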

Tony Espy (awe) wrote :

Also, due to the use of hybris to disable/enable WiFi on krillin, the WiFi device's numeric component is incremented every time WiFi is power cycled ( eg. wlan0, wlan1, wlan2, ... ).
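
Because of this, any test script should avoid hard-coding wlan0. A minimal sketch of looking the name up instead; the helper name is hypothetical, and reading /sys/class/net is an assumption about how a script might enumerate interfaces:

```python
import os
import re


def wifi_interfaces(names=None):
    """Return wlanN interface names, ordered by their numeric suffix."""
    if names is None:
        # Assumption: network interfaces are listed under /sys/class/net.
        names = os.listdir("/sys/class/net")
    matches = [n for n in names if re.fullmatch(r"wlan\d+", n)]
    return sorted(matches, key=lambda n: int(n[len("wlan"):]))
```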

Tony Espy (awe)
description: updated
Tony Espy (awe) wrote :

I lowered the priority to High, as this problem is hard to reproduce.

Also, the one time I was able to reproduce it, the connection was re-established after some time.

Changed in network-manager (Ubuntu):
importance: Critical → High
Changed in network-manager (Ubuntu RTM):
importance: Critical → High
Tony Espy (awe) wrote :

@Michael

When this happened, did you wait and see whether or not the connection came back on its own?

If not, the next time you see it happen, can you check periodically for a few minutes to see if the connection is restored? There's a 5m internal timeout in NetworkManager that may be involved.

If the connection comes back on its own, then the issue is less serious than if it's stuck in this state.
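
One way to do the periodic check without watching the phone is a small poller. This is only a sketch under stated assumptions: the function name and the get_routes callable are hypothetical (get_routes is expected to return the text of "ip route show"), and the 6 minute default is simply chosen to outlast the 5m timeout mentioned above.

```python
import time


def wait_for_default_route(get_routes, timeout=360.0, interval=15.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll until a default route appears.

    Returns the seconds waited, or None if the timeout expires first.
    """
    start = clock()
    while True:
        lines = get_routes().splitlines()
        if any(line.split()[:1] == ["default"] for line in lines):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

On the device, get_routes could be something like lambda: subprocess.check_output(["ip", "route", "show"], text=True). A None result after six minutes would suggest the connection is stuck rather than just slow to recover.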

Michael Zanetti (mzanetti) wrote :

@Tony

Today I left home twice; once it worked fine, the other time I ran into this again. This roughly matches my experience of running into this every other time I leave home, maybe a bit less. I would estimate it at about 40% of the times I leave my home. Reproducing it manually by turning WiFi off doesn't trigger it as often as walking out of WiFi range.

For me it definitely doesn't recover on its own, at least not with just waiting a few minutes. However, toggling flight mode on the phone usually makes the connection recover for me.

IMO this is quite critical, and even if it would recover on its own after a while, there's nothing more annoying than being on the go and first having to toggle flight mode, or waiting 10 minutes before being able to use the map, or quickly googling something.

Today for example we went for a walk, and after walking for some 20 minutes I wanted to look up the map. Pulled out the phone, the indicator says "H", I open the map and it doesn't load. Knowing that mobile data can be flaky at times, I waited for about 5 minutes for the map, keeping on tapping the screen in order to not get the map app suspended. Then I decided to toggle flight mode which then finally got me some map data. By then obviously we already walked in some wrong direction and were on our way back already. So this issue has the characteristic to always strike when you quickly need some mobile data.

Tony Espy (awe) wrote :

@Michael

Thanks for the feedback.

I'd like to keep this bug specifically for the problem that occurs when WiFi is toggled off ( as per the bug description and summary ). The problem when going out of range of the access point may be something completely different, and is being addressed in bug #1410113.

Also, problems with the location service, while related, should be considered separate too.

Regarding the toggle WiFi problem, I created a stress test to try and reproduce the problem, which I will have reviewed on Monday to ensure that I didn't get anything wrong. The WiFi toggle switch, contrary to my original understanding, doesn't toggle the urfkill switch directly; instead, it toggles the value of the global NetworkManager property 'WirelessEnabled'. NM in turn will enable/disable WiFi via urfkill, which in turn on krillin uses hybris to load/unload the WiFi driver.
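
For reference, the 'WirelessEnabled' property can also be set from a shell via the standard D-Bus Properties interface. The sketch below only builds and runs a dbus-send command line; the Python function names are hypothetical, this is not necessarily how the indicator or my stress test does it, and the call may be rejected depending on the caller's privileges.

```python
import subprocess


def wireless_enabled_command(enabled):
    """Build a dbus-send argv that sets NM's WirelessEnabled property."""
    return [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.NetworkManager",
        "/org/freedesktop/NetworkManager",
        "org.freedesktop.DBus.Properties.Set",
        "string:org.freedesktop.NetworkManager",
        "string:WirelessEnabled",
        "variant:boolean:" + ("true" if enabled else "false"),
    ]


def set_wireless_enabled(enabled, run=subprocess.check_call):
    """Run the command; 'run' is injectable so the call can be dry-run."""
    run(wireless_enabled_command(enabled))
```

Calling set_wireless_enabled(False) should then have NM disable WiFi through urfkill as described above.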

I believe the problem we hit when WiFi is disabled, and the routing table is empty, is caused by a race between rild and NetworkManager, however until I can reproduce, it's just that... a theory.

As mentioned, I was able to reproduce this once two days ago, but haven't managed to reproduce it since. I've run 500 iterations of enable/disable WiFi using my stress test, and haven't yet hit the issue. This is why I reduced the Importance of *this* scenario to High. The out-of-range problem will now be my priority.

Finally, one other bug that may compound this problem is a long-standing issue with the network indicator, which sometimes shows extreme latency in displaying the correct network connection to the user. So even though the indicator may show that you have a mobile data connection, it may not actually be showing the true state of things. See bug #1339792 for more details. There's a concerted effort to fix issues with the indicator; ubuntu silo-6 contains a new version which hopefully will improve this situation.

Tony Espy (awe) wrote :

The attached script is a simple stress test that's run on the phone. It toggles WiFi on, sleeps for 10 seconds, then toggles WiFi off and checks for an empty routing table, and then sleeps for another 10 seconds. It currently has a hard-coded loop count.

The script enables/disables WiFi by toggling NM's 'WirelessEnabled' property.

To work properly, it needs to be run with a previously connected WiFi access point available. It also assumes that a valid SIM card is inserted in slot 1 of a krillin, as it checks the routing table for a 'ccmni0' device when WiFi gets disabled.

Finally, I also usually ensure that the phone will not lock the screen when this test is run by setting the system settings privacy setting such that the phone is never locked.
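
For readers without access to the attachment, the loop described above can be sketched roughly as follows. This is not the attached script: the function name and the set_wifi and get_routes callables are hypothetical stand-ins (for example, a 'WirelessEnabled' setter and a wrapper around "ip route show"), and the order of steps simply follows the description.

```python
import time


def stress_wifi_toggle(set_wifi, get_routes, iterations=10, settle=10.0,
                       modem_dev="ccmni0", sleep=time.sleep):
    """Toggle WiFi repeatedly and record iterations with no modem route.

    Returns the 1-based iteration numbers where, after WiFi was disabled,
    no route in the table named the modem device.
    """
    failures = []
    for i in range(1, iterations + 1):
        set_wifi(True)
        sleep(settle)
        set_wifi(False)
        # An empty table, or one without the modem device, counts as a failure.
        if modem_dev not in get_routes().split():
            failures.append(i)
        sleep(settle)
    return failures
```

An empty list after all iterations means the empty-routing-table state was never observed during that run.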

Sebastien Bacher (seb128) wrote :

I've been hitting what seems to be a similar issue on bq/rtm for some days. Today I turned off wifi to test something from a different ip and I got the 3G icon; nmcli shows ril_1 having an active connection but "ip route" is empty. Several hours later my device is still not getting any data through.

Sebastien Bacher (seb128) wrote :

the wifi was turned off around 12:22 in that log iirc

Tony Espy (awe)
Changed in canonical-devices-system-image:
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
Tony Espy (awe) wrote :

@Sebastien

We have a proposed ofono fix that definitely improves the situation; however, nobody else within Canonical besides yourself and Michael has been able to reproduce on RTM. See bug #1435328 for an update on the analysis so far.

Could you install the version ( 1.12.bzr6894+15.04.20150413.2~rtm-0ubuntu1~awe1 ) of ofono from my personal PPA:

https://launchpad.net/~awe/+archive/ubuntu/ppa

...and also make sure you have network-manager version 0.9.10.0-4ubuntu14 installed. It's currently in silo 009, but should be landing in the archive shortly.

I'm trying to determine if we need a fix to network-manager in RTM as well, per Alfonso's comments in the other bug.

Tony Espy (awe)
Changed in ofono (Ubuntu):
status: New → In Progress
Changed in ofono (Ubuntu RTM):
status: New → In Progress
Changed in ofono (Ubuntu):
importance: Undecided → High
Changed in ofono (Ubuntu RTM):
importance: Undecided → High
assignee: nobody → Tony Espy (awe)
Changed in ofono (Ubuntu):
assignee: nobody → Tony Espy (awe)
Changed in canonical-devices-system-image:
importance: Undecided → High
milestone: none → ww19-ota
status: Confirmed → Fix Committed
Changed in network-manager (Ubuntu):
status: Confirmed → Incomplete
Changed in network-manager (Ubuntu RTM):
status: Confirmed → Incomplete
Changed in ofono (Ubuntu):
status: In Progress → Fix Released
Changed in ofono (Ubuntu RTM):
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
assignee: nobody → Canonical Phone Foundations (canonical-phonedations-team)
Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released
Tony Espy (awe) wrote :

Note, Alfonso just hit the empty routing table bug again today while testing my flight-mode fixes for arale.

After some discussion, we both think that the lxc-android-config NM dispatcher script 02default_route_workaround, which was added for mako only, should probably be removed altogether. My initial suggestion was to add logic so that the script only purged the routing table if the product was mako, but then I discovered that the route added by mako's rild is proto=kernel, not proto=boot ( which is what the script removes ). Maybe we should get rid of it altogether...

That said, in Alfonso's latest case, the routing table was empty when switching mobile data from one SIM to the other. Perhaps it's not the script that's wiping the table, but NM's core routing logic itself. One modem is coming down, and one is going up, it could be that the adding of routes for the new SIM and the removal of routes for the first SIM are colliding.
