[touch] randomly messed up routing

Bug #1307981 reported by Oliver Grawert
This bug affects 4 people
Affects                      Status   Importance  Assigned to  Milestone
lxc-android-config (Ubuntu)  Expired  High        Unassigned
network-manager (Ubuntu)     Expired  High        Unassigned

Bug Description

I have no clue when exactly it started (probably before image 290), but for a while now I have been experiencing random issues where the browser suddenly doesn't find websites anymore. Digging deeper, I can see that the routing table is completely messed up, with two default routes:

root@ubuntu-phablet:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 37.85.159.174 0.0.0.0 UG 0 0 0 rmnet_usb1
0.0.0.0 192.168.2.1 0.0.0.0 UG 0 0 0 wlan0
37.85.159.172 0.0.0.0 255.255.255.252 U 0 0 0 rmnet_usb1
192.168.2.0 0.0.0.0 255.255.255.0 U 9 0 0 wlan0

I did not roam or switch networks; this phone was constantly on wlan in the same room.
After a reboot the routing is normal:

root@ubuntu-phablet:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.2.1 0.0.0.0 UG 0 0 0 wlan0
37.84.75.140 0.0.0.0 255.255.255.252 U 13 0 0 rmnet_usb0
192.168.2.0 0.0.0.0 255.255.255.0 U 9 0 0 wlan0

Oliver Grawert (ogra)
Changed in network-manager (Ubuntu):
importance: Undecided → High
Mathieu Trudel-Lapierre (cyphermox) wrote :

If this happens again, could you get the same routing table but with the 'ip route' command instead?

Oliver Grawert (ogra) wrote :

ogra@styx:~/apps$ adb shell ip route
default via 37.83.18.26 dev rmnet_usb1
default via 192.168.2.1 dev wlan0 proto static
37.83.18.24/30 dev rmnet_usb1 proto kernel scope link src 37.83.18.25
192.168.2.0/24 dev wlan0 proto kernel scope link src 192.168.2.78 metric 9
ogra@styx:~/apps$
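
The broken state shown in the output above (two default routes with no distinguishing metric) can be checked for mechanically. The following is a minimal sketch, assuming the simple keyword/value layout of the 'ip route' lines quoted in this report; the function names are hypothetical, and lines carrying bare flags such as "onlink" would need extra handling.

```python
from collections import defaultdict

def default_routes(ip_route_output):
    """Return (gateway, device, metric) for each 'default ...' line."""
    routes = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        # After 'default', tokens are assumed to be keyword/value pairs:
        # via GW dev IFACE [proto P] [metric M]
        fields = dict(zip(tokens[1::2], tokens[2::2]))
        # 'ip route' omits the metric when it is 0
        metric = int(fields.get("metric", 0))
        routes.append((fields.get("via"), fields.get("dev"), metric))
    return routes

def conflicting_defaults(ip_route_output):
    """Group default routes by metric; keep groups with more than one route."""
    by_metric = defaultdict(list)
    for gateway, dev, metric in default_routes(ip_route_output):
        by_metric[metric].append((gateway, dev))
    return {m: r for m, r in by_metric.items() if len(r) > 1}

# Output copied from the comment above
BROKEN_TABLE = """\
default via 37.83.18.26 dev rmnet_usb1
default via 192.168.2.1 dev wlan0 proto static
37.83.18.24/30 dev rmnet_usb1 proto kernel scope link src 37.83.18.25
192.168.2.0/24 dev wlan0 proto kernel scope link src 192.168.2.78 metric 9
"""

if __name__ == "__main__":
    print(conflicting_defaults(BROKEN_TABLE))
```

For the table above this reports both the rmnet_usb1 and the wlan0 default route under metric 0; for a table with a single default route it returns an empty dict.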

Pat McGowan (pat-mcgowan) wrote :

Happened here after first flash and boot to u3 on mako

phablet@ubuntu-phablet:~$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 10.185.115.145 0.0.0.0 UG 0 0 0 rmnet_usb0
default 10.0.1.1 0.0.0.0 UG 0 0 0 wlan0
10.0.1.0 * 255.255.255.0 U 9 0 0 wlan0
10.185.115.144 * 255.255.255.252 U 13 0 0 rmnet_usb0

phablet@ubuntu-phablet:~$ sudo ping www.google.com
PING www.google.com (74.125.226.244) 56(84) bytes of data.
^C
--- www.google.com ping statistics ---
13 packets transmitted, 0 received, 100% packet loss, time 12003ms

phablet@ubuntu-phablet:~$ sudo ip route
default via 10.185.115.145 dev rmnet_usb0
default via 10.0.1.1 dev wlan0 proto static
10.0.1.0/24 dev wlan0 proto kernel scope link src 10.0.1.39 metric 9
10.185.115.144/30 dev rmnet_usb0 proto kernel scope link src 10.185.115.144 metric 13

Pat McGowan (pat-mcgowan) wrote :

Same symptoms on 5 straight boots

Turning wifi off allowed 3g to work
Turning 3g off allowed wifi to work
Disabled 3g data and rebooted: wlan worked fine

Enabled 3g data and rebooted: back to the network not being reachable

Pat McGowan (pat-mcgowan) wrote :
Oliver Grawert (ogra) wrote :

i just noticed bug 1314410

do not rely on a webapp for testing whether your connectivity is back (restarting the app works fine indeed, but neither the button on the error page nor the refresh option from the HUD does)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
Changed in network-manager (Ubuntu):
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
Alexander Sack (asac) wrote :

pat: the syslog you attached was captured when you were in that state? or after reboot/fixing?

Alexander Sack (asac) wrote :

maybe red herring, but see this entry which probably indicates that stuff doesn't go as expected:

Apr 28 16:55:42 ubuntu-phablet NetworkManager[1212]: <error> [1398718542.745922] [nm-system.c:965] add_ip4_route_to_gateway(): (rmnet_usb0): failed to add IPv4 route to gateway (-12)

Alexander Sack (asac) wrote :

could be red herring, but guess that error could really make nm not do proper default route replace...

in nm-system.c:1043 you see the code that will give up trying to replace the gateway in case the add_ip4_route_to_gateway call fails...

        /* Try adding a direct route to the gateway first */
        gw_route = add_ip4_route_to_gateway (parent_ifindex, ext_gw, parent_mss);
        if (!gw_route)
                goto out;

        /* Try adding the original route again */
        err = replace_default_ip4_route (ifindex, int_gw, mss);
        if (err != 0) {
                nm_netlink_route_delete (gw_route);
                nm_log_err (LOGD_DEVICE | LOGD_VPN | LOGD_IP4,
                            "(%s): failed to set IPv4 default route (pass #2): %d",
                            iface, err);
        } else
                success = TRUE;

out:
        if (gw_route)
                rtnl_route_put (gw_route);
        g_free (iface);

Alexander Sack (asac) wrote :

cypher/tony: maybe worth checking/debugging this code and making it more robust?

Alexander Sack (asac) wrote :

sorry, above i referenced the wrong line number, inside the vpn-specific code ... but the same pattern is also around 1094...

Alexander Sack (asac) wrote :

ignore my posts in #10 and #11... i misread that code... however, the error in #9 seems to be in code that is only run if something with the normal default route replace didn't work...

someone with device and closer to code probably should check this out.

Maybe ofono or the modem driver also tries to do route magic which causes a racy situation here?

Tony Espy (awe) wrote :

The routing, and in particular the switching of the default route, is handled solely by NM.

This was an issue last Summer during initial development of the NM ofono code, but Mathieu believed he'd resolved the issue.

Apparently there's still some sort of race and it's possible that other recent system changes may have caused it to be triggered more easily.

Tony Espy (awe) wrote :

After some discussion with Mathieu yesterday, there's one addition to my previous comment.

On some phones, and in this case mako, rild actually configures the routing table for the mobile network connection. As rild is a binary blob, and there are no parameters available to control this behavior, the NM ofono code was designed with this in mind.

In the normal bringup case, this shouldn't be an issue, as mobile data will be activated and the routes set up, followed by WiFi if enabled, and NM should be able to fix up the routing table correctly.

I suspect what may be happening is that the mobile connection is dropping and then re-establishing itself, resulting in rild re-adding the default route when one already exists for Wi-Fi.

Mathieu Trudel-Lapierre (cyphermox) wrote :

That routing code is a red herring, it's actually never reached in the case where wifi is already activated, because wifi remains the best IP4 connection available (given that it has a higher priority than modems).

From what I see here, it seems like rild is just *always* setting the route, and this is basically just a race between NM and rild: if rild happens to write the route before NM has activated the primary; then you'll get a valid routing table (because the rild route gets replaced by NM's route via wifi), but if it happens after, NM rightfully no longer cares about what happens to the routing. This *is* proper behavior since we can't go trash routes that could be added after the connection by a sysadmin or some other process...
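
The ordering dependence described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the routing table is a plain list, NM is modelled as replacing any existing default route when it activates the primary connection, and rild as blindly appending one. The function names and the gateway addresses (taken from earlier comments) are illustrative, not NM or rild code.

```python
def nm_activate_primary(table, gateway, dev):
    """Toy NM: drop existing default routes, then install its own."""
    table[:] = [route for route in table if route[0] != "default"]
    table.append(("default", gateway, dev))

def rild_add_default(table, gateway, dev):
    """Toy rild: append a default route without looking at the table."""
    table.append(("default", gateway, dev))

def simulate(order):
    """Run the two actors in the given order; return the default routes."""
    table = []
    steps = {
        "nm": lambda: nm_activate_primary(table, "192.168.2.1", "wlan0"),
        "rild": lambda: rild_add_default(table, "37.83.18.26", "rmnet_usb1"),
    }
    for actor in order:
        steps[actor]()
    return [route for route in table if route[0] == "default"]

if __name__ == "__main__":
    print("rild first:", simulate(["rild", "nm"]))
    print("NM first:  ", simulate(["nm", "rild"]))
```

When rild runs first, only the wlan0 default route survives; when NM runs first, both default routes remain, which is the broken table seen in this report.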

I'd very much like to understand why this suddenly became an issue -- I don't think either code base changed much in relation to routing, so something ought to have changed in the environment on mako.

I see three options, listed in order of preference (not necessarily feasibility):

1) Convince rild to never set up the default route. I would have hoped this would be doable via an android property that can be set. I tried to find a property that looked related, but without results yet. It could well not be feasible at all, sadly.

2) Ship a custom dispatcher script in lxc-android-config; to, on purpose, trash the "boot" routing table which rild abuses to do this. I would rather do this as a dispatcher script we ship only on touch, because I think it's explicitly Very Very Wrong to muck with the routing table arbitrarily for other images.

I did already test this as a workaround with the expected successful result, see http://paste.ubuntu.com/7376744/.

3) Find a way to catch the precise, exact route using netlink in the ofono modem plugin of NM; and explicitly RTM_DELROUTE it. This is the safest to ship everywhere, but requires the most effort.

Tony Espy (awe) wrote :

I'm not sure what you mean by "the routing code is a red herring"?

Also, I'd like to get the exact scenario nailed down.

Oliver's description said that he had Wi-Fi enabled the whole time. If so, was the routing table *ever* correct, or was it broken from boot onwards?

So which of the following two scenarios do we think is happening?

After the initial device boot, it has a valid default route pointing at wlan and network access is enabled, then at some point rild comes along and blindly adds a new default route due to the data call dropping and being re-established.

OR

At boot, since WiFi is enabled, both the mobile context and the WiFi connection are started by NM, but the rild activation takes longer than WiFi, so rild wins the race and adds its additional default route.

As for the options listed above...

1. I could find no way to do this...

2. I'm not sure what you mean by "trash the 'boot' routing table..."? What would trigger this script? Would this work for both scenarios mentioned above ( ie. race at initial boot, or mobile link dropping and connection being re-established )?

3. So the basic idea is that you'd add a routing table listener to the NM ofono code, and that this listener would delete a default route added for mobile data if a default route already existed? Isn't there an inherent race ( albeit a small one ) in this approach?

I guess one final question: is there any way to configure our system routing policy to prevent this from being possible ( ie. prevent two default routes from being configured )?

Mathieu Trudel-Lapierre (cyphermox) wrote :

"After the initial device boot, it has a valid default route pointing at wlan and network access is enabled, then at some point rild comes along and blindly adds a new default route due to the data call dropping and being re-established."

This is only a problem when wifi is enabled, because otherwise you don't care what creates the default route for mobile, just that there is one.

There are a few candidate properties for this kind of stuff. Digging more, it's not even quite rild that applies the route, but really more netmgrd (at least, it seems to me like it's the case). This doesn't change the fact that both are binary blobs from qualcomm or wherever, so not much to do about it. There are lots of rild and netmgrd properties though, it's just not well documented what each does.

As for trashing the 'boot' table, it's a one-liner that would be triggered by NM directly on every device activation. NM doesn't touch the 'ip route list proto boot' routes, just kernel routes, so flushing that whole list is safe as long as we know there is nothing else that is expected to use it for a good reason -- that's something we can only say with any degree of confidence on Touch.
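
A rough sketch of what flushing the 'boot' routes would affect. It relies on an assumption worth checking against the installed iproute2: plain 'ip route' output omits the "proto" keyword for routes whose protocol is boot, so lines without "proto" are the ones 'ip route list proto boot' would show. The helper name is hypothetical, and the command at the end is a guess at the one-liner; the actual script is in the paste linked earlier and is not reproduced here.

```python
def boot_routes(ip_route_output):
    """Return the route lines that carry no explicit 'proto' keyword."""
    selected = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if tokens and "proto" not in tokens:
            selected.append(line.strip())
    return selected

# Assumed form of the flush command (not taken from the actual workaround)
FLUSH_BOOT_ROUTES = ["ip", "route", "flush", "proto", "boot"]

# Output copied from an earlier comment in this report
MAKO_TABLE = """\
default via 10.185.115.145 dev rmnet_usb0
default via 10.0.1.1 dev wlan0 proto static
10.0.1.0/24 dev wlan0 proto kernel scope link src 10.0.1.39 metric 9
10.185.115.144/30 dev rmnet_usb0 proto kernel scope link src 10.185.115.144 metric 13
"""

if __name__ == "__main__":
    for route in boot_routes(MAKO_TABLE):
        print("would be flushed:", route)
```

On the table above, only the rmnet_usb0 default route is selected; the "proto static" default route via wlan0 and the "proto kernel" link routes are left alone.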

As for doing this programmatically directly in NM, it's not at all a listener. I want to avoid doing any such kind of route listening, because NM already does it for other things elsewhere, and it's just not necessary. When the modem is activated, we can just try to delete a route defined to be a default gateway via the gateway address we already got from ofono, whether it exists or not.
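
The "delete it whether it exists or not" idea could look roughly like the sketch below. Inside NM this would be a netlink RTM_DELROUTE request; the 'ip route del' command is used here only as a stand-in, and both function names are hypothetical. The gateway and device would come from ofono; the values in the usage note are taken from an earlier comment.

```python
import subprocess

def delete_default_route_argv(gateway, dev):
    """Build the command that removes one specific default route."""
    return ["ip", "route", "del", "default", "via", gateway, "dev", dev]

def delete_default_route(gateway, dev, run=subprocess.run):
    """Attempt the deletion; a missing route is not treated as an error.

    Returns True if the command succeeded (a route was removed) and
    False otherwise (for example when no such route existed).
    """
    result = run(
        delete_default_route_argv(gateway, dev),
        check=False,          # do not raise if the route is absent
        capture_output=True,  # keep the error message off the console
    )
    return result.returncode == 0
```

Calling delete_default_route("10.185.115.145", "rmnet_usb0") on modem activation would need root privileges. As noted in the comment above, this stays racy: it only helps if the route has already been added by the time the deletion runs.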

All of these options remain racy because there is no way to control rild/netmgrd. You can only hope they do their stuff right and don't take too long to run.

You can't specify a routing policy that prevents two default routes, because this is something that is actually valid in some circumstances. The best you can do is either remove the routes after they are added, or prevent rild/netmgrd from adding the routes in the first place, and there currently doesn't seem to be a way to do the latter. I'm still looking at whether it's possible to change security policies to deal with this.

Oliver Grawert (ogra) wrote :

on boot my routing is usually fine ... what i see today is:

- boot with proper default route over WLAN, mobile is enabled but not owning the default route.
- leave the phone next to me in the office or living room (both covered by the same WLAN) and randomly pick up the device to check G+ or read some news in a webapp
- two or three times throughout the day notice that suddenly the app pops up the internal error page, checking the routing then via adb shows the two default routes, disabling and re-enabling WLAN then sets the route properly again.
- while i *have* seen the routing being broken from boot on once or twice, this is a very rare case; usually routing is fine on boot and breaks later (though Pat's comments above indicate that this is different for him)

Tony Espy (awe) wrote :

So a couple more comments...

1. Do we know if NM sees the mobile connection drop, and thus it re-activates the context again, or does the context appear active to NM the whole time and the connection drops and re-establishes itself without action from NM? My guess based upon Oliver's description is that the latter is happening.

2. Android is obviously able to handle devices that configure the routing table ( although I have seen references in the code that rild implementation should *not* do this ). Do we know how Android handles this?

3. How difficult would it be to change the NM policy so that we never have concurrent active mobile and Wi-Fi connections? Our current model is that we leave the mobile connection active when Wi-Fi is activated, and then just switch the default route to Wi-Fi once connected. One final thought along these lines, does this leave us in a situation where an app could have an active network session and continue to use mobile data when Wi-Fi is activated?

Tony Espy (awe) wrote :

Changed Status to FixCommitted, as this has landed in -proposed.

Changed in network-manager (Ubuntu):
status: Confirmed → Fix Committed
Oliver Grawert (ogra) wrote :

lxc-android-config (0.163) utopic; urgency=medium

  * add /etc/NetworkManager/dispatcher.d/02default_route_workaround to make
    sure the default route is always reset when an interface change is
    detected (LP: #1307981)

Changed in lxc-android-config (Ubuntu):
status: New → Fix Released
importance: Undecided → High
Changed in network-manager (Ubuntu):
status: Fix Committed → Won't Fix
Ricardo Salveti (rsalveti) wrote :

Still not fixed it seems, was able to reproduce this issue on mako last friday:

phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2003ms
phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 177.24.28.182 0.0.0.0 UG 0 0 0 rmnet_usb0
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 wlan0
177.24.28.180 0.0.0.0 255.255.255.252 U 0 0 0 rmnet_usb0
192.168.1.0 0.0.0.0 255.255.255.0 U 9 0 0 wlan0

Disabled wlan:

phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
177.24.28.180 0.0.0.0 255.255.255.252 U 0 0 0 rmnet_usb0

Enabled wlan again:

phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 wlan0
177.24.28.180 0.0.0.0 255.255.255.252 U 0 0 0 rmnet_usb0
192.168.1.0 0.0.0.0 255.255.255.0 U 9 0 0 wlan0

phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=53 time=79.0 ms
^C
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 79.078/79.078/79.078/0.000 ms

phablet@ubuntu-phablet:~/build/telepathy-ofono-0.2+14.10.20140725.2/obj-arm-linux-gnueabihf$ ifconfig
lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:44169 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44169 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4674210 (4.6 MB) TX bytes:4674210 (4.6 MB)

rmnet_usb0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:177.24.28.181 Mask:255.255.255.252
          inet6 addr: fe80::61df:6343:8b6c:fcdf/64 Scope:Link
          UP RUNNING MTU:1500 Metric:1
          RX packets:27 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3326 (3.3 KB) TX bytes:2104 (2.1 KB)

wlan0 Link encap:Ethernet HWaddr 10:68:3f:fe:09:8f
          inet addr:192.168.1.61 Bcast:192.168.1.255 Mask:255.255.255.0
          inet6 addr: fe80::1268:3fff:fefe:98f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MT...

summary: - [touch] randomly messed up routing with recent trusty images
+ [touch] randomly messed up routing
Changed in network-manager (Ubuntu):
status: Won't Fix → Confirmed
Mathieu Trudel-Lapierre (cyphermox) wrote :

Could the routes now be in a different table, or added after NM has run the 02default_route_workaround script?

Ideally, when reproduced, this issue should include both /var/log/syslog for the full logs from NetworkManager, as well as the output of the 'ip route' command, rather than 'route -n'. If you need something easier to read (tabular), consider 'routel'.

Please don't use 'route -n'. It's deprecated, and unfortunately next to useless for debugging routing issues at this level.

Something else that could be useful: if you figure out an exact course of action that reproduces the problem, use:

rtmon file /home/phablet/rtmon.log

This records the data: all netlink messages are saved to the file /home/phablet/rtmon.log, which can later be read with:

ip monitor file /home/phablet/rtmon.log

Changed in network-manager (Ubuntu):
status: Confirmed → Incomplete
Changed in lxc-android-config (Ubuntu):
status: Fix Released → Incomplete
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
Mathieu Trudel-Lapierre (cyphermox) wrote :

I don't know that we're still seeing the same routing issues, so unassigning for now.

Furthermore, as I understand it, this may well have been due to devices handling; so on vivid with Tony's patches to ignore both ccmni* and rmnet_usb* devices, we should be all good.

Changed in lxc-android-config (Ubuntu):
assignee: Mathieu Trudel-Lapierre (mathieu-tl) → nobody
Changed in network-manager (Ubuntu):
assignee: Mathieu Trudel-Lapierre (mathieu-tl) → nobody
status: Incomplete → New
status: New → Incomplete
Changed in lxc-android-config (Ubuntu):
status: Incomplete → New
status: New → Incomplete
Tony Espy (awe)
tags: added: connectivity
Tony Espy (awe) wrote :

I'm removing the 'connectivity' tag, as this bug was specifically for the issue where the routing table ends up with multiple conflicting routes, and thus this is a slightly different problem than the current problems seen with RTM ( empty routing table when switching from WiFi to mobile data ). If someone sees this exact problem feel free to comment, otherwise we'll let this one age out...

tags: removed: connectivity
Launchpad Janitor (janitor) wrote :

[Expired for network-manager (Ubuntu) because there has been no activity for 60 days.]

Changed in network-manager (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for lxc-android-config (Ubuntu) because there has been no activity for 60 days.]

Changed in lxc-android-config (Ubuntu):
status: Incomplete → Expired