[touch] Internet connection stops working while WiFi is still connected

Bug #1580146 reported by Andrea Bernabei on 2016-05-10
This bug affects 1 person
Affects                     Importance   Assigned to
Canonical System Image      Undecided    Unassigned
network-manager (Ubuntu)    High         Tony Espy

Bug Description

krillin, rc-proposed, r329

Description:
It often happens that the internet connection stops working while the device is still connected to a (working) WiFi AP.
The indicator shows that the phone is connected to the AP; here's the output of "nmcli d" and "nmcli c": http://pastebin.ubuntu.com/16344985/

Tapping on the AP in the network indicator resets the connection and fixes the issue, which however reappears after a while.

Here's also a "grep NetworkManager" from syslog
http://pastebin.ubuntu.com/16344990/

How to reproduce:
I don't have a recipe, but here's what usually happens on my phone
1) I connect to the office WiFi (Canonical's HQ, BlueFin office)
2) Use browser, perform random google searches to check internet is working
3) After a while (sometimes I put the phone to sleep, sometimes I checked the Updates from system settings a few times), I go back to the browser, and it starts returning "Network Error". NM, as you can see from the logs, says WiFi is "connected"
4) At this point I use the Terminal app and discover that "ping 8.8.8.8" is working correctly, while "ping google.com" immediately returns "unknown host", i.e. it doesn't seem to be waiting for a timeout; it returns pretty fast.

There are no related crash files in /var/crash

Andrea Bernabei (faenil) wrote :

additional info:

ping 8.8.8.8 works,
ping google.com reports unknown host google.com

so the problem could be DNS specific

Tony Espy (awe) wrote :

To be clear from discussions with Andrea, the network is accessible, but it appears that DNS isn't working correctly. He's able to ping an IPv4 address, but trying to ping a hostname results in an immediate 'unknown host' being displayed.
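
A minimal sketch of the check described above, separating "no connectivity at all" from "connectivity but no name resolution". The function names diagnose and check_network are hypothetical, and using getent for the lookup (rather than ping on a hostname) is an assumption made so that the lookup is tested on its own.

```shell
# diagnose IP_RC NAME_RC
# Classify the result from two exit codes: one from pinging an IP
# address, one from resolving a hostname.
diagnose() {
    ip_rc=$1
    name_rc=$2
    if [ "$ip_rc" -ne 0 ]; then
        echo "no-connectivity"   # not even a raw IP is reachable
    elif [ "$name_rc" -ne 0 ]; then
        echo "dns-failure"       # IP reachable, hostname lookup fails
    else
        echo "ok"
    fi
}

# check_network [IP] [HOSTNAME]
# Runs the two probes (defaults taken from the thread) and classifies them.
check_network() {
    ping -c 1 -W 2 "${1:-8.8.8.8}" >/dev/null 2>&1
    ip_rc=$?
    getent hosts "${2:-google.com}" >/dev/null 2>&1
    name_rc=$?
    diagnose "$ip_rc" "$name_rc"
}
```

Running check_network with no arguments on an affected device would be expected to print dns-failure for the WiFi case described here.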

Andrea Bernabei (faenil) wrote :

today, after I switched WiFi off, the phone switched to mobile data, but internet was not working at all.

Differently from the WiFi case, where ping <IP> was working and it seems DNS was the only broken piece, now with mobile data not even ping <IP> is working.

Here are "nmcli d", "nmcli c" and "ip route" logs
https://pastebin.canonical.com/156354/ (sorry, internal only)

Andrea Bernabei (faenil) wrote :

Additional info: I have silo77 (the one with the VPN fix) installed since yesterday, Tony asked me to test it to see if it fixed the bug in the description.

I can now say that silo77 doesn't fix this bug

Tony Espy (awe) on 2016-05-12
Changed in network-manager (Ubuntu):
status: New → Incomplete
Tony Espy (awe) wrote :

@Andrea

Let's keep this bug specific to the original problem described.

Does this problem only occur at the office?

Can you reproduce it reliably? If so, how long after you initially connect does it take to manifest itself?

From what you provided in your initial pastebin, both the indicator and NetworkManager show wlan0 connected to the office access point. You also stated that you could ping 8.8.8.8, but that 'ping google.com' returned 'unknown host'.

The new version of NM now logs the IPv4 details when an IP address is assigned and/or renewed. One possibility here is that lease renewal is not working properly. So, if/when this problem happens again, please provide the following:

1) the entire syslog

2) the output of 'ip addr show wlan0'

3) the output of 'ip route'

If you can again ping IPv4 addresses on the internet, but can't resolve hostnames, then let's check out the dnsmasq configuration. To do this, run the following command:

$ sudo kill -SIGUSR1 `pidof dnsmasq`

This causes dnsmasq to write its current configured nameservers to the syslog, so after sending the signal, grep for "dnsmasq" in syslog and look for the lines like these:

May 12 19:55:35 ubuntu-phablet dnsmasq[2048]: server 192.168.1.1#53: queries sent 19, retried or failed 0
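
As a sketch of the grep step, the following filter pulls the nameserver, port and counters out of lines shaped like the sample above. The function name dnsmasq_servers is hypothetical, and the pattern assumes the log format matches that one sample line exactly.

```shell
# dnsmasq_servers
# Reads syslog text on stdin and prints one line per nameserver entry:
#   <address> <port> <queries sent> <retried or failed>
dnsmasq_servers() {
    sed -n 's/.*dnsmasq\[[0-9]*\]: server \([^ ]*\)#\([0-9]*\): queries sent \([0-9]*\), retried or failed \([0-9]*\).*/\1 \2 \3 \4/p'
}

# Usage (the syslog path is an assumption):
#   dnsmasq_servers < /var/log/syslog
```

If no such lines are printed after sending the signal, that in itself is worth reporting, since it may mean dnsmasq has no upstream nameservers configured.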

Also, in the future, it's better to include requested information ( output from commands, syslog, ... ) directly in comments or attachments vs. using pastebin.

Tony Espy (awe) wrote :

@Andrea

Re: comment #3, please review bug #1533508 to see if you're hitting the same issue. If not, please file a new bug and include output from /usr/share/ofono/scripts/list-modems and list-contexts.

Tony Espy (awe) wrote :

@Andrea

Finally, you might also want to consider "forgetting" some of the saved access points on your device. NetworkManager shows a seriously long list of WiFi connections.

Tony Espy (awe) wrote :

@Andrea

One last thing... did you check for crash files?

Andrea Bernabei (faenil) wrote :

Hi Tony :)

The problem does not happen when I'm connected to my home WiFi AP, it only happens while I'm at the office.

Yesterday I was able to kind of reliably reproduce the bug by doing the following:
1) connect to office WiFi
2) Open System settings -> Updates page 3-4 times
3) Open browser
4) Browser returns "network error"
5) Doublecheck that the bug is active using terminal

The fact that I open Updates page is probably not related, but it seemed to help. I have to try again and see what happens if I just connect and leave the device idle for 10mins and then check via phablet-shell.

Okay, I'll provide the entire syslog (instead of just grep NetworkManager).

Okay, I will use attachments.

Re access points: why? The length of the list of known APs should not impact the behaviour of the system (if it does it's another bug, right?). I did not inject new APs, so I expect all our retail customers to have a list of APs similar to mine, after more than 1 year of daily use.
If you just want me to remove some known APs to help the debugging then I will happily do that, but I first wanted to make sure this is not something we're expecting any user to do. Please let me know your view on this :)

Re crash files: yes, I checked, there are no new crash files

Re the new instructions: thank you :) I will let you have the results once I'm back to the office on Wednesday

Tony Espy (awe) wrote :

@Andrea

Let's try and work together to get the description of the bug ( including steps to reproduce ) accurate.

The initial description was network "stops working" when WiFi is "still connected". This implies the network was actually connected and working before the bug occurred, true? It'd also be nice if we could quantify how long it takes for the network to stop working.

Your steps to reproduce in comment #9 state you see a network error the first time you open the browser. So it sounds like you never had a working internet connection this time around.

Sounds like two different scenarios?

Also, I'm not sure what you mean by "doublecheck that the bug is active using terminal"? If you're running commands to see if the network is active, then please describe what commands you used.

Is it possible to attach the full syslog from your device today?

Re: access points. Every time NetworkManager starts up, it reads all of those connection files ( check out /etc/NetworkManager/system-connections sometime if you're interested ). Whenever WiFi is activated and NM has to decide which access point to associate with, it iterates through its list of connections for every device. Ideally, there'd be an easy way to 'forget all', but there isn't at the moment. This was just a suggestion that could make things work a little better on your phone, take it or leave it.

Re: results; are you only back in the office next Wed ( May 18 )?

Re: problems with mobile data ( comment # 3); did you review bug #1533508 yet and/or file a separate bug?

Andrea Bernabei (faenil) wrote :

I'll update the description. I'm replying from krillin, so I'll only quickly address your questions.

Re 'stops working': Yes, the description is accurate, I can use browser and load pages, then after a while it stops working while still being connected to the AP.

Re steps: Sorry, yes, I first open the browser and try loading a new page.

Re full log: the current log doesn't have those lines anymore; I have to check if I have the old log on my laptop.

Re too many APs: My point is no user would use Forget All, the list of APs should not be a problem. I can delete it, but that's not a solution you can provide to users :) I choose to keep the list and will file a bug if that is the reason why NM misbehaves ;)

Re results: Yes

Re mobile data: Yes, I'm aware of that bug, and I had that as well (at least until yesterday).
I added the info here because it was a bug triggered after the WiFi one, so it could actually be the same one. I haven't had issues with mobile data since updating to rc-proposed r334 this morning (that update should have the latest NM fixes). I'll file a new bug if I still have issues.

summary: - Internet connection stops working while WiFi is still connected
+ [touch] Internet connection stops working while WiFi is still connected
Andrea Bernabei (faenil) on 2016-05-18
description: updated
description: updated
description: updated
Tony Espy (awe) wrote :

So from @Andrea's latest pastebin:

 - NM shows both modem and wlan0 as connected

 - syslog shows one DHCP renewal period

 - a do-add-ip4-address error is logged, however the IP address is still configured ( ip addr show confirms this ); need to investigate whether the error log message is correct, and possibly adjust the text of the message

 - dnsmasq has four nameservers configured ( wlan0: ipv4 & ipv6; modem: two ipv4 )

 - the internet is reachable, as 8.8.8.8 can be pinged

 - DNS lookups fail ( ping google.com responds with 'unknown host' )

 - the DHCP lease interval appears to be just shy of 10m ( 571s )

 - the problem seems to occur after the lease renewal

 - the device may have been asleep between the time the WiFi connection was working and the failure

Andrea Bernabei (faenil) wrote :

I sent Tony a private pastebin (to avoid leaking sensitive data) with the logs he previously requested in comment#5, and he was kind enough to summarize his ongoing investigation in the comment above

additional info:
I rebooted the device and used "powerd-cli active" to prevent it from going to sleep, and could still reproduce the issue.
After 12 minutes since dns stopped working, I still can't "ping google.com". ping 8.8.8.8 still working

Andrea Bernabei (faenil) wrote :

Additional info:
it happened, occasionally, that DNS started working again after some time (and stopped working again shortly after), but I'm not sure what triggered that.

Andrea Bernabei (faenil) wrote :

I noticed at one point dns queries were working again, so I grabbed syslog again.
After a short time DNS stopped working.

At the beginning of the log DNS was *NOT* working (I know this for sure as I used ping google.com after 16:00:37) while at the end of the log DNS was "likely" to be working (because I pinged google.com and it was working, so I grabbed the log. But I can't know if it stopped working between the moment when I finished pinging google and the moment I grabbed the log)

I hope this will help understand what happens between a non-working and working state.

Edit: Tony had a look at the file and said unfortunately it's not enough to fully understand what's going on
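
One way to pin down the moment DNS flips between working and non-working is to probe it on a timer and log only the transitions, so the timestamps can be lined up against syslog (for example against DHCP lease renewals). This is a minimal sketch; the function names probe_dns and log_transitions, the 10 second interval and the use of getent are all assumptions, not something used in this thread.

```shell
# probe_dns [HOSTNAME] [INTERVAL_SECONDS]
# Emits "<HH:MM:SS> ok" or "<HH:MM:SS> fail" forever, one line per probe.
probe_dns() {
    host=${1:-google.com}
    interval=${2:-10}
    while :; do
        if getent hosts "$host" >/dev/null 2>&1; then
            state=ok
        else
            state=fail
        fi
        echo "$(date +%H:%M:%S) $state"
        sleep "$interval"
    done
}

# log_transitions
# Reads "<timestamp> <state>" lines on stdin and prints a line only
# when the state differs from the previous one.
log_transitions() {
    prev=""
    while read -r stamp state; do
        if [ "$state" != "$prev" ]; then
            echo "$stamp $state"
            prev=$state
        fi
    done
}

# Usage: probe_dns google.com 10 | log_transitions
```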

Pat McGowan (pat-mcgowan) wrote :

Is bug #1270189 related?

Changed in canonical-devices-system-image:
assignee: nobody → John McAleely (john.mcaleely)
status: New → Incomplete
Andrea Bernabei (faenil) wrote :

not sure, I'd say not, but let's see what Tony thinks

Andrea Bernabei (faenil) wrote :

I flashed a second Krillin device to rc-proposed r335, the bug was triggered after just a couple of minutes.

Tony advised using tcpdump, so I looked for some useful DNS-related tcpdump commands, and that resulted in the log you can find attached.

INFO about the device:
It has 1 SIM, which is currently locked (as I don't know its PIN, but I can get it if needed)

Andrea Bernabei (faenil) wrote :

I could reproduce the same issue on a Vegeta, rc-proposed, r329

Andrea Bernabei (faenil) wrote :

"Dig" results from the same krillin as comment #18

phablet@ubuntu-phablet:~$ dig google.com

; <<>> DiG 9.9.5-9ubuntu0.5-Ubuntu <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 38228
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A

;; Query time: 24 msec
;; SERVER: 127.0.1.1#53(127.0.1.1)
;; WHEN: Fri May 20 11:34:34 BST 2016
;; MSG SIZE rcvd: 39
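
In the output above, the reply comes from the local resolver at 127.0.1.1 with status REFUSED after 24 msec, which is consistent with the immediate 'unknown host' from ping. A small sketch for pulling just the status out of dig output when collecting results repeatedly; the function name dig_status is hypothetical.

```shell
# dig_status
# Reads dig output on stdin and prints the response status
# (e.g. NOERROR, REFUSED, SERVFAIL) from the HEADER line.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Usage: dig google.com | dig_status
```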

Tony Espy (awe) wrote :

@Andrea

Can you try adding the following setting to the NetworkManager connection file for "Canonical-2.4GHz-g" in /etc/NetworkManager/system-connections:

[ipv6]
method=ignore

"[ipv6]" should already be there, so just add "method=ignore", and remove any other settings under "ipv6".

Restart the device once you've updated the connection file.

Andrea Bernabei (faenil) wrote :

Now testing.

I had
[ipv6]
addr-gen-mode=stable-privacy
dns-search=
method=auto

and changed to
[ipv6]
method=ignore
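
The same edit can be sketched as a filter that rewrites a connection file so that its [ipv6] section contains only method=ignore. The function name set_ipv6_ignore is hypothetical. The sketch assumes the file uses plain [section] headers as shown above, and it only transforms text from stdin to stdout; writing the result back requires root and is best done on a backup copy first.

```shell
# set_ipv6_ignore
# Reads a NetworkManager connection file on stdin and writes it to stdout
# with every setting under [ipv6] replaced by a single "method=ignore".
set_ipv6_ignore() {
    awk '
        /^\[/ {
            # A new section starts: remember whether it is [ipv6].
            in_ipv6 = ($0 == "[ipv6]")
            print
            if (in_ipv6) print "method=ignore"
            next
        }
        # Keep lines from every other section; drop the old ipv6 settings.
        !in_ipv6 { print }
    '
}

# Usage (run as root, on a copy of the connection file):
#   set_ipv6_ignore < connection-file.bak > connection-file
```

As in the instructions above, the device still needs a restart for the change to take effect.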

Andrea Bernabei (faenil) wrote :

@Tony
I'm unable to reproduce the bug with ipv6 method=ignore

Tony Espy (awe) on 2016-06-01
Changed in network-manager (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Tony Espy (awe)
importance: Undecided → High
Andrea Bernabei (faenil) wrote :

still happening as of r361, krillin, rc-proposed

Tony Espy (awe) wrote :

@Andrea

There are two tests I'd like you to run for me.

First, let's verify the baseline by flashing OTA10 ( pre-NM 1.1.93 landing ) on a spare krillin, and then configure the WiFi connection the opposite way from what was suggested earlier, this time setting ipv4.method to ignore. This will validate that IPv6 connections were actually working ( vs. just not causing DNS to fail ), even though the kernels lack support for IPv6 peer addresses ( which NM 1.2x always uses when adding IPv6 addresses, and which in turn always fail ).

Second, there are three cases currently that I can see where NM adds IPv6 addresses ( note, there are more possible, but these are the most common when method=auto ):

 - NM always configures an IPv6 Link Local address with the peer set to the "Any" address ( all zeros or :: ); the code states that a LL address is always added for any method other than "ignore"

 - a VPN configuration specifies an IPv6 address ( in this case, our VPN always sets the peer with the interface identifier part of the address set to ::1/64 )

 - an address is received via RDP ( router discovery protocol ); this is the case that happens when you associate to an AP that supports IPv6 and method=auto

As I can't reproduce the last case ( my ISP doesn't support IPv6 yet ), it would be great if you could run NetworkManager in the foreground with an env var specified which triggers netlink debug messages. This will let me see what peer_address is being set when you connect to the work AP. If you want to send me these traces privately, that might be best. Please use a recent rc-proposed image.

Here are the steps:

1. From system-settings::wifi, "forget" the access point

2. adb into the phone, and start an interactive sudo session:

# sudo -i

3. Stop NM

# stop network-manager

4. Run the script command to capture output:

# script nm-nl.out

5. Start NetworkManager in the foreground like this:

# NLCB=debug /usr/sbin/NetworkManager -n -d --log-level=debug --log-domains=ip6,wifi,vpn

6. From the indicator, connect to the AP

7. Leave the connection up and NM running for 5-10m

8. Kill NM via Ctrl-C

9. Exit the script via Ctrl-D

10. Grab the script output and send it to me direct, or put on a fileshare

Please let me know if you have any questions, or want to discuss before you try any of this.

Andrea Bernabei (faenil) wrote :

@Tony:

that's great, thanks for all the info :)

I'm on a sprint this week, I will try it first thing when I get back to the office!

John McAleely (john.mcaleely) wrote :

I tried to reproduce the original error.

I flashed my vegetahd fresh (--bootstrap) to ubuntu-touch/rc-proposed/bq-aquaris.en, #366

I walked through the wizard, and added it to the Canonical wifi (Canonical-2.4GHz-g)

I performed some random google searches, and confirmed all was well. Over a period of ~10 mins I left the device to idle, powered it on/off with the power button (just sleep of course), and tried to observe the 'network error' reported.

Eventually, while on the lock screen, I woke the device to see the wifi password prompt - the device appeared to have forgotten the password for the canonical wifi. As we know, this is a failure mode that can mean other things.

I cancelled the dialog, and logged in to the device. Using the indicator menu, I pulled down the network panel. Observing that the system appeared to be on cellular data (2G, note that both SIMs are loaded and registered), I selected the Canonical-2.4GHz-g wifi network and observed it turned green, and the indicator switched to the wifi signal strength view.

Dismissing the indicator menu, I opened the browser and navigated to a page. The browser displayed 'Network Error', 'It appears you are having trouble viewing: http://microsoft.com/. Please check your network settings and try refreshing the page. <Refresh Page>'

Pressing refresh page, or navigating to other sites, produced the same error. After a sleep cycle and unlock, the <refresh page> button started working - ie the device healed from whatever state it was in, while I typed this up.

John McAleely (john.mcaleely) wrote :

In attempting to reproduce #27 again, I have now seen a very similar sequence, but no password prompt for the wifi was shown, and the device has appeared (from the indicators) to have been on the wifi throughout.

However, on attempting to access a website, the 'network error' dialog appeared.

John McAleely (john.mcaleely) wrote :

I have captured the logs requested in #25.

Note that during the session, the problem with 'network error' occurred after a few mins. Unlike in #28, a sleep (power button) cycle did not recover the issue, but I noted that I was connected via ADB, perhaps changing the power behaviour.

Therefore toward the end of the log, I used the network indicator pull down to disable and then re-enable wifi. On rejoining the wifi (which it did automatically), the browser could successfully refresh a page.

I therefore currently speculate that power cycling the wifi chipset clears this issue.

I terminated the log gathering shortly after this event.

Andrea Bernabei (faenil) wrote :

@John, not sure if I've noted this down already in the previous comment, but in case I haven't:

I usually just tap on the wifi network I'm connected to, to trigger reconnection.
That fixes the issue for me (until it stops working again after a few minutes, or less than a minute in some cases I think).

Can you confirm your device behaves the same?

Tony Espy (awe) wrote :

@Andrea

Can you provide the info requested on comment #25 too?

Aron Xu (happyaron) on 2016-12-20
tags: added: nm-touch
Changed in canonical-devices-system-image:
assignee: John McAleely (john.mcaleely) → nobody