Access points' "PropertiesChanged" dbus signals freeze UI on mobile devices

Bug #1480877 reported by Andrea Bernabei
140
This bug affects 32 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Fix Released
High
John McAleely
dbus-cpp (Ubuntu)
Fix Released
High
Thomas Voß
dbus-cpp (Ubuntu RTM)
Fix Released
Undecided
Unassigned
location-service (Ubuntu RTM)
Fix Released
High
Thomas Voß
network-manager (Ubuntu)
Incomplete
High
Tony Espy
network-manager (Ubuntu RTM)
Incomplete
High
Tony Espy
qtbase-opensource-src (Ubuntu)
Fix Released
High
Tony Espy
qtbase-opensource-src (Ubuntu RTM)
Fix Released
High
Tony Espy
unity8 (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

Krillin, rc-proposed, r83

DESCRIPTION:
I've been trying to track down the cause of the occasional UI freezes on my Krillin device, and I noticed that whenever the UI freezes for 2-4 seconds, I get a burst of "PropertiesChanged" signals in dbus-monitor

Here's a log of what's shown in dbus-monitor: http://pastebin.ubuntu.com/11992322/

I'd guess the problem is in the code that actually catches the signals and acts accordingly.

HOW TO REPRODUCE:
1) Move to a place where many wifi hotspots are available
2) Connect the device via USB and run "phablet-shell" and then "dbus-monitor"
3) Use the device while keeping an eye on dbus-monitor output

Related branches

CVE References

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in network-manager (Ubuntu):
status: New → Confirmed
Revision history for this message
royden (ryts) wrote :
Revision history for this message
Andrea Bernabei (faenil) wrote :

It seems I hit an evolved version of this bug.

My phone is right now unusable, dbus-daemon taking 99% of the cpu, and dbus-monitor shows about 30 signals like this:

signal sender=:1.0 -> dest=(null destination) serial=78512 path=/com/ubuntu/Upstart; interface=com.ubuntu.Upstart0_6; member=EventEmitted
   string "dbus"
   array [
      string "SIGNAL=PropertiesChanged"
      string "BUS=system"
      string "INTERFACE=org.freedesktop.NetworkManager.AccessPoint"
      string "OBJPATH=/org/freedesktop/NetworkManager/AccessPoint/2281"
      string "SENDER=:1.8"
   ]

Every 2-3 seconds I get a new burst of 30-ish signals.

The phone is so slowed down it takes about 15-20 seconds (and some patience) to be able to just swipe away the lockscreen.

I'll try not to reboot the device right now so that we can keep debugging it and hopefully get rid of this beast :)

so if there's anyone in the audience who'd like me to get some more useful logs, please shout! :)

Revision history for this message
Andrea Bernabei (faenil) wrote :

syslog doesn't show anything interesting (at least not in the last 1k-ish lines)

Revision history for this message
Tony Espy (awe) wrote :

@Andrea

A few questions/requests....

1. Please attach your syslog to the bug

2. Are you connected to an access point when the problem occurs?

3. Have you made the phone write-able and installed any packages via apt-get and/or modified the filesystem in any way?

4. Can you recall any changes that occurred between your original report and the new level of problems as mentioned in comment #3?

The current version of NM in our images has a bug which prevents periodic scanning of WiFi access points, so I'm a bit wary to pin this on NM. Also, the DBus message you point out in comment #3 appears to be from Upstart, not NM.

Changed in network-manager (Ubuntu):
status: Confirmed → Incomplete
assignee: nobody → Andrea Bernabei (faenil)
Revision history for this message
Andrea Bernabei (faenil) wrote :
  • syslog Edit (19.4 MiB, application/octet-stream)

1) syslog attached

2) Yes it thinks it is connected to an access point which doesn't actually exist in the office (it's only available at home). The rest of the access points detected in the network indicator are up to date though. To sum up: it thinks it's connected to an access point that doesn't actually exist where I am now, but the rest of the networks it shows are correct.

3) yes

4) I can't think of any change

Revision history for this message
Andrea Bernabei (faenil) wrote :

just a hint while reading syslog, I think the current problem started yesterday, not before lunch.

Revision history for this message
Andrea Bernabei (faenil) wrote :

after more investigation with awe on IRC, I can provide the following info:

here are the packages I have in the apt cache:
http://pastebin.ubuntu.com/12021300/
nothing that could cause this issue, as far as I can tell

More info about the problem:
"nmcli d" reports mobile data as CONNECTED, wifi as
wlan0 wifi connecting (getting IP configuration) Morino

Morino is an AP that doesn't exist in the place where I am right now, it is an access point which was created via Win8.1 at home (infrastructure mode) in an attempt to share the internet connection of a laptop running Win8.

So, it seems NM is stuck trying to connect to an access point which doesn't currently exist.

The network indicator shows the mobile data icon, yet the wifi network lists shows "Morino" (the non-existing AP) as green, which means it's connected (or connecting as well, maybe?)

Revision history for this message
Andrea Bernabei (faenil) wrote :

after reboot, the problem seems to gone (at least for the moment)

Revision history for this message
Andrea Bernabei (faenil) wrote :

"Morino" AP was created using instructions very similar to what's shown in
http://www.melbpc.org.au/pcupdate/3006/3006article8.htm

I'm still not sure the fact that Morino wasn't a "physical AP" makes a difference...but who knows :) better safe than sorry

Tony Espy (awe)
Changed in indicator-network (Ubuntu):
status: New → Incomplete
Revision history for this message
ingo (ingo-randolf) wrote :

hitting the same issue once and a while.
it happened again yesterday. i could record some traffic on dbus:

this pastebin shows the log with high performance impact.
at the end of the log things are getting normal again:
http://pastebin.com/v55umuBF

this pastebin in comparison shows the traffic after things settled and the phone was usable again.
http://pastebin.com/CrQ0Z3K2

both recordings are about 1 minute.

maybe this helps to track it down...

Revision history for this message
Tony Espy (awe) wrote :

@ingo

Thanks for the comment and DBus logs.

In the original bug report, NetworkManager's WiFi device was stuck connecting to an access point which no longer existed. Also as noted, the access point in question was a softAP configured on a machine running Windows8.1. Rebooting, solved his issue.

In order to make further progress on this bug, what we really need is a concrete set of steps to reproduce the issue.

Revision history for this message
Andrea Bernabei (faenil) wrote :

@awe I'm sorry if the two things are mixing up :(

I should probably file another bug for the connection loop...

When I filed this bug the softAP didn't exist yet, so that can't have been the cause.

We have two (related) problems here:
1) when you're in an area with many access points, many PropertiesChanged signals are fired and they slow down the device for 2-3 seconds

2) when NM gets stuck connecting to a network which doesn't exist, PropertiesChanged signals are fired in an endless loop (well, it pauses 2-3 secs every time before firing all signals again)

Revision history for this message
Tony Espy (awe) wrote :

@Andrea

As mentioned on IRC, we still need more information in order to determine what the real problem(s) are.

I did a little more investigation regarding the DBus events you're seeing on the session bus. By default, an upstart bridge is started on the session bus which forwards all signals from the system bus to the session bus. See /usr/share/upstart/sessions/upstart-dbus-system-bridge.conf for details of this job.

Problem 1:

 * We need some way to quantify "many access points". Does "many" mean more than five? ten? Again, without a reproducible scenario, we can't begin to narrow this down... Taking a look at your first pastebin in your description, I only see ~40 access points involved which doesn't seem to be all that many. I routinely see ~25 in my home office.

 * Do you have a location where this problem always occur? Does the problem happen every time you're in this location? Does the condition persist until you leave this location? What happens if you reboot while at the location? What happens if you disable Wi-Fi while at this location?

 * I will attach a script which will dump some information out about NM's Wi-Fi device, including the number of access points that are currently visible to NM. The next time the problem manifests, can you please run the script and add the output to a comment?

 * This next step may break other things on your system, but it's worth a try... If you can reliably reproduce the problem, before entering the area with the excess of APs, please run the following command as the phablet user on the phone:

% stop upstart-dbus-system-bridge

 * When you're done testing, please remember to start the bridge ( replace "stop" with "start" ), or reboot the device.

 * Please note the times when the problem occurs and make sure to attach your syslog. This will allow me to look at what NM was doing at the time the problem happened. If you want, you can reduce the size of the log by using the following command:

$ grep "NetworkManager" syslog > nm.out

...and attach just the nm.out file.

 * Looking at your syslog from comment #6, I see a couple of notable things:

   - at some point, you appear to be connecting to a Wi-Fi WPA-Enterprise network ( eduroam ). Does this correlate to when you see the problem?

   - I also note that around 11:24 on Aug 6, your device appears to have rapidly roamed between a couple of the office APs in Bluefin. This could also be a potential source of the problem.

 * Finally, before doing any re-test, can you ensure you have the latest updates installed? We just released a new version of NM which fixes a problem with scanning. It's probably not related, but you never know...

Problem 2:

Please try to reproduce and file a separate bug for the issue where NM gets stuck trying to connect to the Windows8.1 AP.

Revision history for this message
Tony Espy (awe) wrote :
Revision history for this message
Andrea Bernabei (faenil) wrote :

I'll try to address some questions here, more answers will come in the next comment, sorry about that.

It seems to always happen at least at the office (BlueFin), where your script reports 26 APs.

Yes the problem disappears if I disable WiFi.

After reboot, the problem disappears, at least at the beginning, the number of PropertiesChanged I see in dbus-monitor is much lower. So it seems to be about something which gets worse with time. Even if I didn't move, I'm seeing 1-4 PropertiesChanged at a time instead of 30-ish.

I've been waiting for 10minutes but I still don't have the problem, after reboot.

I'll try disabling the dbus bridge as soon as I get the problem again, which is just a matter of time I guess.

The eduroam is most likely not related.

Yes I have the latest update installed, r94 rc-proposed, krillin.

Revision history for this message
Andrea Bernabei (faenil) wrote :

I went from one side of the office to the other to trigger AP roaming.

That didn't help reproduce the issue, though.

Now I usually get 1 PropertiesChanged for the AP I'm connected to, and from time to time I get 10-ish PropertiesChanged for other APs

Revision history for this message
Andrea Bernabei (faenil) wrote :

One detail I remember:
before reboot (i.e. last time I had this issue) WiFi was ON, but the phone wasn't connected to the WiFi network which it should autoconnect to, it was using mobile data.

Because of https://bugs.launchpad.net/ubuntu/+source/ubuntu-system-settings/+bug/1480864
I can't disconnect from a WiFi network without switching the WiFi off.

Tomorrow I'll try deleting the office WiFi from the known networks (so that I can keep WiFi enabled without being necessarily connected to an AP) and see if that helps reproduce the bug.

Revision history for this message
Andrea Bernabei (faenil) wrote :

I have the bug again.
I'm on rc-proposed r102, krillin, using network manager 0.9.10.0-4ubuntu15.1.7

Some stats:
Every 10-15 seconds I get a burst of 50 PropertiesChanged which lock my phone for about 4 seconds.

So, every 15 seconds, 4 seconds of frozen UI.

"nmcli d" reports WiFi as connected to the office network (as it should be).

this is the "grep NetworkManager /var/log/syslog" since I got to the office (I don't remember having this issue at home this morning)
http://pastebin.ubuntu.com/12123687/

this is the output of "dbus-monitor --system" in one of the 4 seconds freezes.
http://pastebin.ubuntu.com/12123712/

the output of the wifi script is
http://pastebin.ubuntu.com/12123727/

After getting those logs I disabled WiFi and the freezes disappeared (and the flooding in dbus-monitor with them)

Then I reenabled WiFi, it took something like 1 minute before the phone stopped being laggy like hell (while the wifi list was populating in the indicator panel)....AND THE PROBLEM IS BACK :)

Revision history for this message
Andrea Bernabei (faenil) wrote :

PS even after reenabling wifi and finding the office network, the phone did *not* reconnect to it, but that's probably another bug :)

Revision history for this message
Tony Espy (awe) wrote :

@Andrea

Thanks for the updates.

So our theory that this is related to a large number of APs doesn't pan out, as you're able to reproduce this at the office with only 25 access points involved ( per the output of my script ).

Also, I'm pretty sure the problem is related to WiFi roaming. It appears when you first enable WiFi at Bluefin, goes away when WiFi is disabled, and re-appears when WiFi is re-enabled while at Bluefin.

Looking at your syslog output, I see your phone associate with the office network around 10:40:05, however I also see a suspicious message right before the connection activation succeeds:

Aug 19 10:40:05 ubuntu-phablet NetworkManager[1417]: <info> (wlan0): roamed from BSSID 18:33:9D:F8:AA:B0 (Canonical-2.4GHz-g) to (none) ((none))

This doesn't appear to affect your connection though, as there's no subsequent disconnect, and I also see multiple DHCP renewals on wlan0 @ 10:48, 10:56, 11:05, and 11:14.

As for dbus-monitor output, we're really more concerned with the traffic on the session bus. Can you please try disabling the upstart bridge as mentioned in comment #14?

> * This next step may break other things on your system, but it's worth a try... If you can reliably reproduce the problem,
> before entering the area with the excess of APs, please run the following command as the phablet user on the phone:
>
> % stop upstart-dbus-system-bridge
>
> * When you're done testing, please remember to start the bridge ( replace "stop" with "start" ), or reboot the device.

Please note whether or not this makes the stuttering go away.

Can you also try disabling indicator-network ( as phablet, just run 'stop indicator-network ) and re-test too?

Finally, do you have another device ( ie. an arale or mako ) you could compare results against?

I'm actually working on a related bug for BQ, and may have another test package for you to try, but I'd like to see whether disabling the bridge and/or indicator-network has any effect on the issue first.

Revision history for this message
Andrea Bernabei (faenil) wrote :

More details on the situation in the morning:

when I got to the office I waited more than 1h for it to autoconnect to WiFi, but it didn't.
The WiFi connection at 10:40 was forced by me, I went to settings -> Wifi, and tapped on the office network.

About the bridge, can I disable that after I have the bug? I don't have a way to reproduce this bug on demand, so the only way to make sure it's there is to see the stuttering. The alternative is disable the bridge and then check the number of PropertiesChanged on the system bus, to know if the bug is being triggered ...

About the session bus, the output is basically a copy of what's in the system bus, as previous logs show.

Revision history for this message
Andrea Bernabei (faenil) wrote :

New info about some debugging with "awe":

rc-proposed, r102, krillin, nm 0.9.10.0-4ubuntu15.1.7

----I CLOSED ALL APPS, NOTHING IN THE TASK SWITCHER----

After getting home I still have the bug, I enabled and disabled WiFi a couple of times, the problem comes back every time WiFi is enabled.

The frequency, however, is different: I get 3-4secs freezes every 2 mins, which awe says is likely the WiFi scanning interval.

AFTER DISABLING THE DBUS SESSION BRIDGE --> no change

AFTER DISABLING THE NETWORK INDICATOR --> no change

there's still spamming in the system dbus, and the same 3-4secs freeze (maybe a bit less than with bridge and network indicator enabled though)

Revision history for this message
Tony Espy (awe) wrote :

@Andrea

Thanks for the updates. Your stopping of the bridge and indicator are useful clues.

Again, we still have no proof that there's spamming of the bus yet. Is there a lot of traffic, yes? But as pointed out before, the bus is meant to handle lots of traffic. I'm not claiming the bug isn't related to bus traffic, it's just that we haven't managed to pinpoint a smoking gun yet.

It'll be interesting to see if you get the same set of JobAdded/Removed messages on your bus when enabling/disabling WiFi after you re-flash.

Revision history for this message
Andrea Bernabei (faenil) wrote :

@Tony

hey :)

Yeah, I reflashed yesterday (without formatting the user partition).

I'll keep you posted, I think the bug is still there, as I'm seeing 0.5-1secs freezes and bursts of PropertiesChanged signals on the bus (but not as many as before reflashing).

It usually gets worse over time, so let's see how it will behave tomorrow or in a couple of days.

I'm quite sure the problem is still there, just a matter of waiting for it to get annoying enough.

Revision history for this message
Andrea Bernabei (faenil) wrote :

At the moment, for instance,

the frequency is 1sec UI freeze (with the usual PropertiesChanged burst in dbus monitor) every 1 minute (which I guess is the current WiFi scanning interval.

Revision history for this message
Andrea Bernabei (faenil) wrote :

I forgot one detail, the phone is currently not connected to WiFi (it sees office network, but it doesn't auto connect to it). WiFi is enabled, of course.

Revision history for this message
Andrea Bernabei (faenil) wrote :

I'm quite sure I had the same bug this weekend, the phone was really unusable, kept freezing every few seconds, I used the terminal app to see that it was the usual burst of PropertiesChanged.

I don't have stats and logs as I was out, and the phone died before I got home.
I'll have to wait a couple of days more for the phone to become annoyingly slow again.

Revision history for this message
Pat McGowan (pat-mcgowan) wrote :

I hit this issue while traveling in the states and on arrival to Bluefin. The issue went away as soon as I associated to the AP in the office. I have tried forgetting the network and reproducing it and so far I cannot.

Changed in canonical-devices-system-image:
assignee: nobody → John McAleely (john.mcaleely)
importance: Undecided → High
milestone: none → ww40-2015
status: New → Confirmed
tags: added: connectivity
Revision history for this message
Michael Terry (mterry) wrote :

You folks mention the session bus. But when I've seen this bug (like my notes from bug 1480844), it's been the system dbus-daemon that takes 100% CPU.

I attached gdb to it once when this was happening. We seemed to be spending a chunk of time in get_recipients_from_list(), which iterates over every system DBus connection and sees if it is listening for a given signal, in order to pass it on if so. I don't *know* that the actual majority of time was in that function yet, I'm still digging there.

Revision history for this message
Michael Terry (mterry) wrote :

So I added some debugging output to dbus. When it peaked, it spent over a third of the CPU entirely in get_recipients_from_list(). (This is dbus checking if an incoming signal matches any of the rules added by org.freedesktop.DBus.AddMatch -- i.e. signal subscriptions. It does this for every rule, for every connection.)

So we can reduce the time spent there by being more efficient, by reducing the number of subscriptions, or reducing the number of signals.

Looking at which rules it's checking, the majority are subscriptions on NetworkManager. One per access point, one per device, from multiple connections. Could connectivity-cpp be a bit too subscription-greedy?

I'm not sure that get_recipients_from_list is the only (or worst) culprit here. Just something I noticed.

Revision history for this message
Jonas G. Drange (jonas-drange) wrote :

Hotspot management code that subscribes to NM events [1] was recently moved from System Settings into indicator-network. The code is quite comprehensive whenever you have previously created a hotspot. Those affected by the bug should 1) see if they have previously created a hotspot and 2) see if by removing the hotspot, the issue is resolved (sudo rm /etc/NetworkManager/system-connections/<hotspot name> && sudo reboot).

This wasn't an issue when the code was bound to System Setting's life cycle, but it may be an issue for the indicator.

[1] http://bazaar.launchpad.net/~indicator-applet-developers/indicator-network/trunk.15.10/view/head:/src/indicator/nmofono/hotspot-manager.cpp#L690 (This code is in vivid+overlay too).

Revision history for this message
Andrea Bernabei (faenil) wrote :

I have never created a WiFi hotspot yet

Revision history for this message
Jonas G. Drange (jonas-drange) wrote :

Right, after deleting my hotspot I still saw lockups while moving through London.

Revision history for this message
Michael Terry (mterry) wrote :

I've been seeing this for a while, before I even updated to a hotspot-capable image.

Revision history for this message
Tony Espy (awe) wrote :

I installed bustle on my krillin this afternoon to try and glean some more information from the system DBus. I few observations...

1. The top signal on the system bus is:

org.freedesktop.NetworkManager.AccessPoint.PropertiesChanged

2. Most of these seem to be for a specific AP ( instance 0 ), which I'm pretty sure is the current connected AP. The associated property is 'Strength'. This makes sense, as NM constantly reports the varying signal strength of the AP. These are happening all the time, and I typically see them in bursts of two or three signals.

3. As scans typically occur anywhere from 20s to 2m, when they do occur, we see a flurry of AccessPoint.PropertiesChanged signals representing the changing signals strengths of each access point. There also are AccessPoint.PropertiesChanged signals for the 'LastSeen' property. I'm pretty sure this property was added specifically for the LocationService. Note, ideally this signal would be bundled with the associated 'Strength' signal, however due to the way NM's underlying object/property system was architected, this doesn't seem to be an easy change at first glance. Note, these LastSeen signals are consumed by the LocationService's 'connectivity API' ( which I'm pretty sure uses dbus-cpp ).

So... we still need to quantify how many APs trigger the problem, and if it's being triggered whenever scans occur. In comment #29, Pat mentioned that it happened when he initially arrived in Bluefin, but went away once he connected to an AP. As background scanning had recently been restored in NM, I would've assumed he would continue to see the issue each time a scan was occurred, even when connected.

It also would be interesting to see whether disabling location service has any effect on the problem ( if we can reproduce reliably of course ).

Perhaps I can dummy up wpa_supplicant to push an artificial number of APs to NM and see if I can trigger the problem locally.

Revision history for this message
Tony Espy (awe) wrote :

One other interesting tidbit. On my krillin, on average there are roughly ~200 processes running. Looks like almost half (87) of them are connected to the system bus. You can use the following command to see how many sockets exist for any dbus processes, and then you need to figure out which one is the system bus daemon ( usually the one with the lower PID ):

sudo netstat -nap | grep dbus | grep CONNECTED | awk '{print $8}' | sort | uniq -c

Note, this compares to 55 processes on my desktop connected to the system bus.

Tony Espy (awe)
Changed in network-manager (Ubuntu):
assignee: Andrea Bernabei (faenil) → Tony Espy (awe)
Revision history for this message
Tony Espy (awe) wrote :

I managed to hack up wpa_supplicant such that it artificially adds a hard-coded set of non-existent APs every scan. It's not a pretty patch, but it served the purpose of ramping up the DBus traffic generated by NM significantly on each scan.

See: https://code.launchpad.net/~awe/wpasupplicant/add-fake-aps

I was able to get the UI ( the app switcher animation more specifically ) to freeze up for anywhere from 4-6s whenever a Wi-Fi scan occurred. I currently have the patched wpa_supplicant adding 125 APs to the list, which results in a total of 175 APs in the scan results seen in my loft.

I also looked a bit more closely at the touch-specific change which introduced NM's AccessPoint 'LastSeen' property. This was made in response to a private bug that was reported during the development cycle of the HERE code for the phone. The original complaint was about access points becoming stale in the scan results, resulting in skewed locations. wpa_supplicant's AccessPoint object exposes an 'age' property, but it's not exposed by NetworkManager. So a 'LastSeen' property was added to NM's AccessPoint object, and it gets updated by bss_updated_cb() in nm-device-wifi.c. It looks like for every scan_result, bss_updated signal is being generated for every AccessPoint in the scan results ( note, NM doesn't even look at which property changed... so it's hard to tell which property is triggering this on every scan; most likely it's Strength which is randomly assigned in my hack ).

The original HERE bug stated that it queries the available networks, so we'll need to check if it's truly reliant on these signals or not. It might be possible that addition of the property to query results is enough ).

In the meanwhile, I've built a test version of network-manager with the property-changed notifications for 'LastSeen' removed. It can be found in my PPA: https://launchpad.net/~awe/+archive/ubuntu/ppa

A quick test shows that this change makes the UI stuttering much less severe ( maybe <= 1s stutter ).

Note, NM is basically propagating the same set of signals as wpa_supplicant sends. In theory, this shouldn't be enough to saturate the bus. If it is, then we're really pushing the limits of the system.

One other theory is that we have applications and/or subsystems that are using overly broad DBus match rules. For more details on this, please see:

https://www.collabora.com/about-us/blog/2014/10/06/improving-the-security-of-d-bus/

It *might* be possible to run dbus-1.10 from wily, which includes some debug tools which would allow us to more easily examine the match rules currently in use on our images.

Revision history for this message
Tony Espy (awe) wrote :

A bit more info...

After a deeper look at some DBus traces on a vanilla krillin OTA6 image, I noticed a few more things.

After a ScanDone signal gets sent by NM, you typically see 3-5 PropertiesChanged signals for new APs that include all of the properties. Then, you see a PropertiesChanged from Device.Wireless which includes an array of the current APs. Then you usually see a PropertiesChanged signal for every AP except current ( AP=0 ), with the property 'LastSeen'. These signals all typically have the same value for last-seen, although if the list is long, there might be variation of maybe +1-2 across all the signals.

What's unusual, is I'm then seeing what looks like a second set of PropertiesChanged signals for all of the current APs.

Digging deeper in the trace, it turns out wpa_supplicant actually generates two PropertiesChanged signals every time an object changes. It first uses the new ( more correct ) 'org.freedesktop.DBus.Properties' interface, then sends the signal *again* using the deprecated 'fi.w1.wpa_supplicant1.Interface'! I think this is what in turn triggers the second signal for each AP from NM. I haven't proven this 100% yet, but it certainly explains the behavior of NM.

I'll continue to test.

I also had a discussion with one of our location service engineers, and we think that it might be possible to get rid of the 'LastSeen' PropertiesChanged signals altogether and just use the Device.Wireless 'AccessPoints' property, which gets updated after every scan completes.

Revision history for this message
Tony Espy (awe) wrote :

And more detail...

So looks like the double wpa_supplicant signals are harmless ( other than wasting bandwidth ) as the NM's supplicant manager uses the correct 'org.freedesktop.DBus.Properties' interface when it creates the proxy for each new BSS.

It looks like a change introduced by the patch '0002-wifi-cull-the-scan-list-before-signalling-ScanDone-b' may be the culprit at the NM level. The intention was to ensure that the 'last-seen' property was updated for all APs in the scan list before the 'ScanDone' signal is generated. So, in nm-device-wifi.c::scan_done_cb(), a schedule_scanlist_call() was turned into a direct call to cull_scan_list(). Also, the cull_scan_list() function was changed to update last_seen for every AP in the current list.

There are a few other places though that trigger update_last_seen() calls. Both bss_new_cb() and bss_updated_cb() call schedule_scanlist_cull() before returning. Also bss_updated_cb() also calls update_last_seen() directly!

schedule_scanlist_cull() guards against duplicate scheduling, and uses a 4s timer to schedule the next call to cull_scan_list().

Note, update_last_seen() was modified to guard against setting the property if the value hasn't changed.

So... the original code would get a scan_done_cb() and schedule a scan_list cull. Next it would get a series of new and/or updated bss callbacks which each would schedule their own cull. You see where I'm going with this... In theory, all of these would just result in a single scan_list cull which didn't even touch the last_seen ( it was only updated by bss_updated ). With the current code, at least two calls to scan_list_cull() occur after a scan. Also, it's possible that a direct modification of last-seen in bss_updated_cb () could also contribute some extra signals depending on timing...

I *think*, the correct thing to do is to leave scan_done_cb() as the sole place that updates last_seen. This will prevent duplicate signals for the time being.

It still remains to be seen whether we can get rid 'last-seen' altogether.

Tony Espy (awe)
Changed in location-service (Ubuntu):
status: New → Incomplete
Changed in network-manager (Ubuntu):
status: Incomplete → In Progress
Tony Espy (awe)
no longer affects: indicator-network (Ubuntu RTM)
Changed in network-manager (Ubuntu RTM):
status: New → In Progress
assignee: nobody → Tony Espy (awe)
Changed in network-manager (Ubuntu):
importance: Undecided → High
Changed in network-manager (Ubuntu RTM):
importance: Undecided → High
Revision history for this message
Michael Terry (mterry) wrote :

I just tried Tony's lp:~phablet-team/network-manager/lp1480877-wifi-rm-dup-scan-signals branch and did the same thing I've done before to trigger this bug: walk around my neighborhood with Terminal open, running top.

To be fair, I was testing rc-proposed tip + Tony's branch, whereas I previously tested with OTA6. So there's more than one change here. But I haven't heard of related changes in this area, so I will attribute all improvements to Tony's branch.

Before this patch, I was seeing NM spike to 30-50% CPU while scanning and causing a matching spike in dbus-daemon to 100% for a while (at least 3s, sometimes a fair bit longer).

With this patch, NM stayed at ~15% when scanning, and dbus-daemon never went above 7% (usually at 4%). A massive improvement.

From a UX perspective, I still saw UI stuttering. Seemingly without the CPU-load justification to warrant it. But I subjectively don't think it was as bad or long-lasting as before. And top doesn't lie. The CPU was way less loaded with this patch than without it.

Revision history for this message
Tony Espy (awe) wrote :

Note, one other observation noticed this week. If the HERE Maps application is running and WiFi is enabled, it triggers scans every 4-5s.

Revision history for this message
Andrea Bernabei (faenil) wrote :

Thanks for that note,
I was just wondering how it could work with 1-2mins scan interval :)

tags: added: hotfix
Changed in canonical-devices-system-image:
status: Confirmed → In Progress
Tony Espy (awe)
Changed in network-manager (Ubuntu RTM):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package network-manager - 1.0.4-0ubuntu4

---------------
network-manager (1.0.4-0ubuntu4) wily; urgency=medium

  [ Mathieu Trudel-Lapierre ]
  * d/p/0001-wwan-add-support-for-using-oFono-as-a-modem-manager.patch:
    Pre-initialize reason for ofono state changes to NULL; this avoids getting
    to nm_device_set_state with an invalid reason string. (LP: #1499906)
  * d/p/0001-wwan-add-support-for-using-oFono-as-a-modem-manager.patch:
    cleanup ip_string_to_network_address() to work as its NMModemBroadband
    version, and start setting NM_MODEM_IP4_METHOD to _STATIC so IP, prefix,
    gateway and DNS get properly applied on an oFono modem as it comes up.
  * debian/NetworkManager.conf: re-enable the ofono plugin (fixes a regression).
  * debian/patches/add_ofono_settings_support.patch: don't link against
    libnm-glib as this would cause type duplication and crashes. (LP: #1499904)
  * debian/patches/lp1496434_warn_on_invalid_new_link.patch: don't outright
    crash by dereferencing a NULL pointer if a device can't get created by its
    NMDeviceFactory; we should warn though, as we want to know about these
    failures; but no need to make all of NM sad. (LP: #1496434)

  [ Tony Espy ]
  * d/p/0002-wifi-cull-the-scan-list-before-signalling-ScanDone-b.patch:
    Fix duplicate 'LastSeen' PropertiesChanged signals being generated
    after every scan. The 'Last-Seen' property is now only updated when
    a scan finishes, and schedule_scanlist_cull is no longer triggered
    new_bss_cb or updated_bss_cb. (LP: #1480877)
  * d/p/add_ofono_settings_support.patch: Fix memory leaks in plugin.
  * d/p/add_ofono_settings_support.patch: Remove unused plugin DBus code.
  * d/p/CVE-2015-1322.patch: Drop CVE which applied to removed DBus code
    in ofono settings plugin.

 -- Mathieu Trudel-Lapierre <email address hidden> Sat, 26 Sep 2015 00:02:02 -0400

Changed in network-manager (Ubuntu):
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
status: In Progress → Fix Committed
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

This bug was fixed in the package network-manager 0.9.10.0-4ubuntu15.1.8 in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/stable-phone-overlay

---------------

network-manager (0.9.10.0-4ubuntu15.1.8) vivid; urgency=medium

  * d/p/0002-wifi-cull-the-scan-list-before-signalling-ScanDone-b.patch:
    Fix duplicate 'LastSeen' PropertiesChanged signals being emitted
    after scanning (LP: #1480877).
  * d/p/fix-ofono-plugin-leaks.patch: fix memory leaks in the ofono
    settings plugin.
  * d/p/rm-scofono-plugin-dbus.patch: Remove unused plugin DBus code.
  * d/p/CVE-2015-1322.patch: Drop CVE which applied to removed DBus code
    in ofono settings plugin.

 -- Tony Espy <email address hidden> Fri, 02 Oct 2015 10:54:24 -0400

Changed in network-manager (Ubuntu RTM):
status: Fix Committed → Fix Released
Revision history for this message
Andrea Bernabei (faenil) wrote :

Unfortunately, the effect on the UI is still the same.

3-4 seconds UI freezes while dbus-daemon is spammed with PropertiesChanged signals

Network-manager installed:
0.9.10.0-4ubuntu15.1.8

no longer affects: location-service (Ubuntu)
Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released
Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

I am still seeing this with latest rc-proposed build on arale:
current build number: 144
device name: arale
channel: ubuntu-touch/rc-proposed/meizu.en

Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

I am reopening this report given that previous fix didn't entirely fixed the issue.

Changed in canonical-devices-system-image:
milestone: ww40-2015 → ww46-2015
status: Fix Released → Confirmed
Changed in network-manager (Ubuntu):
status: Fix Released → Confirmed
Revision history for this message
Michael Terry (mterry) wrote :

I was about to do the same. :) I still see this issue, just reproduced it in fact. The patch that went in seems to sometimes improve things, but overall this is very easily reproducable still.

Revision history for this message
Andrea Bernabei (faenil) wrote :

krillin, rc-proposed, r157

network-manager 0.9.10.0-4ubuntu15.1.8

No improvements so far, the phone still locks consistently when WiFi is enabled.

While using software that force a new wifi scan, such as the location service, the phone is completely unusable.

Tony Espy (awe)
Changed in network-manager (Ubuntu RTM):
status: Fix Released → Incomplete
Changed in network-manager (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Tony Espy (awe) wrote :
Download full text (4.0 KiB)

One of the frustrating parts about this bug is the incomplete description, and the fact that it's been nigh impossible to nail down a set of concrete steps to reproduce...

Also, the bug description contends that NM 'PropertiesChanged' signals are causing the UI freezes, however we have no idea why...

We haven't been thinking enough about what's listening for the signals. "If a tree falls in the woods and nobody hears it?" The more listeners for a particular DBus message, the more work the daemon has to due ( and the more context switches involved ).

Also, Mike Terry had mentioned in one of this comments that he'd seen the dbus daemon was spending most of it's time in it's match rules logic. Turns out that a new method had been added by upstream to the dbus Stats interface called 'GetAllMatchRules'. I'd been interested in this, for awhile, but finally got time to work on it.

Two other points before I dive into the match rules:

As of the latest rc-proposed for krillin ( # ), when the WiFi is enabled and the HERE app, Nearby scope, or Today scope are displayed, the location services increases the rate of WiFi scanning to once every 12-15s ( usually it's around 2m if connected to an AP ). When I decided to look at the incoming traffic to NM, I noticed that everytime a scan occurs, three different processes all seem to query the same set of NM AccessPoint objects. The clients all use 'Get' instead of 'GetAll' for the properties, and again it looks like each process is generating the same set of method requests. The processes in question are:

 - posclientd
 - slpgwd
 - ubuntu-location-serviced

The amount of traffic isn't a huge impact, but it could be made more efficient.

Back to the who's listening to the signals approach...

First, I got lucky and was able to run dbus 1.10.0-1ubuntu1 from wily on my krillin on first try. This krillin was freshly flashed to the latest rc-proposed ( #162 ).

When I ran the example get-all-match.py script ( see attached ) for the system bus ( use sudo and specify --system ), I was blown away that the same three location services processes previously mentioned above, each had 512 match rules for NM defined! 509 of the 512 were also flagged as warnings as no signal name was specified. It looks like these location processes are adding watches ( sometimes duplicates ) for every single AccessPoint object that ever gets created by NM. It also looks like these watches are never being cleaned up.

So... I tried the same experiment on a mako, also freshly flashed ( rc-proposed, #150 ). Turns out I got lucky with krillin, mako didn't like the new dbus, and wouldn't boot into the UI. I had re-flash, then backport the changes to the version in vivid ( 1.8 ). This verions ( ~awe3 ) is available in my PPA:

https://launchpad.net/~awe/+archive/ubuntu/ppa

When I ran this version on mako, the first thing I noticed was that the match rule counts were much lower at first. As I didn't have apps/scopes to drive the increased WiFi scanning, I triggered extra scans by hand and watched the match rule counts ...and they slowly started to grow! After an hour or so, they actually exceeded 512 rules per process!...

Read more...

Revision history for this message
Tony Espy (awe) wrote :

Here's the two patches necessary to enable the GetAllMatchRules method, as well as the patch that bumped the match rules limit to 5000 ( 20_system_conf_limit.patch ).

Revision history for this message
Tony Espy (awe) wrote :

This script queries GetAllMatchRules and generates output including warnings about less-optimal DBus match rules.

Revision history for this message
Tony Espy (awe) wrote :

Forgot to add the output of the system match rules for mako. Note at the time this was run both slpgwd and posclientd both have 616 match rules. It looks like stopping the location service ( via upstart or the indicator ) may have cleared the rules for ubuntu-location-servicesd.

Revision history for this message
Michael Terry (mterry) wrote :

FYI, to prevent those three processes from respawning, do:

sudo stop ubuntu-location-provider-here-slpgwd
sudo stop ubuntu-location-provider-here-posclientd
sudo stop ubuntu-location-service

Revision history for this message
Tony Espy (awe) wrote :

So re-testing with krillin ( rc-proposed, #162 ).

Confirmed that using a patched version of dbus 1.8, the match rule limits aren't enforced, so each of the loc processes slowly grows its number of match rules till they cap out at 5000 each.

So, WiFi enabled & connected, with the HERE app running *or* the Nearby or Today scopes currently selected.

The scan interval of the phone is controlled by time in a particular state. If "disconnected", the scan interval starts low, and eventually tops out at ~60s. Likewise, if "connected" ( to an AP ), it'll top out at 2m.

What's odd, is how location-services seems to be influencing scanning. If I bring the HERE app, or the Today or Nearby scopes to the foreground, the phone starts to do WiFi scans every 15s. If I bring another app to the foreground, this behavior stops, most of the time. I've able to get the phone into a state where the interval stays at 15s, no matter what app/scope is visible. Perhaps the location-service is losing state? Confirming scan interval is a matter or running wpa_cli and watching the events output as scan occur.

Note, to mimic normal usage, I usually start fix or six apps ( camera, system settings, clock, term, messages, music ).
Also, my confirmation of whether or not the the bug is occurring is *not* based on top, it's based on observation of the UI.

I see UI freezes that occur in sync with WiFi scans, that can last anywhere from 1-6s when:

 - using the switcher, animation ( horizontal or vertical ) freezes
 - edges stop responding
 - app scrolling freezes

I also began to see the Shutdown menu displayed every so often when I let the screen go off, and then woke via the power button.

Freezes start at around 1s when the match rules for the loc services hits ~1000. As the match rules continue to increase, the lockups do seem to last a bit longer over time. When they've maxed out, the lockups might be as long as 3-5s.

When I disabled WiFi, the UI freezes disappeared. This also reset the counts... so I let the phone run overnight and the counts rose back up again, and the freezes returned. Note, overnight all of the running apps appear to have been killed.

Finally, I left WiFi on, but instead stopped the location services process per comment #56. At this point the freezes go away almost completely ( note by disabling location services, the scan interval goes back up to 60s ). I say *almost* go away as there's still the slightest tick whenever a scan occurs, but it's orders of magnitude smaller than when the rule counts have been max'd out.

Revision history for this message
Michael Terry (mterry) wrote :

> I also began to see the Shutdown menu displayed every so often when
> I let the screen go off, and then woke via the power button.

That specific behavior is bug 1508563, whose fix just landed and will be in the next rc-proposed. It happened if unity8 was too being too slow (because the system was loaded).

That is all great news! I will confirm today in my neighborhood and jump for joy if no UI freezes occur (I had tried those steps before but was watching CPU, where spikes still happened -- which I guess is the "slightest tick" you saw in the UI -- when a scan occurs).

I'll note that xenial has dbus 1.10, so it should have GetAllMatches if anyone wants to try Tony's scripts without actually patching dbus. Just install devel-proposed instead of rc-proposed. Of course, that will change a lot of other things on your system.

Revision history for this message
Tony Espy (awe) wrote :

Re: the xenial version of dbus, you need to be careful as although I was able to get it to run on a krillin, it prevented the UI from coming up on mako. This is why I backported the changes and pushed to my PPA ( 1.8.12-1ubuntu5~awe3 ).

https://launchpad.net/~awe/+archive/ubuntu/ppa

@Andrea

I'd appreciate if you could test too. Note, to be able to verify the current number of match rules, you'd need to install dbus from my PPA. Your other choice is stopping the three location service jobs per comment #56, although this obviously will cause you to lose location services for the duration of your testing. Unfortunately, there's no way around this.

Revision history for this message
Andrea Bernabei (faenil) wrote :

@awe: great progress! well done! :)

I took a few videos to demonstrate the problem on my device
KRILLIN, RC-PROPOSED, R163
network-manager 0.9.10.0-4ubuntu15.1.8
wpa-supplicant

It was really easy to shoot videos at the right time because the problem happens exactly every 2 mins (which is the scanning interval Tony reported above)

DISCLAIMER: The problem can actually get worse than what the first video shows (if I wait one or two days more)

NO SERVICES DISABLED
https://www.youtube.com/watch?v=UvGoX4Nh9oo

SLPGWD DISABLED
https://www.youtube.com/watch?v=-Vhd0eS5Af4

SLPGWD POSCLIENTD DISABLED
https://www.youtube.com/watch?v=bTPiT2-LP2Y

SLPGWD POSCLIENTD LOCATION-SERVICE DISABLED
https://www.youtube.com/watch?v=V4SXNfJ9SqM

Then I looked for more location/here related services in the system, and I disabled them as well

SLPGWD POSCLIENTD LOCATION-SERVICE INDICATOR-LOCATION DISABLED
https://www.youtube.com/watch?v=2QuED_1UzEw

SLPGWD POSCLIENTD LOCATION-SERVICE INDICATOR-LOCATION LOCATION-SERVICE-TRUST-STORED DISABLED
https://www.youtube.com/watch?v=9QgssDKDu1Q

As can be seen from the videos, disabling SLPGWD is the change that made most of the difference, disabling POSCLIENTD also improved the situation a bit.
Disabling the rest of the services brought very little to no advantage, as far as I can tell.

However, as Tony was correctly saying, even after disabling everything, there's still a 1sec freeze every time a WiFi scan happens...and that can be quite annoying, especially when the scanning interval is much shorter than 2 mins.

Revision history for this message
Michael Terry (mterry) wrote :

Here's some test results from me. tl;dr stopping location services helps a lot, but still tiny UI stuttering.

My routine is to leave my apartment and walk down the street. This means I go from wifi to 3G. What happens on current images is:

1) As my phone notices that I'm out of wifi range, it freaks out. This is when the UI stutters (and high dbus CPU). This can last up to a minute, making the phone nigh unusable for the duration. The indicator switches to 3G only after all the stuttering stops (making me think it's finally caught up to all the dbus traffic that's been trying to get through the loaded dbus).

2) Every so often after this (15 or 30 seconds maybe?), I get a very brief, maybe 1s, UI stutter. Presumably this is during wifi scans. It goes away if I turn off wifi completely.

I don't see either of those issues when I'm on wifi. It's only going to 3G and staying on 3G that cause the problems. Again, everything points to wifi scans triggering it.

So that's the normal routine. Today I tried that routine once to confirm it was still happening. Then turned off the three services per comment 56. And did my routine again. This time the switch to 3G happened faster. I still saw some stuttering during the transition. Maybe three short bursts of stutter in quick succession. But all over in a matter of seconds, versus the minute before. And I still saw the regular-interval 1s stutter during normal scans.

But disabling locations was a huge help. And makes this bug go from a show-stopper to merely a regular annoyance.

Shall we spin off a bug for the location folks to clean up their dbus usage?

Tony Espy (awe)
Changed in location-service (Ubuntu RTM):
assignee: nobody → Scott Sweeny (ssweeny)
importance: Undecided → High
Revision history for this message
Tony Espy (awe) wrote :

@Michael

Thanks again for your help with testing this.

1) When the phone disconnects from an AP, the scan interval drops to it's lowest possible interval ( 20s ), as the goal is re-connecting as quick as possible. It'll eventually drift back to ~60s.

2) I also get the very brief <= 1s stutter when a WiFi scan occurs. I mention this in comment #57.

If you look at the output of the dbus get-all-match rules script on mako ( comment #55 ), there are many other processes that are attached to the system bus and have specified less-than-optimal DBus match rules. We should indeed try to work through each project.

That *said*, right now I think we should fix the obvious location-service issues which can cause the UI to lockup for close to 5-6s. That needs to be job one. I added a location-services task ( assigned to Scott on my team ) to track. That said, I don't propose we add tasks to this bug for every other project listed in the dbus match rule output... First, we should review with the product team if they're OK with closing out this bug once the location services fix(es) have landed.

As for when you see the problem, again it really depends on what the scan interval is. If you're connected to an AP, it's ~2m, so you might not notice the <=1s stutter when it happens. If the iterval is 15s ( ie. when location-services is in control of scanning ), it's *much* more noticeable.

My krillin right now is max'd out at 5k rules per loc daemon. The UI hangs for 4-5s each time a scan occurs, the frequency is per the above conditions... It's really bad when it gets to this point.

So, the fixes I think we need asap are:

 - fix location-services scanning behavior ( when is it supposed to be turned on, and when is it shut off? seems broken atm )

- fix location-service match rules ( duplicate, and leaks; really only need one rule for all AccessPoint objects )

It also would be nice to understand why the same DBus traffic needs to be monitored by all three processes ( see comment #52 for more on this ).

Revision history for this message
taiebot65 (dedreuil) wrote :

If it is linked to #1462664 Looks like it could be fixed I will try upgrading my system following the fix proposed in the bug and see if it solves the problem.

Revision history for this message
taiebot65 (dedreuil) wrote :

Here my feedback from installing silo 10 as per bug #1462664. I have been walking to town today and did not notice any lag for the time i was in town. Normally i am experiencing some hangs of more than 1 min before i can get a working phone. Today i checked repeatedly my phone and it did wake up like a normal phone it would be nice if someone could confirm this behaviour. I did get though a lot of dialog menu as per https://bugs.launchpad.net/ubuntu/+source/unity8/+bug/1508563. This is less problematic than missing phone calls because the phone is slow as hell because of this bug.

Changed in canonical-devices-system-image:
milestone: ww46-2015 → ww02-2016
Revision history for this message
Tony Espy (awe) wrote :

@Pat

We might want to hotfix this once we come to fully understand it( and I think we're getting close ).

While working with tvoss a fixes to location-services, we realized that by default, the get-all-match.py script only outputs warnings. While reviewing a test run from tvoss, I didn't see any rules at all for posclientd or slpgwd, as he'd fixed the warnings. As I wanted to check whether or not the leak was still there, I realized I needed to specify "--all" with the script.

This revealed a whole other group of processes that also were monitoring AccessPoint signals. I hadn't noticed them when first running the script, as I was just looking at the default warnings output, and these other processes appear to have specified the rules cleanly. It also appears that these processes are also leaking match rules for the AccessPoint signals as well.

In addition to posclientd, and slpgwd ( tvoss has removed the watches for ubuntu-location-servicesd ) the following processes are involved:

 - msyncd ( part of bueto-syncfw )
 - unity8
 - unity8-dash
 - sync-monitor
 - maliit-server

So, when fully maxed out, these too can reach 5k match rules each ( that's 25k for the five of them ). I just verified that they're all north of 1000 match rules on my krillin just now. This explains why after killing the location-service processes, we still were seeing a brief hiccup every time a WiFi scan occurs.

I also noticed that mission-control-5 listed five AccessPoint match rules only, so that while it's listening, it's not leaking...

All of the above processes except mission-control-5 use QtNetworkSession, whereas mission-control-5 uses nm-glib directly.

I've also verified on krillin that AccessPointAdded and Removed signals are being generated when scan occur, and they seem to make sense. One or more of each signals are received, then a PropertiesChanged signal is fired for the AccessPoints property itself, with the changes indicated by the signals now applied. As it appears none of the AcessPoints are ever being removed from these match rules, my guess is that this is a bug in QNetworkSession vs. NM. I've monitored NM and when scanning occurs, APs are continually added/removed. The list can grow or shrink by 3-10 APs at a time. Of course this doesn't absolutely verify that the signals were sent or received.

I'll attach a script which uses dbus-monitor to watch just the signals sent to the Device.Wireless interface, which should result in just monitoring AccessPointAdded/Removed and the associated PropertiesChanged for the AccessPoints property.

Finally, we urgently need to fix the remaining leaks, but we also probably need to have someone determine whether there's any way to get rid of these rules completely. Does maliit-server really need to see 'LastSeen' property changes from individual access points??? Do any of these processes?

Revision history for this message
Tony Espy (awe) wrote :

@taiebot65

I don't think this is related to bug #1462664. This is specifically related to the number of processes watching individual network-manager access point objects, and the fact that these processes leak match rules, so they continue to add match rules to dbus for new access points, but never cleanup the old rules for access points that disappear. Also, per an earlier comment, as a vivid fix related to the ubuntu-software-center relaxed the per process dbus match limits to 5000 ( from 512 ), when all of these processes max out, the total is something like 40k rules that DBus has to match for every incoming message on the bus.

Revision history for this message
Tony Espy (awe) wrote :

Here's a script that calls dbus-monitor with the correct match rules in order to catch AccessPointAdded and Removed signals, as well as PropertiesChanged signals for the Device.Wireless 'AccessPoints' property.

Tony Espy (awe)
no longer affects: buteo-syncfw (Ubuntu)
Revision history for this message
Tony Espy (awe) wrote :

I just added tasks for the packages mentioned in comment #65.

Some things learned today:

 - the sub-component of QNetwork that's most likely the cause is called Bearer Management. One of it's classes, QNetworkSession includes a plugin for NetworkManager which creates listeners for new AccessPoint DBus objects.

 - msyncd ( part of buteo-syncfw ) includes it's own internal NetworkManager class ( dated from 2010 ) which in turn uses QNetworkSession.

 - unity8 appears to use QNetworkAccessManager, which uses QNetworkSession.

 - sync-monitor uses QNetworkConfigurationManager, which is also part of Bearer Management, and most likely causes a QNetworkSession to be instantiated.

- maliit-framework is the only puzzle, as it doesn't seem to have any network code. It does use QGuiApplication however, which might in turn hook into the network code.

 - the bulk of the code involved on the Qt side can be found in the qtbase-opensource-base source package.

 - I'm no expert in QtDBus, but from what I can tell, to listen to a DBus signal ( which causes a match rule to be created ), you need to call connect on an instance of a DBusConnection, similar to what you'd do with a native Qt signal and slot. Conversely, to remove a signal watch, you need to call the disconnect() function.

 - The connect to for a new new AccessPoint object can be found in src/plugins/bearer/networkmanager/networkmanagerservice.cpp in the funcion QNetworkManagerInterfaceAccessPoint::QNetworkManagerInterfaceAccessPoint().

 - I don't see any corresponding cleanup code when an access point is removed ( ie. there's no disconnect() call made to the bus )

That's all I have for now...

Changed in location-service (Ubuntu RTM):
status: New → Triaged
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qtbase-opensource-src (Ubuntu):
status: New → Confirmed
affects: unity8 (Ubuntu RTM) → qtbase-opensource-src (Ubuntu)
Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

In parallel there has been some discussions on bug #1404188 about dropping the usage of network manager backend, even though unlikely in vivid-overlay unless seen as a bullet proof option.

If you want to try out how that would look like, install the silo 032 on rc-proposed but _without_ upgrading the ubuntu-touch-session there. Or if you upgrade ubuntu-touch-session, just move /etc/profile.d/qt_networkmanager_bearer.sh away and reboot. The silo includes also some networking bug fixes besides making the bearer selectable but that's not relevant to this bug.

The reason for keeping network manager bearer has been that with that Qt knows about switching between 3G and WiFi. If you use only generic plugin, have both wifi and 3g enabled and disable wifi, QNAM:s fall dead. Or if you use 3G only and enable wifi, 3G might be continued to be used. If those problems could be resolved on lower levels for example (or if they already are, the NM bearer enabling was done over a year ago), we could switch to the generic plugin.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qtbase-opensource-src (Ubuntu):
status: New → Confirmed
Revision history for this message
Tony Espy (awe) wrote :

@Timo

Thanks for the update. I commented on bug #1404188 regarding this bug. I agree with you that moving to a new backend for the vivid-overlay PPA is risky. That said, if we keep the networkmanager bug backend, this bug really needs to be fixed.

What you say about the networkmanager backend is true, as NM is the controller of our networking, so querying it makes sense if you want to know whether we're on 3g or WiFi. That said, the connectivity API was originally designed to do the same thing, although it was never actually "published" as an API that could be used by the rest of the system. It was designed for this, however the team never finished the work, so AFAIK to date, it's only been used by indicator-network.

In *theory*, it should be possible to query the same exact information from a connectivity-API based backend. It would give far less control however actually permitting changes to the network configuration and/or what interface was activated. I think that's OK, as we delegate all of that to indicator-network. We have no concept of a network-admin capability that can be granted to apps ( again AFAIK ).

Revision history for this message
Tony Espy (awe) wrote :

Here's a first attempt at a patch, which unfortunately doesn't seem to work...

The general idea is when an AccessPointRemoved signal is received, we need to ensure that the associated DBus watch is removed. According to the QDbus documentation, DBusConnection::disconnect() should do this. I unfortunately didn't add any debug logging, so will have to take another stab on Monday ( unless someone else can take a look ).

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Initial attempt to patch QNetworkSession to fix match rules leak" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Revision history for this message
Lorn Potter (lorn-potter) wrote :

You might be running into those dbus blocking calls in the network-manager bearer backend.
The networkmanager bearer plugin currently is using a few blocking dbus calls to get properties and such.
see this abandoned change here:
https://codereview.qt-project.org/#/c/98090/

This change was abandoned as QNAM is synchronous in nature, and expects defaultConfiguration to be ready right away.

I have written a simple bearer backend plugin based on the connectivity-api that might be used instead.
This has one static configuration that it used.

https://codereview.qt-project.org/#/c/140752/

Revision history for this message
Tony Espy (awe) wrote :

@Lorn

I don't think the blocking DBus calls are the culprit per se, although I did see a few while looking at the NM backend. The simple issue is that NM removes AccessPoint DBus objects on a regular basis, the the signal watches for these objects are not being released...

I'd see your new plugin, but I guess there needs to be more discussion regarding whether or not we can replace the NM backend. @Timo seemed to have some reservations about this ( at least for the vivid-phone-overlay ).

Revision history for this message
Pat McGowan (pat-mcgowan) wrote :

re comment #68 and maliit access networking, from Albert on bug #1515712

> We should figure out why we are even doing any network stuff at all,
as we probably shouldn't be.

It comes from the fact that QtMultimedia uses network transparent
methods to load urls used for audio playing, so even if in the keyboard
we're using a local sound for audio feedback, QtMultimedia ends up
bringing up the network backend because it uses those network
transparent methods.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

That's right, any time QNAM is used it uses the Qt Bearer backend (unless Qt is configure without bearer). In the case of the network-manager backend, it calls the blocking dbus calls to get list of services, and then for every service to get that services properties. Even for local file:// access, even though the bearer is not even used for local file access, thanks to QNAM private c'tor.

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

We need only a subset of the NM bearer features so it would be ok to replace or patch to limit its behavior.

What we need is existing QNAM instances to not fall dead if user is on 3G + wifi and then disables wifi, or alternatively if on 3G and wifi gets enabled the connections should migrate to wifi. I don't know if connectivity-api currently has what would be needed for the bearer.

We do not need network controlling in bearer since indicator-network handles that.

Changed in location-service (Ubuntu RTM):
assignee: Scott Sweeny (ssweeny) → Thomas Voß (thomas-voss)
status: Triaged → In Progress
Changed in dbus-cpp (Ubuntu):
assignee: nobody → Thomas Voß (thomas-voss)
importance: Undecided → High
status: New → In Progress
Revision history for this message
Lorn Potter (lorn-potter) wrote :

The generic bearer plugin is brain dead, especially when it comes to knowing when it's actually connected or not.

Although connectivity-api also suffers from blocking dbus calls, I feel the connectivity-api backend is the right way to go for bearer & QNAM right now. In the least, it will lessen the stress on dbus, as ideally, there should be only one Qt entity getting network-manager properties through dbus on a system. At least the blocking calls will be off loaded from everybody and their brother that happens to use QNAM knowingly or blindly.

I am not yet familiar with who originally starts connectivity-api and when things start accessing QNAM at startup, though.

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

I've extracted Lorn's connectivity-api draft commit and made it self-compiling:

https://code.launchpad.net/~timo-jyrinki/+junk/connectivity-api-bearer

You can either compile it yourself on the device or get this build for vivid/rc-proposed: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-059/+build/8313308/+files/libqt5network5-connectivityapibearer_0.0.1%7Eunreleased-0ubuntu1%7Evivid1%7Etest1_armhf.deb

It seems not to crash at least on device, but I don't know how to test it further. To test it:
1. Install the .deb
2. Remove/move the three other bearers from /usr/lib/arm-linux-gnueabihf/qt5/plugins/bearer away so that only libqubuntubearer.so is left
3. Reboot

There'd now be also the new upstream method of QT_EXCLUDE_GENERIC_BEARER=1 that will be in Ubuntu's Qt 5.5 packages, but then I do not how it would be selected if NM or Ubuntu backend is used - if we get connectivity-api bearer working well enough, we should be able to specify it explicitly.

Revision history for this message
Tony Espy (awe) wrote :

Just a quick update on my debugging.

I've verified that the NM bearer plugin's qnetworkmanagerengine class is receiving the 'AccessPointRemoved', is matching them correctly in it's saved list, and is called delete on the QNetworkManagerInterfaceAccessPoint instance. The associated destructor is called, and my new disconnect() method is called, which succeeds.

All that aside, the rules are still piling up, so something else is wrong. I took a quick look at the DBusConnection code, and it looks like the signal matches are reference counted, so it's possible that the count is never reaching zero, so the rule isn't actually released... ( see qdbusconnection.cpp and qdbusintegrator.cpp ).

Revision history for this message
Tony Espy (awe) wrote :

And some more details, by adding 'env QDBUS_DEBUG=1' to the unity8 upstart job, I'm getting a lot more detail without having to further modify the code.

I'm seeing "Adding rule" log statements produced by QDBusConnectionPrivate, but no corresponding "Removing rule" logs, so sounds like my previous theory is on track ( ie. that reference counting may be the cause ). That said, I'll need to add some more logging and rebuild...

Also, while examining the DBus log, it looks like every time the plugin sees a 'PropertiesChanged' message from the AccessPoint, it calls "GetAll", which is pointless as all the changed properties are transmitted with the signal...

Revision history for this message
Lorn Potter (lorn-potter) wrote :

hmmm... I just noticed/remembered that upstream Qt networkmanager plugin has been updated, which should perform a bit better as it has had some refactoring/rewriting done. It is slightly more sane.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

Actually, the only time Qt's QNetworkManagerInterfaceAccessPoint does a GetAll is in the QNetworkManagerInterfaceAccessPoint c'tor. But that is in the upstream version.

And the current ubuntu version (at least the one I am seeing) does not do GetAll at all.

Revision history for this message
Tony Espy (awe) wrote :

Some more details:

 - the version of the qtbase package in touch atm is: 5.4.1+dfsg-2ubuntu10

- I checked and didn't see significant differences in the networkmanager bearer plugin when comparing against 5.5.1, and 5.6.

- That said, there's been quite a few changes in the QDBus code, specifically qdbusintegrator.cpp. Some of the changes include modifications to the signal connect/disconnect code.

- After some further instrumentation, I *think* I've located the bug. I added more debug logging to the QDBus code, so I see when match rules are added and removed, and when the corresponding connect/disconnect calls are made. What I see is that the bearer plugin is calling DBusConnection::connect() for each AccessPoint object that's added. This results in a single match rule being added, however I see a second call to DBusConnection::connect() for the same object path, which cause the ref count to incremented. When the AccessPoint is removed, disconnect() only gets called once, the ref count is decremented, and the match rule lives on.

 - I originally thought this was due to the fact that QNetworkManagerEngine::nmRegistered() calls managerInterface->setConnections(), but at a closer the look the former is registering for Qt signals, while the latter registers for the associated DBus signals.

 - Looking at the constructor of the plugin engine, QNetworkManagerEngine, it connects the serviceRegistered Qt signal to the method nmEnabled(). At the end of the same function, it checks if the service (NM) has already been registered, and if so invokes nmRegistered again via invokeMethod ( that's some Voodoo I don't quite grok ). Looking at the nmRegistered() function, it has a guard which looks like it was copied from ofonoRegisted() as it guards against ofonoManager != NULL, not managerInterface ( which is what gets created in nmRegistered ). I'll add a bit more logging to confirm, but pretty sure this is the cause of the extra match refs, as if nmRegistered() is called twice, it'll cause connect() to be called for AccessPoint objects twice.

 - I also see twice as many "GetAll" method calls sent for each access point, although I haven't dug into why this is so yet...

Revision history for this message
Tony Espy (awe) wrote :

Note, QNetworkManagerEngine::setupConfigurations() is also suspect as it can get called from a couple of places and has a loops over the current devices and calls deviceAdded().

Revision history for this message
Lorn Potter (lorn-potter) wrote :

nmRegistered will only get called from the connection signal/slot when networkmanager daemon if and when gets registered on the dbus. If it is already registered, it will get called from that invokeMethod.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

You are correct in that GetAll gets called twice for each accesspoint.
I can see that by adding qDebug's and disable/enable wifi

Revision history for this message
Lorn Potter (lorn-potter) wrote :

This patch fixes duplicate entries of access points. Which contributes to but may not totally fix this bug.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

Sorry, thought I'd include more.

2nd patch that obsoletes my previous patch includes disconnecting from system dbus when objects get deleted.

Revision history for this message
Tony Espy (awe) wrote :

I grabbed your patch and incorporated most of it into my latest patch ( I left out a few of the disconnect calls for some of the other NM interfaces not directly involved in the match rules problem ), however it failed to build...

While trying to just pull in the pieces and resolve the conflicts with the additional debug statements I'd added, I managed to remove the wrong function declaration in qnetworkmanagerservice.h. I removed path() from QNetworkManagerInterface instead of QNetworkMangerInterfaceDeviceWireless. I've refreshed the patch and re-pushed to my PPA. Let's hope it builds this time.

Tony Espy (awe)
Changed in qtbase-opensource-src (Ubuntu):
assignee: nobody → Tony Espy (awe)
importance: Undecided → High
status: Confirmed → In Progress
Revision history for this message
Tony Espy (awe) wrote :

Here's a new version of the patch with all the access-point related signals removed.

It makes no sense for any of this logic to be present in our system, as none of our code attempts to control the network configuration via Bearer Mgmt. With this patch, the plugin will never receive any NM access-point related signals, and therefore it'll never populate it's access point list.

Also, as pointed out in other comments, the bearer plugin includes blocking DBus calls that are made in response to every single 'PropertiesChanged' signal sent for an access point, which causes a ripple of DBus traffic every time a scan occurs.

@Lorn, if you could look over my patch, I'd appreciate it.

It'd be also nice to get some testing from @faenil and @mterry before we push forward with landing plans. I've confirmed that the CPU spikes in the dbus daemon go away completely with the changes. That said, it'd be good to get some independent results.

Until we have a silo prep'd for Qt, you can just install the libqt5network5_5.4.1+dfsg-2ubuntu10~awe16_armhf.deb from my PPA:

https://launchpad.net/~awe/+archive/ubuntu/ppa/+packages

You also probably want to install dbus-cpp and location-services from:

https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-026

The status of the location-service silo says it needs a rebuild, but the packages in the PPA should be OK for testing. If you have any questions, ask before you install anything.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

I agree, since nothing controls the connections through QtBearer (and the platform doesn't want that), and AP lists are not allowed in contained apps, it makes sense to simply remove them.

I've fixed up that patch in regards to d'tor disconnects, removed some redundant code.

And since it was a small change, I fixed up the defaultConfiguration code that would not correctly detect when the default route changed from mobile to wifi, thus possibly allowing network requests to go through expensive mobile data when wifi is available.

Revision history for this message
Michael Terry (mterry) wrote :

I tested silo 26 + libqt5network5 from Tony's PPA and did my normal routine of walking out of range of my home network.

It seems much better. dbus-daemon CPU usage is down (only ~30% for a few seconds after initial switch to 3G and ~15% during scans after).

But more importantly, stuttering is much better. Only a couple tiny stutters on initial switch. And none during scans. I'm +100 on trying to land this.

(My test was on a relatively-recently-rebooted phone, so maybe not the ideal test. I'll try again later after a while of uptime to double confirm the results, but this looks good. If I don't post again, assume it worked great.)

Revision history for this message
Thomas Voß (thomas-voss) wrote : Re: [Bug 1480877] Re: Access points' "PropertiesChanged" dbus signals freeze UI on mobile devices

Quick update on silo 26: I'm rebuilding location-service to account
for a recent landing.
The one blocking issue is mediascanner2, which requires a rebuild,
too. It is however enabled for dual-landing
to both vivid and xenial, and we cannot easily do a vivid+o landing
with it right now. I'm working on the resolving that issue,
but it will likely take a day or two to complete.

On Mon, Nov 23, 2015 at 3:49 PM, Michael Terry
<email address hidden> wrote:
> I tested silo 26 + libqt5network5 from Tony's PPA and did my normal
> routine of walking out of range of my home network.
>
> It seems much better. dbus-daemon CPU usage is down (only ~30% for a
> few seconds after initial switch to 3G and ~15% during scans after).
>
> But more importantly, stuttering is much better. Only a couple tiny
> stutters on initial switch. And none during scans. I'm +100 on trying
> to land this.
>
> (My test was on a relatively-recently-rebooted phone, so maybe not the
> ideal test. I'll try again later after a while of uptime to double
> confirm the results, but this looks good. If I don't post again, assume
> it worked great.)
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1480877
>
> Title:
> Access points' "PropertiesChanged" dbus signals freeze UI on mobile
> devices
>
> Status in Canonical System Image:
> Confirmed
> Status in Unity 8:
> New
> Status in dbus-cpp package in Ubuntu:
> In Progress
> Status in indicator-network package in Ubuntu:
> Incomplete
> Status in network-manager package in Ubuntu:
> Incomplete
> Status in qtbase-opensource-src package in Ubuntu:
> In Progress
> Status in buteo-syncfw package in Ubuntu RTM:
> New
> Status in location-service package in Ubuntu RTM:
> In Progress
> Status in maliit-framework package in Ubuntu RTM:
> New
> Status in network-manager package in Ubuntu RTM:
> Incomplete
> Status in sync-monitor package in Ubuntu RTM:
> New
>
> Bug description:
> Krillin, rc-proposed, r83
>
>
> DESCRIPTION:
> I've been trying to track down the cause of the occasional UI freezes on my Krillin device, and I noticed that whenever the UI freezes for 2-4 seconds, I get a burst of "PropertiesChanged" signals in dbus-monitor
>
> Here's a log of what's shown in dbus-monitor:
> http://pastebin.ubuntu.com/11992322/
>
> I'd guess the problem is in the code that actually catches the signals
> and acts accordingly.
>
> HOW TO REPRODUCE:
> 1) Move to a place where many wifi hotspots are available
> 2) Connect the device via USB and run "phablet-shell" and then "dbus-monitor"
> 3) Use the device while keeping an eye on dbus-monitor output
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/canonical-devices-system-image/+bug/1480877/+subscriptions

Revision history for this message
Thomas Voß (thomas-voss) wrote :

Silo 26 is now complete.

On Mon, Nov 23, 2015 at 4:47 PM, Thomas Voß <email address hidden> wrote:
> Quick update on silo 26: I'm rebuilding location-service to account
> for a recent landing.
> The one blocking issue is mediascanner2, which requires a rebuild,
> too. It is however enabled for dual-landing
> to both vivid and xenial, and we cannot easily do a vivid+o landing
> with it right now. I'm working on the resolving that issue,
> but it will likely take a day or two to complete.
>
> On Mon, Nov 23, 2015 at 3:49 PM, Michael Terry
> <email address hidden> wrote:
>> I tested silo 26 + libqt5network5 from Tony's PPA and did my normal
>> routine of walking out of range of my home network.
>>
>> It seems much better. dbus-daemon CPU usage is down (only ~30% for a
>> few seconds after initial switch to 3G and ~15% during scans after).
>>
>> But more importantly, stuttering is much better. Only a couple tiny
>> stutters on initial switch. And none during scans. I'm +100 on trying
>> to land this.
>>
>> (My test was on a relatively-recently-rebooted phone, so maybe not the
>> ideal test. I'll try again later after a while of uptime to double
>> confirm the results, but this looks good. If I don't post again, assume
>> it worked great.)
>>
>> --
>> You received this bug notification because you are a bug assignee.
>> https://bugs.launchpad.net/bugs/1480877
>>
>> Title:
>> Access points' "PropertiesChanged" dbus signals freeze UI on mobile
>> devices
>>
>> Status in Canonical System Image:
>> Confirmed
>> Status in Unity 8:
>> New
>> Status in dbus-cpp package in Ubuntu:
>> In Progress
>> Status in indicator-network package in Ubuntu:
>> Incomplete
>> Status in network-manager package in Ubuntu:
>> Incomplete
>> Status in qtbase-opensource-src package in Ubuntu:
>> In Progress
>> Status in buteo-syncfw package in Ubuntu RTM:
>> New
>> Status in location-service package in Ubuntu RTM:
>> In Progress
>> Status in maliit-framework package in Ubuntu RTM:
>> New
>> Status in network-manager package in Ubuntu RTM:
>> Incomplete
>> Status in sync-monitor package in Ubuntu RTM:
>> New
>>
>> Bug description:
>> Krillin, rc-proposed, r83
>>
>>
>> DESCRIPTION:
>> I've been trying to track down the cause of the occasional UI freezes on my Krillin device, and I noticed that whenever the UI freezes for 2-4 seconds, I get a burst of "PropertiesChanged" signals in dbus-monitor
>>
>> Here's a log of what's shown in dbus-monitor:
>> http://pastebin.ubuntu.com/11992322/
>>
>> I'd guess the problem is in the code that actually catches the signals
>> and acts accordingly.
>>
>> HOW TO REPRODUCE:
>> 1) Move to a place where many wifi hotspots are available
>> 2) Connect the device via USB and run "phablet-shell" and then "dbus-monitor"
>> 3) Use the device while keeping an eye on dbus-monitor output
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/canonical-devices-system-image/+bug/1480877/+subscriptions

Revision history for this message
Tony Espy (awe) wrote :

@Lorn

Thanks for the updated patch.

The only significant difference besides the DeviceModem destructor fix was the change you mentioned about default routes. Is there a touch-specific bug that this fixes? On our system, the modem connection will never be IPv6, whereas it is possible to get a IPv6 connection for WiFi. Perhaps we keep this separate?

Revision history for this message
Lorn Potter (lorn-potter) wrote :

@Tony
I noticed this on touch.
The problem was that the ConnectionActive was returning true for mobile data ipv6 default route when wifi had the actual default route, so it would never get updated when wifi became default.

Revision history for this message
Tony Espy (awe) wrote :

@Lorn

Did you see this recently? If so, what device and image?

As I mentioned, the mobile connection on touch doesn't support IPv6 connections. There was a bug in NM that falsely reported that a default IPv6 route was available, but this was fixed awhile back. Please see bug #1444162 for details. Perhaps this is what you were seeing?

Revision history for this message
Michael Terry (mterry) wrote :

Here's a followup on living with silo 26 (pre-comment 97, so an earlier version, but I doubt it matters for testing?) and Tony's libqt5network5.

I tested this morning (so a full day of the phone living on this code) and it was slightly worse -- small stutters instead of tiny ones. But still a vast improvement on what it was before these changes. I assume the degradation was due to other actors on the phone leaking signal watches. Or just random chance.

But I still love the changes.

Revision history for this message
Tony Espy (awe) wrote :

@Mike

Small stutters when transitioning from mobile to WiFi or visa versa, or stutters when scanning. The fixes to location-services and Qt both are applicable to the latter case only...

Revision history for this message
Michael Terry (mterry) wrote :

@Tony, small stutters when transitioning from wifi to mobile. I don't see stutters when scanning anymore. And the stutters I do see when transitioning from wifi to mobile seem better (presumably because less dbus signal watchers are registered now for all the traffic that happens on a switch?).

Revision history for this message
Tony Espy (awe) wrote :

Here's my updated patch, the only difference is that it includes a DBusConnection disconnect() call in the destructor for QNetworkManagerInterfaceDeviceModem(), which I'd punted on for expediency while testing earlier this week.

If doesn't include the IPv6 related fix, as this was fixed in NM already.

Our plan is to get this into silo-026 ASAP for testing as a potential hotfix along with the location-services match rules fixes.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

I changed my phones image channel, tested (with a few specific test apps I have for QNAM & friends) that new patch and didn't see any thing too wrong.

Although I am seeing heaps of GetAll calls that do not need to happen (even for non nm managed interfaces - rmnet). Attached patch simplifies/optimizes more. Reduces the number of blocking calls from about 50 to ~4 when default route changes.

Revision history for this message
Tony Espy (awe) wrote :

@Lorn

Can you add some details regarding what image channel you actually used and device for testing? rc-proposed is probably the best base for you to use.

The "GetAll" behavior you're describing is similar to what I was seeing with the AccessPoint objects. The code would see a "PropertiesChanged" signal for an access point, and directly issue a blocking "GetAll" call. This should only be done at initial object creation. Once initialized, the properties can be monitored directly via "PropertiesChanged" signals and the included payload. I imagine this is probably what you've done with your patch.

I'll review your updates, and confer with some others to see if we defer your additional changes to OTA9, or try to get them in as part of a proposed hot-fix.

Thanks again for your help!

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

The version 4 of the patch (from comment #104) is in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-026 at the moment.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

@Tony, currently I am on stable/ubuntu-developer. Previously I was testing on stable/ubuntu. I was on rc-proposed for quite a while, but it seemed to have only a few scopes, and not the twitter scope which I wanted to try out.

My updated patch removes actions on device added and removed calls. Most of these were rmnet related (which appears to not be managed by nm anyway), and relies on active connection and settings changes.
I haven't tested it in regards to hotspot or bluetooth tethering.

Revision history for this message
Tony Espy (awe) wrote :

@Lorn

Thanks for the update.

I usually do all my testing on rc-proposed, as it's latest and greatest. I wasn't aware of the "ubuntu-developer" channel. I've tested on rc-proposed/bq-aquaris.en ( #188 ) with the full set of packages from silo-026, and everything seemed good to me.

Thar said, I've reviewed your latest changes and they look good to me. Not creating QNetworkManagerInterfaceDevice and DeviceWireless instances definitely gets rid of the a bunch more DBus signal match rules, and unneeded DBus calls. DeviceWireless is still generates PropertiesChanged signals every time a scan finishes...

With this code removed, the plugin operates soley on active and system connection updates.

Regarding the device adds for 'rmnet' devices on mako, in theory these should've only been happening at start-up time and/or whenever flight-mode is turned off.

Anyways, I'm +1 for including your latest revision of the patch. Thanks again for your help!

Revision history for this message
Tony Espy (awe) wrote :

Note, I've also pushed a new version based on your latest patch to my PPA if anyone's interested in testing before it lands in silo-026:

https://launchpad.net/~awe/+archive/ubuntu/ppa

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

Ver 5 in 026 now too (still building), together with the gles counterpart packagae.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

@Tony I suppose that if statement in QNetworkManagerEngine::requestUpdate() could also be removed, leaving the last line only, since we do not do anything with uknown AP's

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

Not related to the bug at hand directly, but applying the signals5.patch on top of Qt 5.5.1 caused a crash to always happen on Unity8 startup:
/usr/bin/unity8:11:/usr/lib/arm-linux-gnueabihf/qt5/plugins/bearer/libqnmbearer.so+7c7c:/usr/lib/arm-linux-gnueabihf/libQt5Core.so.5.5.1+1e4f5a:[stack..3416]+7fd8b8

So my suggestion is to ship whatever needed for 5.4 quickly but have the longer term goal of shipping connectivity-api bearer so that the 5.5/xenial will wait for that (unless someone wants to look at the 5.5 specific problem with this current patch).

Revision history for this message
Tony Espy (awe) wrote :

@Timo

As previously mentioned, I'd like to get this change out as a hot-fix, and then do proper evaluation of the new connectivity API based plugin with the goal to land it as part of OTA9.

@Lorn

I'll take a look at that the line you mentioned, however we need to draw the line at some point and re-focus on the replacement plugin as mentioned above.

Changed in canonical-devices-system-image:
milestone: ww02-2016 → ww50-2015
Revision history for this message
Tony Espy (awe) wrote :

The latest version of the patch is now in a silo-026, included as part of a new qtbase version 5.4.1+dfsg-2ubuntu11~vivid1.

Changed in qtbase-opensource-src (Ubuntu):
status: In Progress → Fix Committed
Changed in sync-monitor (Ubuntu RTM):
status: New → Incomplete
Revision history for this message
Tony Espy (awe) wrote :

The fix has landed in silo-026, as part of location-services version: 2.1+15.04.20151202.1-0ubuntu1

Changed in location-service (Ubuntu RTM):
status: In Progress → Fix Committed
no longer affects: sync-monitor (Ubuntu RTM)
no longer affects: maliit-framework (Ubuntu RTM)
no longer affects: indicator-network (Ubuntu)
no longer affects: buteo-syncfw (Ubuntu RTM)
Changed in dbus-cpp (Ubuntu RTM):
status: New → Fix Committed
Revision history for this message
Tony Espy (awe) wrote :

The associated version for the dbus-cpp in silo-026 is: 4.3.0+15.04.20151126-0ubuntu1

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

This bug was fixed in the package location-service 2.1+15.04.20151202.1-0ubuntu1 in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/stable-phone-overlay

---------------

location-service (2.1+15.04.20151202.1-0ubuntu1) vivid; urgency=medium

  * Ensure that event connections are cleaned up on destruction. (LP:
    #1480877)

 -- Thomas Voß <email address hidden> Wed, 02 Dec 2015 12:12:21 +0000

Changed in location-service (Ubuntu RTM):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

This bug was fixed in the package dbus-cpp 4.3.0+15.04.20151126-0ubuntu1 in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/stable-phone-overlay

---------------

dbus-cpp (4.3.0+15.04.20151126-0ubuntu1) vivid; urgency=medium

  [ CI Train Bot ]
  * New rebuild forced.

  [ Thomas Voß ]
  * Ensure that Signal with non-void argument types correctly narrow
    their match rules. (LP: #1480877)

 -- Thomas Voß <email address hidden> Thu, 26 Nov 2015 07:31:37 +0000

Changed in dbus-cpp (Ubuntu RTM):
status: Fix Committed → Fix Released
Changed in canonical-devices-system-image:
status: Confirmed → Fix Committed
Revision history for this message
Tony Espy (awe) wrote :

Just a quick update to mention that the change to the NM bearer plugin seems to have caused a regression when qtbase is installed from the PPA onto a desktop ( for UI development work ).

To reproduce, the overlay PPA needs to be added as a software source, then ubuntu-system-settings installed/upgraded ( which pulls in qtbase ). See bug #1523975 for details.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

Is it possible that this fix might cause a phone to wake up more often, or to stay awake longer? I'm seeing weird idle power behavior ever since this fix landed in krillin rc-proposed 196.

Before: ~9 mA
http://people.canonical.com/~platform-qa/power-results/2015-12-08_03:19:30-krillin-195-power_usage_idle/graph.png

After: ~35 mA
http://people.canonical.com/~platform-qa/power-results/2015-12-08_07:36:16-krillin-196-power_usage_idle/graph.png

Here's the commit log for 196:

http://people.canonical.com/~lzemczak/landing-team/ubuntu-touch/rc-proposed/196.commitlog

Arale results look fine. I don't know why krillin is different, but I'm only seeing this behavior on krillin when wifi is connected... and this bug fix is a likely suspect. I don't see additional forks or crashes in the test logs, which suggests the issue is probably caused by a long-running process.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

I created bug 1524133 to track the issues I'm seeing. I don't know if it's related to this bug, but so far it looks probable.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

patch #6 fixes crash in system-settings (#1523975) and removes wifi scanning updates, which could potentially drain the battery (#1524133)

Revision history for this message
Thomas Voß (thomas-voss) wrote :

Thanks for the graphs. One thing that stands out is the time that the
system stays active when occasionally waking up from deep sleep (right
hand side of the graph).
Do you have the raw data producing those graphs handy?

Thanks,

  Thomas

On Wed, Dec 9, 2015 at 1:23 AM, Selene Scriven
<email address hidden> wrote:
> Is it possible that this fix might cause a phone to wake up more often,
> or to stay awake longer? I'm seeing weird idle power behavior ever
> since this fix landed in krillin rc-proposed 196.
>
> Before: ~9 mA
> http://people.canonical.com/~platform-qa/power-results/2015-12-08_03:19:30-krillin-195-power_usage_idle/graph.png
>
> After: ~35 mA
> http://people.canonical.com/~platform-qa/power-results/2015-12-08_07:36:16-krillin-196-power_usage_idle/graph.png
>
> Here's the commit log for 196:
>
> http://people.canonical.com/~lzemczak/landing-team/ubuntu-touch/rc-
> proposed/196.commitlog
>
> Arale results look fine. I don't know why krillin is different, but I'm
> only seeing this behavior on krillin when wifi is connected... and this
> bug fix is a likely suspect. I don't see additional forks or crashes in
> the test logs, which suggests the issue is probably caused by a long-
> running process.
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1480877
>
> Title:
> Access points' "PropertiesChanged" dbus signals freeze UI on mobile
> devices
>
> Status in Canonical System Image:
> Fix Committed
> Status in Unity 8:
> New
> Status in dbus-cpp package in Ubuntu:
> In Progress
> Status in network-manager package in Ubuntu:
> Incomplete
> Status in qtbase-opensource-src package in Ubuntu:
> Fix Committed
> Status in dbus-cpp package in Ubuntu RTM:
> Fix Released
> Status in location-service package in Ubuntu RTM:
> Fix Released
> Status in network-manager package in Ubuntu RTM:
> Incomplete
>
> Bug description:
> Krillin, rc-proposed, r83
>
>
> DESCRIPTION:
> I've been trying to track down the cause of the occasional UI freezes on my Krillin device, and I noticed that whenever the UI freezes for 2-4 seconds, I get a burst of "PropertiesChanged" signals in dbus-monitor
>
> Here's a log of what's shown in dbus-monitor:
> http://pastebin.ubuntu.com/11992322/
>
> I'd guess the problem is in the code that actually catches the signals
> and acts accordingly.
>
> HOW TO REPRODUCE:
> 1) Move to a place where many wifi hotspots are available
> 2) Connect the device via USB and run "phablet-shell" and then "dbus-monitor"
> 3) Use the device while keeping an eye on dbus-monitor output
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/canonical-devices-system-image/+bug/1480877/+subscriptions

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

qtbase 5.4.1+dfsg-2ubuntu11~vivid2 is ready for testing at https://requests.ci-train.ubuntu.com/#/ticket/761 with Lorn's net-bearer-nm-disconnect-ap-signals6.patch from bug #1480877 that includes a fix for both u-s-s crasher bug #1523975 and a potential fix to too many wifi scanning updates bug #1524133.

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

The interdiff from signals5 to signals6: http://paste.ubuntu.com/13858649/ - removal of code and one-liner to fix the system-settings.

Revision history for this message
Tony Espy (awe) wrote :

So the latest changes in Lorn's signals6 patch can be summarized as:

1) remove QNetworkManagerEngine::requestUpdate (), which is a public method which can trigger a WiFi scan request to NM.

This method doesn't appear to be called internally by the bearer plugin, so if we think extra WiFi scans are the culprit, there must have been an change to one of our Qt applications which is calling this method. Also as this method wasn't added in the original patch, it's presence can't really be considered a regression. Has anyone attempted to determine if we *are* scanning more often? This could be done manually by running wpa_cli ( as root ), and watching for the frequency of scan events output. It can also be accomplished by looking at the wpa_suppl and/or NM log messages in syslog. And finally, you could also just monitor DBus looking for ScanDone signals from NM.

NOTE - I'm pretty sure this patch won't compile unless this method is also removed from qnetworkmanagerengine.h too, but I haven't tried to build it myself...

 2) check that wiredDevice pointer is valid before using it to call a method

This is the crasher fix when USS/Qt from the PPA gets installed on a desktop. Looks fine to me.

 3) remove most of the logic from the engine's parseConnection() method. This method takes a connection path, and creates a private configuration object, and then based upon the underlying device type, may modify the private configuration instance in a device-specific way before returning it to the caller. The device-specific logic in some cases could have side-effects, such as modifying the global accessPointConfigurations hash table and/or the configuredAccessPoint map.

Again, this chnage looks reasonable to remove, however I don't see how it could have any impact on power. This code only runs during initialization where all existing system connections are loaded, and whenever a new system connection is added ( ie. a user connects to a new AP or APN ).

I'm not sure whether we want to include this the last change if we want to keep the delta as small as possible?

Revision history for this message
Lorn Potter (lorn-potter) wrote :

@Tony
1) requestUpdate() should be left in, as the backend needs to tell other parts when the update request has been completed. I have tested this, so it does compile. In further testing, it does not look like any clients are requesting updates, so the increased power consumption is coming from somewhere else.

2) The crash is because of dead code, and included change is just a quick fix. Proper fix is to go through this and remove the rest of the dead and unneeded code.

3) If you want this delta as small as possible, it is safe to remove this from the patch and only include #2. There is more dead code in there now that it only relies on settings and active connections, which I will remove in a different patch.

Revision history for this message
Lorn Potter (lorn-potter) wrote :

fyi: requestUpdate is called by the qnetworkconfigurationmanager, when updateConfigurations() gets called.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

tvoss: The raw data is available. Simply remove "graph.png" from the URL and it'll give you a directory with all sorts of logs.

The green-shaded section is the active measurement period used for statistics. The red-shaded portions indicate where USB was connected. Everything else is unplugged.

Revision history for this message
Selene ToyKeeper (toykeeper) wrote :

FWIW, I see the weird new behavior on every test with wifi enabled, even when the screen is on and the phone is actively doing something. For example, while playing music:

http://people.canonical.com/~platform-qa/power-results/2015-12-08_09:19:36-krillin-196-power_usage_music/graph.png

... versus how it looked before:

http://people.canonical.com/~platform-qa/power-results/2015-12-08_04:50:02-krillin-195-power_usage_music/graph.png

But in flight mode, both graphs are a flat line barely above zero:

http://people.canonical.com/~platform-qa/power-results/2015-12-08_11:07:01-krillin-196-power_usage_flight_mode_on/graph.png

So, it's not just while sleeping; it's any time wifi is enabled.

Revision history for this message
Tony Espy (awe) wrote :

@Selene

Just wanted to confirm that we don't need any more investigation re: your comment #131 ( I think that may be the largest comment # I've seen in a LP bug ), as your latest update to bug #1524133 indicated that you'd isolated the problem to a SIM with an expired data plan?

@Timo

We should probably close this bug out when OTA8.5 is released, and then push another update to the PPA that contains the updated patch with the desktop crash fix ( bug #1523975 ). I've attached yet another of the version of the patch which just includes the additional wiredDevice NULL check from Lorn's latest patch.

@Lorn

Thanks again for all the help. Again, once the OTA8.5 update has been released, we can focus on transitioning to the connectivity API based bearer plugin for our next update.

Revision history for this message
Timo Jyrinki (timo-jyrinki) wrote :

Patch signals7 (with just the system settings crash fixed) building now in the PPA.

Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released
Tony Espy (awe)
Changed in qtbase-opensource-src (Ubuntu):
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Changed in qtbase-opensource-src (Ubuntu RTM):
status: New → Fix Released
assignee: nobody → Tony Espy (awe)
importance: Undecided → High
Changed in qtbase-opensource-src (Ubuntu):
status: Fix Committed → Triaged
Changed in dbus-cpp (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qtbase-opensource-src - 5.5.1+dfsg-13ubuntu2

---------------
qtbase-opensource-src (5.5.1+dfsg-13ubuntu2) xenial; urgency=medium

  * Forward-port networking fixes from 5.4 series:
    - net-bearer-nm-disconnect-ap-signals7.patch (LP: #1480877)
    - qnam-ubuntu-fix6.patch (LP: #1528886)
    - xenial would potentially now have fixes for (LP: #1506015)
      (LP: #1507769) (LP: #1533508)

 -- Timo Jyrinki <email address hidden> Tue, 09 Feb 2016 08:19:43 +0000

Changed in qtbase-opensource-src (Ubuntu):
status: Triaged → Fix Released
Aron Xu (happyaron)
tags: added: nm-touch
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Michał Sawicz (saviq)
affects: unity8 → unity8 (Ubuntu)
Changed in unity8 (Ubuntu):
status: New → Confirmed
Changed in unity8 (Ubuntu):
status: New → Confirmed
To post a comment you must log in.