Comment 52 for bug 1480877

Revision history for this message
Tony Espy (awe) wrote :

One of the frustrating parts about this bug is the incomplete description, and the fact that it's been nigh impossible to nail down a set of concrete steps to reproduce...

Also, the bug description contends that NM 'PropertiesChanged' signals are causing the UI freezes, however we have no idea why...

We haven't been thinking enough about what's listening for the signals. "If a tree falls in the woods and nobody hears it?" The more listeners for a particular DBus message, the more work the daemon has to due ( and the more context switches involved ).

Also, Mike Terry had mentioned in one of this comments that he'd seen the dbus daemon was spending most of it's time in it's match rules logic. Turns out that a new method had been added by upstream to the dbus Stats interface called 'GetAllMatchRules'. I'd been interested in this, for awhile, but finally got time to work on it.

Two other points before I dive into the match rules:

As of the latest rc-proposed for krillin ( # ), when the WiFi is enabled and the HERE app, Nearby scope, or Today scope are displayed, the location services increases the rate of WiFi scanning to once every 12-15s ( usually it's around 2m if connected to an AP ). When I decided to look at the incoming traffic to NM, I noticed that everytime a scan occurs, three different processes all seem to query the same set of NM AccessPoint objects. The clients all use 'Get' instead of 'GetAll' for the properties, and again it looks like each process is generating the same set of method requests. The processes in question are:

 - posclientd
 - slpgwd
 - ubuntu-location-serviced

The amount of traffic isn't a huge impact, but it could be made more efficient.

Back to the who's listening to the signals approach...

First, I got lucky and was able to run dbus 1.10.0-1ubuntu1 from wily on my krillin on first try. This krillin was freshly flashed to the latest rc-proposed ( #162 ).

When I ran the example get-all-match.py script ( see attached ) for the system bus ( use sudo and specify --system ), I was blown away that the same three location services processes previously mentioned above, each had 512 match rules for NM defined! 509 of the 512 were also flagged as warnings as no signal name was specified. It looks like these location processes are adding watches ( sometimes duplicates ) for every single AccessPoint object that ever gets created by NM. It also looks like these watches are never being cleaned up.

So... I tried the same experiment on a mako, also freshly flashed ( rc-proposed, #150 ). Turns out I got lucky with krillin, mako didn't like the new dbus, and wouldn't boot into the UI. I had re-flash, then backport the changes to the version in vivid ( 1.8 ). This verions ( ~awe3 ) is available in my PPA:

https://launchpad.net/~awe/+archive/ubuntu/ppa

When I ran this version on mako, the first thing I noticed was that the match rule counts were much lower at first. As I didn't have apps/scopes to drive the increased WiFi scanning, I triggered extra scans by hand and watched the match rule counts ...and they slowly started to grow! After an hour or so, they actually exceeded 512 rules per process!

It turns out the dbus configuration of the system bus in vivid was changed to allow up to *5000* match rules per process. See the attached patch for bug #454093. As I'm wrapping up this comment, each process is just about to hit 1000 rules each ( after 3h of uptime ).

I tried toggling WiFi on/off/on, but this doesn't seem to effect the number of match rules involved.

Tomorrow, I will try and reproduce the hangs by using the backport version of dbus from my PPA on krillin and let it run all day.

One last note, it'd be good to confirm that fully stopping these location service processes makes the problem go away. That said, I wasn't able to find an easy way to get the processes involved to shut down. Disabling from the indicator doesn't do it, and stopping the system upstart job ( 'sudo stop ubuntu-location-service' ) didn't do so either.