Push-client frequently blocking phone suspend
| Affects | Status | Importance | Assigned to | Milestone | |
|---|---|---|---|---|---|
| | Canonical System Image |
Critical
|
John Lenton | ||
| | Ubuntu Push Notifications |
Fix Released
|
Undecided
|
Samuele Pedroni | |
| | ubuntu-push (Ubuntu) |
Undecided
|
Unassigned | ||
| | ubuntu-push (Ubuntu RTM) |
Undecided
|
Unassigned | ||
Bug Description
current build number: 256
device name: krillin
channel: ubuntu-
last update: 2015-03-22 22:01:56
version version: 256
version ubuntu: 20150312
version device: 20150310-3201c0a
version custom: 20150216-561-29-186
While debugging the suspend process with my krillin, I noticed that sometimes ubuntu-push-client holds a powerd sysstate for too long (minutes), blocking the suspend process entirely.
Follows the log I got from powerd:
Mar 22 23:09:21 ubuntu-phablet powerd[964]: handle_
Mar 22 23:09:21 ubuntu-phablet powerd[964]: name_watch_add: looking for :1.92
Mar 22 23:09:21 ubuntu-phablet powerd[964]: watching :1.92 to see when it disappears on dbus
Mar 22 23:09:21 ubuntu-phablet powerd[964]: libsuspend: acquire_wake_lock: powerd_
Mar 22 23:09:21 ubuntu-phablet powerd[964]: handle_
Mar 22 23:09:21 ubuntu-phablet powerd[964]: Enqueue state change to ACTIVE
Mar 22 23:09:21 ubuntu-phablet powerd[964]: exiting suspend
Mar 22 23:09:21 ubuntu-phablet powerd[964]: libsuspend: calling exit_suspend
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544062]PM: suspend exit 2015-03-23 02:09:21.573597769 UTC
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544113]active wakeup source: MT662x
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544129]active wakeup source: alarm
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.554088]
Mar 22 23:09:21 ubuntu-phablet powerd[964]: calling wait_for_next_alarm
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.563938]
Mar 22 23:09:22 ubuntu-phablet powerd[964]: libsuspend: exit_suspend succeeded
Mar 22 23:09:22 ubuntu-phablet powerd[964]: Emitting signal for transition to state ACTIVE (1)
Mar 22 23:09:22 ubuntu-phablet powerd[964]: Transition to ACTIVE complete
Mar 22 23:09:22 ubuntu-phablet powerd[964]: libsuspend: release_wake_lock: powerd_
Mar 22 23:09:22 ubuntu-phablet powerd[964]: handle_
Mar 22 23:09:22 ubuntu-phablet powerd[964]: void {anonymous}
Mar 22 23:09:22 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-23 11:00:00 for indicator-datetime
Mar 22 23:12:10 ubuntu-phablet powerd[964]: handle_
Mar 22 23:12:10 ubuntu-phablet powerd[964]: void {anonymous}
Mar 22 23:12:10 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-22 23:17:10 for ubuntu push client
Mar 22 23:12:10 ubuntu-phablet powerd[964]: handle_
Mar 22 23:12:10 ubuntu-phablet powerd[964]: clear_sys_request: ubuntu push client - :1.92
Mar 22 23:12:10 ubuntu-phablet powerd[964]: name_watch_remove: looking for :1.92
Mar 22 23:12:10 ubuntu-phablet powerd[964]: name_watch: ref_count for :1.92 is now 0
Mar 22 23:12:10 ubuntu-phablet powerd[964]: no longer watching :1.92, there are no more requests
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: acquire_wake_lock: powerd_
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Enqueue state change to SUSPEND
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: prepare_suspend.
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Emitting signal for transition to state SUSPEND (0)
Mar 22 23:12:10 ubuntu-phablet powerd[964]: entering suspend
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: calling enter_suspend
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: enter_suspend succeeded
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Transition to SUSPEND complete
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: release_wake_lock: powerd_
It took ubuntu-push-client 169 seconds to release the sysstate from powerd, and that probably only happened after powerd triggered the alarm that was scheduled by ubuntu-push-client.
Related branches
- John Lenton: Approve on 2015-03-24
-
Diff: 46 lines (+10/-5)2 files modifiedclient/session/session.go (+9/-4)
client/session/session_test.go (+1/-1)
- Samuele Pedroni: Approve on 2015-03-30
-
Diff: 85 lines (+20/-3)4 files modifiedclient/client.go (+2/-0)
client/client_test.go (+1/-0)
debian/config.json (+2/-1)
poller/poller.go (+15/-2)
| Changed in canonical-devices-system-image: | |
| status: | New → Confirmed |
| importance: | Undecided → Critical |
| milestone: | none → ww13-ota |
| Changed in canonical-devices-system-image: | |
| assignee: | nobody → John Lenton (chipaca) |
| Ricardo Salveti (rsalveti) wrote : | #1 |
| Ricardo Salveti (rsalveti) wrote : | #2 |
From ./.cache/
2015/03/23 14:49:51.103334 DEBUG got woken up; time is 2015-03-23 14:49:51.103221076 -0300 BRT (𝛥: 103.221076ms)
2015/03/23 14:49:51.108670 DEBUG got wakelock cookie of 239b6ccb-
2015/03/23 14:52:16.501222 DEBUG connectivity: timer signal, state: ConnectedGlobal, checking...
2015/03/23 14:52:16.501368 DEBUG poking polld.
2015/03/23 14:52:16.507786 DEBUG waiting for polld to signal Done.
2015/03/23 14:52:17.326528 DEBUG WatchMethod: ListPersistent(
2015/03/23 14:52:18.053663 DEBUG WatchMethod: ListPersistent(
2015/03/23 14:52:22.039078 INFO connectivity check passed.
2015/03/23 14:52:22.039488 DEBUG connectivity: connection check says: true
2015/03/23 14:52:23.100247 DEBUG polld Done.
2015/03/23 14:52:28.106993 DEBUG requested wakeup at 2015-03-23 14:57:28 -0300 BRT
2015/03/23 14:52:28.131445 DEBUG cleared wakelock cookie 239b6ccb-
See that it took more than 2 minutes for it to release the lock, and the time it spent with polld was actually small (~7 seconds). The major issue seems to be from the time it takes the wakelock until it actually pokes polld.
| Ricardo Salveti (rsalveti) wrote : | #3 |
phablet@
{
"auth_helper": "/usr/lib/
"session_url": "https:/
"registrati
"connect_
"exchange_
"hosts_
"expect_
"addr": "https:/
"cert_
"stabilizin
"recheck_
"connectivi
"connectivi
"log_level": "debug",
"fallback_
"fallback_
"poll_
"poll_settle": "20ms",
"poll_
"poll_
"poll_
}
| Ricardo Salveti (rsalveti) wrote : | #4 |
With the latest available at lp:ubuntu-push/automatic:
2015/03/23 17:26:40.203525 poller.go:178: DEBUG got wakelock cookie of d3817bb1-
2015/03/23 17:26:50.709443 networkmanager.
2015/03/23 17:26:50.709972 connectivity.
2015/03/23 17:26:50.710324 session.go:283: DEBUG session.setState: Running -> Disconnected
2015/03/23 17:26:50.710548 session.go:586: DEBUG session aborting with error on read.
2015/03/23 17:26:50.710971 session.go:283: DEBUG session.setState: Disconnected -> Error
2015/03/23 17:26:50.714776 session.go:766: ERROR session error'ed out with read tcp 91.189.88.31:443: use of closed network connection
2015/03/23 17:28:58.394450 connectivity.
2015/03/23 17:29:01.228832 webchecker.go:87: INFO connectivity check passed.
2015/03/23 17:29:01.229208 connectivity.
2015/03/23 17:29:01.229336 connectivity.
2015/03/23 17:29:01.229526 session.go:433: DEBUG session autoredialier launching Redial goroutine
2015/03/23 17:29:01.229785 session.go:283: DEBUG session.setState: Error -> Disconnected
2015/03/23 17:29:01.229894 session.go:347: DEBUG adding authorization
2015/03/23 17:29:01.230263 client.go:257: DEBUG getting authorization for https:/
2015/03/23 17:29:01.993163 session.go:403: DEBUG trying to connect to: 91.189.88.31:443
2015/03/23 17:29:02.302225 session.go:283: DEBUG session.setState: Disconnected -> Connected
2015/03/23 17:29:03.562364 session.go:657: DEBUG connected 91.189.88.31:443.
2015/03/23 17:29:03.562637 session.go:283: DEBUG session.setState: Connected -> Started
2015/03/23 17:29:03.562870 session.go:283: DEBUG session.setState: Started -> Running
2015/03/23 17:29:03.563195 session.go:743: DEBUG connected after 1 attempts.
2015/03/23 17:29:04.395063 poller.go:178: DEBUG poking polld.
2015/03/23 17:29:04.399025 poller.go:178: DEBUG waiting for polld to signal Done.
2015/03/23 17:29:05.837456 endpoint.go:278: DEBUG WatchMethod: ListPersistent(
2015/03/23 17:29:06.998565 endpoint.go:278: DEBUG WatchMethod: ListPersistent(
2015/03/23 17:29:06.999717 poller.go:178: DEBUG polld Done.
2015/03/23 17:29:12.009293 poller.go:154: DEBUG requested wakeup at 2015-03-23 17:34:12 -0300 BRT
2015/03/23 17:29:12.037118 poller.go:178: DEBUG cleared wakelock cookie d3817bb1-
Almost 3 minutes with the log, in this case it seems because the ne...
| Ricardo Salveti (rsalveti) wrote : | #5 |
2015/03/23 17:34:13.189361 poller.go:170: DEBUG got woken up; time is 2015-03-23 17:34:13.189247951 -0300 BRT (𝛥: 1.189247951s)
2015/03/23 17:34:13.218912 poller.go:178: DEBUG got wakelock cookie of f1bfecae-
2015/03/23 17:38:02.538146 poller.go:178: ERROR not connected after 1m0s; giving up
2015/03/23 17:38:02.547617 poller.go:154: DEBUG requested wakeup at 2015-03-23 17:43:02 -0300 BRT
2015/03/23 17:38:02.574205 poller.go:178: DEBUG cleared wakelock cookie f1bfecae-
Network timeout is 1 minute, but it took almost 4.
| Changed in ubuntu-push: | |
| assignee: | nobody → Samuele Pedroni (pedronis) |
| Bret Barker (noise) wrote : | #6 |
http://
From that log it appears that
time.Sleep(
right after getting the wakelock cookie is taking anywhere from 40-89 seconds, but should be only 20ms.
| Launchpad Janitor (janitor) wrote : | #7 |
Status changed to 'Confirmed' because the bug affects multiple users.
| Changed in ubuntu-push (Ubuntu): | |
| status: | New → Confirmed |
| Changed in canonical-devices-system-image: | |
| status: | Confirmed → In Progress |
| Samuele Pedroni (pedronis) wrote : | #8 |
we merged to our dev branch a attached MP that should address some of the situations,
we still don't understand the issue in comment #6
| Changed in ubuntu-push: | |
| status: | New → In Progress |
| Samuele Pedroni (pedronis) wrote : | #9 |
<noise> pedronis: repro'd the stall
<noise> when i took my phone off USB it must have really slept
<noise> https:/
<noise> line 17
<noise> time.Sleep(
<noise> is the only thing that happens in-between there
<pedronis> but it took 4 minutes
<noise> yah
<noise> wtf right?
<pedronis> yes :/
<pedronis> noise: I suppose as first approximation we could try to reproduce with ubuntu-push-client running under strace -f -tt
an strace of the process with threads and timestamps could maybe help clarify what's happening or exclude some things
| Bret Barker (noise) wrote : | #10 |
Another instance:
http://
With associated strace:
http://
(lines 40-69 are the activity between the logging)
And here's where we get stuck:
[pid 14764] 17:34:40.030047 futex(0x6579e8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 14760] 17:35:24.991469 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
| Bret Barker (noise) wrote : | #11 |
| Ricardo Salveti (rsalveti) wrote : Re: [Bug 1435109] Re: Push-client frequently blocking phone suspend | #12 |
On Wed, Mar 25, 2015 at 1:29 PM, Samuele Pedroni
<email address hidden> wrote:
> <noise> pedronis: repro'd the stall
> <noise> when i took my phone off USB it must have really slept
Suspend shouldn't really interfere with this bug as once push gets the
sysstate cookie from powerd, that forces powerd to hold a wakelock
(blocking suspend).
If it would also be related to the suspend process, syslog would at
least show something (and from my original description the syslog is
showing the same symptom)
| Bret Barker (noise) wrote : | #13 |
pedronis made a c program to test the futex:
c code: https:/
arm binary: http://
fuxex_sleep output: https:/
push-client log: https:/
powerd log: https:/
in futex_sleep, big offender:
starting at 14:37:44, overdue at 14:48:11 (lines 48-49)
in powerd:
2212 Mar 27 14:37:52 ubuntu-phablet powerd[827]: Transition to SUSPEND complete
2227 Mar 27 14:42:52 ubuntu-phablet powerd[827]: Transition to ACTIVE complete
2244 Mar 27 14:42:57 ubuntu-phablet powerd[827]: Transition to SUSPEND complete
2259 Mar 27 14:47:57 ubuntu-phablet powerd[827]: Transition to ACTIVE complete
and then 14 secs before our futex came back.
| Bret Barker (noise) wrote : | #14 |
go uses relative timeouts with FUTEX_WAIT, which ignores time spent in suspended state. This causes timers after wake from suspend to be delayed. We are working around the issue by tricking go's runtime with a busy sleep loop to reduce the delay to a max 1s.
| Ricardo Salveti (rsalveti) wrote : | #15 |
This futex behavior is expected, as it counts the relative time and only the active time.
Did one test with a similar program using futex wait and comparing the delta between suspends, and you can find the output at: http://
For it to work the way you expected, you'd need to use FUTEX_WAIT_BITSET and an absolute timeout, as you can see from: http://
For me it's still a bit unclear why this is causing such issues with go (don't know much about go internals yet, and futex is actually behaving as expected), but feel free to ping me once you open a bug against upstream.
| Samuele Pedroni (pedronis) wrote : | #16 |
so the bug with 1.2 is because the futex uses the motonic clock but the go timer heap was based on realtime futures times,
this was changed in go in 1.3, now the timers heap is using monotic future timestamps,
this means that sleep/timers created after a suspend will work as expected, sleep/timers that cross a suspend will wait based on monotic time (which doesn't increase in the suspend) and won't be timely,
discussion of go changes is here:
https:/
in RTM/with 1.2 we need the workaround,
with 1.3 the ugliest big of the bug is gone, but we need to be careful if there are sleep/timers that cross a suspend that won't do what we expect
| Launchpad Janitor (janitor) wrote : | #17 |
This bug was fixed in the package ubuntu-push - 0.64.1+
---------------
ubuntu-push (0.64.1+
[ CI Train Bot ]
* New rebuild forced.
[ Samuele Pedroni (Canonical Services Ltd.) ]
* [Bret Barker] add a hacky busy sleep loop to workaround go's sleep
not accounting for suspended time (lp:1435109). (LP: #1435109,
#1437135)
-- CI Train Bot <email address hidden> Tue, 31 Mar 2015 21:52:22 +0000
| Changed in ubuntu-push (Ubuntu RTM): | |
| status: | New → Fix Released |
| Changed in canonical-devices-system-image: | |
| status: | In Progress → Fix Released |
| Changed in ubuntu-push: | |
| status: | In Progress → Fix Committed |
| Changed in ubuntu-push: | |
| status: | Fix Committed → Fix Released |
| Changed in ubuntu-push (Ubuntu): | |
| status: | Confirmed → Fix Released |


From IRC: ::reset_ alarm_clock( ) found 3 remaining wakeup requests
<rsalveti> Mar 22 23:06:34 ubuntu-phablet powerd[964]: void {anonymous}
<rsalveti> Mar 22 23:06:34 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-22 23:09:20 for ubuntu push client
<Chipaca> rsalveti: i'm afraid i don't understand what i'm supposed to be gleaning from the kernel logs :(
<Chipaca> rsalveti: i can tell you what the push client should be doing
<rsalveti> so ubuntu push client requested a hardware wakeup time to 2015-03-22 23:09:20, phone suspended fine, woke up at the exact time, then push client got a SysState lock and only released at 23:12:10
<Chipaca> rsalveti: and i _think_ it's doing what it should be doing :)
<Chipaca> rsalveti: push requests a wake up for 5m in the future
<Chipaca> rsalveti: when it wakes up, it asks for a wake lock
* loicm has quit (Ping timeout)
<Chipaca> rsalveti: it then waits up to 1 minute for the network to come up
<Chipaca> rsalveti: when the network comes up, it waits up to 3 minutes for polld to poll stuff
<Chipaca> rsalveti: once polld signals it's done, it asks for a wakeup in 5 minutes, and releases the wake lock
<rsalveti> right, then it is indeed the expected behavior
<Chipaca> phew :)
<rsalveti> but it's annoying that it blocks the suspend for so long
<Chipaca> rsalveti: all those timeouts are configurable
<rsalveti> can't we optimize this?
<Chipaca> rsalveti: you mean, changing those timeouts?
<Chipaca> rsalveti: sure
<rsalveti> Chipaca: when you wait for the network, do you wait the full 60 seconds or do you wait until you get the connection, if less than 60 seconds?
<Chipaca> rsalveti: the latter
<pedronis> the big chunk is polld I suppose
<Chipaca> pedronis: or maybe the network coming up
<rsalveti> network should return quite fast actually
<Chipaca> or all of the above
<rsalveti> 1 minute is a lot of time
<Chipaca> rsalveti: yes
<rsalveti> Chipaca: maybe polld is taking more time now because of the amount of users we have?
<Chipaca> rsalveti: so it's entirely possible there are bugs or unexpected interactions going on; debug logs for push, and the account-polld log, might shed more light.
<Chipaca> rsalveti: polld does not go via our servers
<Chipaca> it's straight to twitter/google
<rsalveti> enabled debug in push-client, how to enable debug for polld?
<Chipaca> rsalveti: do you have a google or twitter account enabled on the device you're checking?
<rsalveti> Chipaca: I got telegram, gmail and twitter
<rsalveti> and I joined the warthogs group in telegram
<Chipaca> rsalveti: gmail and twitter are polld; telegram uses push