Push-client frequently blocking phone suspend

Bug #1435109 reported by Ricardo Salveti
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Fix Released
Critical
John Lenton
Ubuntu Push Notifications
Fix Released
Undecided
Samuele Pedroni
ubuntu-push (Ubuntu)
Fix Released
Undecided
Unassigned
ubuntu-push (Ubuntu RTM)
Fix Released
Undecided
Unassigned

Bug Description

current build number: 256
device name: krillin
channel: ubuntu-touch/ubuntu-rtm/14.09-proposed
last update: 2015-03-22 22:01:56
version version: 256
version ubuntu: 20150312
version device: 20150310-3201c0a
version custom: 20150216-561-29-186

While debugging the suspend process with my krillin, I noticed that sometimes ubuntu-push-client holds a powerd sysstate for too long (minutes), blocking the suspend process entirely.

Follows the log I got from powerd:
Mar 22 23:09:21 ubuntu-phablet powerd[964]: handle_requestSysState from :1.92 (ubuntu push client) - ACTIVE (1)
Mar 22 23:09:21 ubuntu-phablet powerd[964]: name_watch_add: looking for :1.92
Mar 22 23:09:21 ubuntu-phablet powerd[964]: watching :1.92 to see when it disappears on dbus
Mar 22 23:09:21 ubuntu-phablet powerd[964]: libsuspend: acquire_wake_lock: powerd_power_request
Mar 22 23:09:21 ubuntu-phablet powerd[964]: handle_requestSysState - SUCCESS
Mar 22 23:09:21 ubuntu-phablet powerd[964]: Enqueue state change to ACTIVE
Mar 22 23:09:21 ubuntu-phablet powerd[964]: exiting suspend
Mar 22 23:09:21 ubuntu-phablet powerd[964]: libsuspend: calling exit_suspend
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544062]PM: suspend exit 2015-03-23 02:09:21.573597769 UTC
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544113]active wakeup source: MT662x
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.544129]active wakeup source: alarm
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.554088][Ker_PM][request_suspend_state]wakeup (3->0) at 1397540686404 (2015-03-23 02:09:21.583610000 UTC)
Mar 22 23:09:21 ubuntu-phablet powerd[964]: calling wait_for_next_alarm
Mar 22 23:09:21 ubuntu-phablet kernel: [ 1397.563938][WMT-CTRL][W]wmt_ctrl_rx:wmt_ctrl_rx be signaled, but no rx data(200)
Mar 22 23:09:22 ubuntu-phablet powerd[964]: libsuspend: exit_suspend succeeded
Mar 22 23:09:22 ubuntu-phablet powerd[964]: Emitting signal for transition to state ACTIVE (1)
Mar 22 23:09:22 ubuntu-phablet powerd[964]: Transition to ACTIVE complete
Mar 22 23:09:22 ubuntu-phablet powerd[964]: libsuspend: release_wake_lock: powerd_power_request
Mar 22 23:09:22 ubuntu-phablet powerd[964]: handle_requestWakeup from :1.70 - indicator-datetime, 1427119200
Mar 22 23:09:22 ubuntu-phablet powerd[964]: void {anonymous}::reset_alarm_clock() found 3 remaining wakeup requests
Mar 22 23:09:22 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-23 11:00:00 for indicator-datetime

Mar 22 23:12:10 ubuntu-phablet powerd[964]: handle_requestWakeup from :1.92 - ubuntu push client, 1427077030
Mar 22 23:12:10 ubuntu-phablet powerd[964]: void {anonymous}::reset_alarm_clock() found 4 remaining wakeup requests
Mar 22 23:12:10 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-22 23:17:10 for ubuntu push client
Mar 22 23:12:10 ubuntu-phablet powerd[964]: handle_clearSysState from :1.92, cookie: a840fa3a-fdcf-4178-8efe-d4d3aca46eb2
Mar 22 23:12:10 ubuntu-phablet powerd[964]: clear_sys_request: ubuntu push client - :1.92
Mar 22 23:12:10 ubuntu-phablet powerd[964]: name_watch_remove: looking for :1.92
Mar 22 23:12:10 ubuntu-phablet powerd[964]: name_watch: ref_count for :1.92 is now 0
Mar 22 23:12:10 ubuntu-phablet powerd[964]: no longer watching :1.92, there are no more requests
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: acquire_wake_lock: powerd_power_request
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Enqueue state change to SUSPEND
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: prepare_suspend.
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Emitting signal for transition to state SUSPEND (0)
Mar 22 23:12:10 ubuntu-phablet powerd[964]: entering suspend
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: calling enter_suspend
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: enter_suspend succeeded
Mar 22 23:12:10 ubuntu-phablet powerd[964]: Transition to SUSPEND complete
Mar 22 23:12:10 ubuntu-phablet powerd[964]: libsuspend: release_wake_lock: powerd_power_request

It took ubuntu-push-client 169 seconds to release the sysstate from powerd, and that probably only happened after powerd triggered the alarm that was scheduled by ubuntu-push-client.

Related branches

Changed in canonical-devices-system-image:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → ww13-ota
Changed in canonical-devices-system-image:
assignee: nobody → John Lenton (chipaca)
Revision history for this message
Ricardo Salveti (rsalveti) wrote :

From IRC:
<rsalveti> Mar 22 23:06:34 ubuntu-phablet powerd[964]: void {anonymous}::reset_alarm_clock() found 3 remaining wakeup requests
<rsalveti> Mar 22 23:06:34 ubuntu-phablet powerd[964]: setting hardware wakeup time to 2015-03-22 23:09:20 for ubuntu push client
<Chipaca> rsalveti: i'm afraid i don't understand what i'm supposed to be gleaning from the kernel logs :(
<Chipaca> rsalveti: i can tell you what the push client should be doing
<rsalveti> so ubuntu push client requested a hardware wakeup time to 2015-03-22 23:09:20, phone suspended fine, woke up at the exact time, then push client got a SysState lock and only released at 23:12:10
<Chipaca> rsalveti: and i _think_ it's doing what it should be doing :)
<Chipaca> rsalveti: push requests a wake up for 5m in the future
<Chipaca> rsalveti: when it wakes up, it asks for a wake lock
* loicm has quit (Ping timeout)
<Chipaca> rsalveti: it then waits up to 1 minute for the network to come up
<Chipaca> rsalveti: when the network comes up, it waits up to 3 minutes for polld to poll stuff
<Chipaca> rsalveti: once polld signals it's done, it asks for a wakeup in 5 minutes, and releases the wake lock
<rsalveti> right, then it is indeed the expected behavior
<Chipaca> phew :)
<rsalveti> but it's annoying that it blocks the suspend for so long
<Chipaca> rsalveti: all those timeouts are configurable
<rsalveti> can't we optimize this?
<Chipaca> rsalveti: you mean, changing those timeouts?
<Chipaca> rsalveti: sure
<rsalveti> Chipaca: when you wait for the network, do you wait the full 60 seconds or do you wait until you get the connection, if less than 60 seconds?
<Chipaca> rsalveti: the latter
<pedronis> the big chunk is polld I suppose
<Chipaca> pedronis: or maybe the network coming up
<rsalveti> network should return quite fast actually
<Chipaca> or all of the above
<rsalveti> 1 minute is a lot of time
<Chipaca> rsalveti: yes
<rsalveti> Chipaca: maybe polld is taking more time now because of the amount of users we have?
<Chipaca> rsalveti: so it's entirely possible there are bugs or unexpected interactions going on; debug logs for push, and the account-polld log, might shed more light.
<Chipaca> rsalveti: polld does not go via our servers
<Chipaca> it's straight to twitter/google
<rsalveti> enabled debug in push-client, how to enable debug for polld?
<Chipaca> rsalveti: do you have a google or twitter account enabled on the device you're checking?
<rsalveti> Chipaca: I got telegram, gmail and twitter
<rsalveti> and I joined the warthogs group in telegram
<Chipaca> rsalveti: gmail and twitter are polld; telegram uses push

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

From ./.cache/upstart/ubuntu-push-client.log (with debug):

2015/03/23 14:49:51.103334 DEBUG got woken up; time is 2015-03-23 14:49:51.103221076 -0300 BRT (𝛥: 103.221076ms)
2015/03/23 14:49:51.108670 DEBUG got wakelock cookie of 239b6ccb-ea1e-4617-94ff-7569149f8188
2015/03/23 14:52:16.501222 DEBUG connectivity: timer signal, state: ConnectedGlobal, checking...
2015/03/23 14:52:16.501368 DEBUG poking polld.
2015/03/23 14:52:16.507786 DEBUG waiting for polld to signal Done.
2015/03/23 14:52:17.326528 DEBUG WatchMethod: ListPersistent(/com/ubuntu/Postal/com_2eubuntu_2edeveloper_2ewebapps_2ewebapp_2dgmail, []interface {}{"com.ubuntu.developer.webapps.webapp-gmail_webapp-gmail"}, []interface {}{(*service.DBusService)(0x9575c2c0)}) success: []string{"[]"}
2015/03/23 14:52:18.053663 DEBUG WatchMethod: ListPersistent(/com/ubuntu/Postal/com_2eubuntu_2edeveloper_2ewebapps_2ewebapp_2dtwitter, []interface {}{"com.ubuntu.developer.webapps.webapp-twitter_webapp-twitter"}, []interface {}{(*service.DBusService)(0x9575c2c0)}) success: []string{"[]"}
2015/03/23 14:52:22.039078 INFO connectivity check passed.
2015/03/23 14:52:22.039488 DEBUG connectivity: connection check says: true
2015/03/23 14:52:23.100247 DEBUG polld Done.
2015/03/23 14:52:28.106993 DEBUG requested wakeup at 2015-03-23 14:57:28 -0300 BRT
2015/03/23 14:52:28.131445 DEBUG cleared wakelock cookie 239b6ccb-ea1e-4617-94ff-7569149f8188.

See that it took more than 2 minutes for it to release the lock, and the time it spent with polld was actually small (~7 seconds). The major issue seems to be from the time it takes the wakelock until it actually pokes polld.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

phablet@ubuntu-phablet:~$ cat /etc/xdg/ubuntu-push-client/config.json
{
    "auth_helper": "/usr/lib/ubuntu-push-client/signing-helper",
    "session_url": "https://push.ubuntu.com/",
    "registration_url": "https://push.ubuntu.com",
    "connect_timeout": "20s",
    "exchange_timeout": "30s",
    "hosts_cache_expiry": "12h",
    "expect_all_repaired": "40m",
    "addr": "https://push.ubuntu.com/delivery-hosts",
    "cert_pem_file": "",
    "stabilizing_timeout": "2s",
    "recheck_timeout": "10m",
    "connectivity_check_url": "http://start.ubuntu.com/connectivity-check.html",
    "connectivity_check_md5": "4589f42e1546aa47ca181e5d949d310b",
    "log_level": "debug",
    "fallback_vibration": {"pattern": [100, 100], "repeat": 2},
    "fallback_sound": "sounds/ubuntu/notifications/Slick.ogg",
    "poll_interval": "5m",
    "poll_settle": "20ms",
    "poll_net_wait": "1m",
    "poll_polld_wait": "3m",
    "poll_done_wait": "5s"
}

Revision history for this message
Ricardo Salveti (rsalveti) wrote :
Download full text (3.2 KiB)

With the latest available at lp:ubuntu-push/automatic:
2015/03/23 17:26:40.203525 poller.go:178: DEBUG got wakelock cookie of d3817bb1-b7ec-4204-846b-d1551d9ab487
2015/03/23 17:26:50.709443 networkmanager.go:145: DEBUG got primary connection: /org/freedesktop/NetworkManager/ActiveConnection/1
2015/03/23 17:26:50.709972 connectivity.go:164: DEBUG connectivity: PrimaryConnection changed. lastSent: true, sending 'disconnected'.
2015/03/23 17:26:50.710324 session.go:283: DEBUG session.setState: Running -> Disconnected
2015/03/23 17:26:50.710548 session.go:586: DEBUG session aborting with error on read.
2015/03/23 17:26:50.710971 session.go:283: DEBUG session.setState: Disconnected -> Error
2015/03/23 17:26:50.714776 session.go:766: ERROR session error'ed out with read tcp 91.189.88.31:443: use of closed network connection
2015/03/23 17:28:58.394450 connectivity.go:197: DEBUG connectivity: timer signal, state: ConnectedGlobal, checking...
2015/03/23 17:29:01.228832 webchecker.go:87: INFO connectivity check passed.
2015/03/23 17:29:01.229208 connectivity.go:157: DEBUG connectivity: connection check says: true
2015/03/23 17:29:01.229336 connectivity.go:157: DEBUG connectivity: connection check ok, lastSent: false, sending 'connected'.
2015/03/23 17:29:01.229526 session.go:433: DEBUG session autoredialier launching Redial goroutine
2015/03/23 17:29:01.229785 session.go:283: DEBUG session.setState: Error -> Disconnected
2015/03/23 17:29:01.229894 session.go:347: DEBUG adding authorization
2015/03/23 17:29:01.230263 client.go:257: DEBUG getting authorization for https://push.ubuntu.com/
2015/03/23 17:29:01.993163 session.go:403: DEBUG trying to connect to: 91.189.88.31:443
2015/03/23 17:29:02.302225 session.go:283: DEBUG session.setState: Disconnected -> Connected
2015/03/23 17:29:03.562364 session.go:657: DEBUG connected 91.189.88.31:443.
2015/03/23 17:29:03.562637 session.go:283: DEBUG session.setState: Connected -> Started
2015/03/23 17:29:03.562870 session.go:283: DEBUG session.setState: Started -> Running
2015/03/23 17:29:03.563195 session.go:743: DEBUG connected after 1 attempts.
2015/03/23 17:29:04.395063 poller.go:178: DEBUG poking polld.
2015/03/23 17:29:04.399025 poller.go:178: DEBUG waiting for polld to signal Done.
2015/03/23 17:29:05.837456 endpoint.go:278: DEBUG WatchMethod: ListPersistent(/com/ubuntu/Postal/com_2eubuntu_2edeveloper_2ewebapps_2ewebapp_2dgmail, []interface {}{"com.ubuntu.developer.webapps.webapp-gmail_webapp-gmail"}, []interface {}{(*service.DBusService)(0x956c2180)}) success: []string{"[]"}
2015/03/23 17:29:06.998565 endpoint.go:278: DEBUG WatchMethod: ListPersistent(/com/ubuntu/Postal/com_2eubuntu_2edeveloper_2ewebapps_2ewebapp_2dtwitter, []interface {}{"com.ubuntu.developer.webapps.webapp-twitter_webapp-twitter"}, []interface {}{(*service.DBusService)(0x956c2180)}) success: []string{"[]"}
2015/03/23 17:29:06.999717 poller.go:178: DEBUG polld Done.
2015/03/23 17:29:12.009293 poller.go:154: DEBUG requested wakeup at 2015-03-23 17:34:12 -0300 BRT
2015/03/23 17:29:12.037118 poller.go:178: DEBUG cleared wakelock cookie d3817bb1-b7ec-4204-846b-d1551d9ab487.

Almost 3 minutes with the log, in this case it seems because the ne...

Read more...

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

2015/03/23 17:34:13.189361 poller.go:170: DEBUG got woken up; time is 2015-03-23 17:34:13.189247951 -0300 BRT (𝛥: 1.189247951s)
2015/03/23 17:34:13.218912 poller.go:178: DEBUG got wakelock cookie of f1bfecae-767d-4d04-9ba2-f39cf6c70523
2015/03/23 17:38:02.538146 poller.go:178: ERROR not connected after 1m0s; giving up
2015/03/23 17:38:02.547617 poller.go:154: DEBUG requested wakeup at 2015-03-23 17:43:02 -0300 BRT
2015/03/23 17:38:02.574205 poller.go:178: DEBUG cleared wakelock cookie f1bfecae-767d-4d04-9ba2-f39cf6c70523.

Network timeout is 1 minute, but it took almost 4.

Changed in ubuntu-push:
assignee: nobody → Samuele Pedroni (pedronis)
Revision history for this message
Bret Barker (noise) wrote :

http://paste.ubuntu.com/10671598/

From that log it appears that
 time.Sleep(p.times.SessionStateSettle)
right after getting the wakelock cookie is taking anywhere from 40-89 seconds, but should be only 20ms.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ubuntu-push (Ubuntu):
status: New → Confirmed
Changed in canonical-devices-system-image:
status: Confirmed → In Progress
Revision history for this message
Samuele Pedroni (pedronis) wrote :

we merged to our dev branch a attached MP that should address some of the situations,

we still don't understand the issue in comment #6

Changed in ubuntu-push:
status: New → In Progress
Revision history for this message
Samuele Pedroni (pedronis) wrote :

<noise> pedronis: repro'd the stall
<noise> when i took my phone off USB it must have really slept
<noise> https://pastebin.canonical.com/128332/
<noise> line 17
<noise> time.Sleep(p.times.SessionStateSettle)
<noise> is the only thing that happens in-between there
<pedronis> but it took 4 minutes
<noise> yah
<noise> wtf right?
<pedronis> yes :/
<pedronis> noise: I suppose as first approximation we could try to reproduce with ubuntu-push-client running under strace -f -tt

an strace of the process with threads and timestamps could maybe help clarify what's happening or exclude some things

Revision history for this message
Bret Barker (noise) wrote :

Another instance:
http://pastebin.ubuntu.com/10679483/

With associated strace:
http://pastebin.ubuntu.com/10679501/
(lines 40-69 are the activity between the logging)

And here's where we get stuck:
[pid 14764] 17:34:40.030047 futex(0x6579e8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 14760] 17:35:24.991469 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Revision history for this message
Bret Barker (noise) wrote :
Revision history for this message
Ricardo Salveti (rsalveti) wrote : Re: [Bug 1435109] Re: Push-client frequently blocking phone suspend

On Wed, Mar 25, 2015 at 1:29 PM, Samuele Pedroni
<email address hidden> wrote:
> <noise> pedronis: repro'd the stall
> <noise> when i took my phone off USB it must have really slept

Suspend shouldn't really interfere with this bug as once push gets the
sysstate cookie from powerd, that forces powerd to hold a wakelock
(blocking suspend).

If it would also be related to the suspend process, syslog would at
least show something (and from my original description the syslog is
showing the same symptom)

Revision history for this message
Bret Barker (noise) wrote :

pedronis made a c program to test the futex:
  c code: https://pastebin.canonical.com/128528/
  arm binary: http://people.canonical.com/~pedronis/futex_sleep

fuxex_sleep output: https://pastebin.canonical.com/128538/
push-client log: https://pastebin.canonical.com/128539/
powerd log: https://pastebin.canonical.com/128540/

in futex_sleep, big offender:

starting at 14:37:44, overdue at 14:48:11 (lines 48-49)

in powerd:

2212 Mar 27 14:37:52 ubuntu-phablet powerd[827]: Transition to SUSPEND complete
2227 Mar 27 14:42:52 ubuntu-phablet powerd[827]: Transition to ACTIVE complete
2244 Mar 27 14:42:57 ubuntu-phablet powerd[827]: Transition to SUSPEND complete
2259 Mar 27 14:47:57 ubuntu-phablet powerd[827]: Transition to ACTIVE complete

and then 14 secs before our futex came back.

Revision history for this message
Bret Barker (noise) wrote :

go uses relative timeouts with FUTEX_WAIT, which ignores time spent in suspended state. This causes timers after wake from suspend to be delayed. We are working around the issue by tricking go's runtime with a busy sleep loop to reduce the delay to a max 1s.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

This futex behavior is expected, as it counts the relative time and only the active time.

Did one test with a similar program using futex wait and comparing the delta between suspends, and you can find the output at: http://paste.ubuntu.com/10708286/

For it to work the way you expected, you'd need to use FUTEX_WAIT_BITSET and an absolute timeout, as you can see from: http://paste.ubuntu.com/10708958/

For me it's still a bit unclear why this is causing such issues with go (don't know much about go internals yet, and futex is actually behaving as expected), but feel free to ping me once you open a bug against upstream.

Revision history for this message
Samuele Pedroni (pedronis) wrote :

so the bug with 1.2 is because the futex uses the motonic clock but the go timer heap was based on realtime futures times,

this was changed in go in 1.3, now the timers heap is using monotic future timestamps,

this means that sleep/timers created after a suspend will work as expected, sleep/timers that cross a suspend will wait based on monotic time (which doesn't increase in the suspend) and won't be timely,

discussion of go changes is here:

https://github.com/golang/go/issues/6007

in RTM/with 1.2 we need the workaround,

with 1.3 the ugliest big of the bug is gone, but we need to be careful if there are sleep/timers that cross a suspend that won't do what we expect

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ubuntu-push - 0.64.1+15.04.20150331.2~rtm-0ubuntu1

---------------
ubuntu-push (0.64.1+15.04.20150331.2~rtm-0ubuntu1) 14.09; urgency=medium

  [ CI Train Bot ]
  * New rebuild forced.

  [ Samuele Pedroni (Canonical Services Ltd.) ]
  * [Bret Barker] add a hacky busy sleep loop to workaround go's sleep
    not   accounting for suspended time (lp:1435109). (LP: #1435109,
    #1437135)
 -- CI Train Bot <email address hidden> Tue, 31 Mar 2015 21:52:22 +0000

Changed in ubuntu-push (Ubuntu RTM):
status: New → Fix Released
Changed in canonical-devices-system-image:
status: In Progress → Fix Released
Changed in ubuntu-push:
status: In Progress → Fix Committed
Changed in ubuntu-push:
status: Fix Committed → Fix Released
Changed in ubuntu-push (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.