when idle with screen blank, unity8 generates hundreds of context switches per second

Bug #1219871 reported by Colin Ian King
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
unity8 (Ubuntu)
Invalid
High
Michael Zanetti

Bug Description

To reproduce (on maguro, Samsung Galaxy Nexus):

sudo apt-get install sysstat

ensure idle has occurred and screen is now blank, ssh in and run

pidstat -w 10 1

Linux 3.0.0-3-maguro (ubuntu-phablet) 09/02/2013 _armv7l_ (2 CPU)

03:04:13 PM UID PID cswch/s nvcswch/s Command
03:04:23 PM 0 3 1.19 0.00 ksoftirqd/0
03:04:23 PM 0 9 1.29 0.00 ksoftirqd/1
03:04:23 PM 0 13 0.20 0.00 sync_supers
03:04:23 PM 0 20 21.99 0.00 kworker/0:1
03:04:23 PM 0 23 0.40 0.00 kinteractiveup
03:04:23 PM 0 69 1.29 0.00 kworker/1:1
03:04:23 PM 0 88 2.09 0.00 mmcqd/0
03:04:23 PM 0 96 0.30 0.00 dhd_watchdog
03:04:23 PM 0 97 8.46 0.00 dhd_dpc
03:04:23 PM 0 156 0.90 0.00 jbd2/mmcblk0p12
03:04:23 PM 103 418 0.50 0.00 dbus-daemon
03:04:23 PM 0 429 0.20 0.00 flush-179:0
03:04:23 PM 1000 493 70.05 0.00 surfaceflinger
03:04:23 PM 32011 613 0.20 0.00 init
03:04:23 PM 32011 658 0.10 0.00 dbus-daemon
03:04:23 PM 32011 713 467.86 0.10 unity8
03:04:23 PM 32011 788 0.10 0.00 upstart-dbus-br
03:04:23 PM 32011 790 0.10 0.00 upstart-dbus-br
03:04:23 PM 0 862 3.98 2.79 NetworkManager
03:04:23 PM 0 1179 0.10 0.00 wpa_supplicant
03:04:23 PM 32011 1219 0.10 0.40 indicator-netwo
03:04:23 PM 0 1237 0.20 0.00 ofonod
03:04:23 PM 32011 1335 0.10 0.00 mission-control
03:04:23 PM 32011 1339 0.40 0.00 telepathy-ofono
03:04:23 PM 0 1929 222.59 1.09 kworker/u:0
03:04:23 PM 0 2379 222.89 0.00 kworker/u:1
03:04:23 PM 32011 2944 0.10 2.19 pidstat

unity8 with PID 713 is performing ~450-470 context switches a second. But what is it doing?

sudo add-apt-repository ppa:colin-king/white && sudo apt-get update && sudo apt-get install health-check
sudo health-check -p 713 -d 10

CPU usage:
  PID Process USR% SYS% TOTAL%
   713 unity8 27.29 6.40 33.69 (medium load)

^^ that's kinda busy!

Wakeups:
 No wakeups detected

File I/O operations:
 No file I/O operations detected

System calls traced:
  PID Process Syscall Count Rate/Sec
   713 unity8 futex 3904 390.2976
   713 unity8 clock_gettime 1384 138.3637
   713 unity8 poll 791 79.0792
   713 unity8 read 789 78.8793
   713 unity8 recvmsg 4 0.3999
   713 unity8 write 4 0.3999
   713 unity8 restart_syscall 1 0.1000
 Total 6877 687.5196

Looks like it may be context switching like mad on a futex.

When the screen is not blank but device is idle, unity8 is far less busy.

tags: added: rls-s-incoming
no longer affects: touch-preview-images
Revision history for this message
Colin Ian King (colin-king) wrote :

Is anyone going to look at this bug? It is quite critical as it really sucks power badly when the screen is idle.

Changed in unity8 (Ubuntu):
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in unity8 (Ubuntu):
status: New → Confirmed
Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

I can reproduce this behaviour with build 62 on mako. Even if I don't have any really specific test case to reproduce, using the phone (taking photo, opening apps, ...) for a moment is enough to trigger this issue and it must be very common with a normal usage of the phone.

tags: added: avengers qa-manual-testing qa-touch
tags: added: qaucy
tags: added: saucy
removed: qaucy
Changed in unity8 (Ubuntu):
assignee: nobody → Michael Zanetti (mzanetti)
Revision history for this message
Michael Zanetti (mzanetti) wrote :

I'm having quite a hard time to reproduce. The worst case I could catch was unity8 being around 15 switches/s which seems to be related to dhd_dpc. Switching off WiFi makes that go away (We probably want to add something to detach unity8 from dhd_dpc somehow when the display is off) but that doesn't seem related to this bug.

Does this happen every time for you? Or only after using a specific app, or resource?

One interesting thing in your trace is that surfaceflinger is up to 70 switches too while its mostly below 2 in my case. Indicates that it might have something to do with rendering. I tried firing up lots of apps, taking pictures etc, but never reached those values. There must be something that triggers this, some specific app running, some use case on the dash or similar. Also would be interesting to see if this still happens with Mir.

Can you please try to provide more detailed steps on how to reproduce? Thanks.

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Michael,

I've tried hard to reproduce this on today's image with the latest updates and can no longer reproduce this. I guess something must have got fixed between 2nd September and now. With lots of apps open and various attempts to make this fail, I get a very low CPU load and just a few context switches a second, and no crazy high futex() system call counts per second.

CPU usage:
  PID Process USR% SYS% TOTAL% Duration
   911 unity8 0.01 0.00 0.01 73.50 (very light load)

Wakeups:
  PID Process Wake/Sec Kernel Functions
   911 unity8 0.01 (hrtimer_start_range_ns, hrtimer_wakeup) (very low)

Context Switches:
  PID Process Voluntary Involuntary Total
                             Ctxt Sw/Sec Ctxt Sw/Sec Ctxt Sw/Sec
   911 unity8 0.34 0.00 0.34 (very low)
  1224 unity8 0.16 0.03 0.19 (very low)
  1220 unity8 0.18 0.00 0.18 (very low)
  2113 unity8 0.16 0.00 0.16 (very low)
  1520 unity8 0.16 0.00 0.16 (very low)
  1466 unity8 0.16 0.00 0.16 (very low)
  1465 unity8 0.16 0.00 0.16 (very low)
  1226 unity8 0.16 0.00 0.16 (very low)
  1219 unity8 0.16 0.00 0.16 (very low)
  1218 unity8 0.16 0.00 0.16 (very low)
  1198 unity8 0.16 0.00 0.16 (very low)
  1197 unity8 0.16 0.00 0.16 (very low)
  1192 unity8 0.16 0.00 0.16 (very low)
  1169 unity8 0.16 0.00 0.16 (very low)
  1168 unity8 0.16 0.00 0.16 (very low)
  1156 unity8 0.16 0.00 0.16 (very low)
 Total 2.80 0.03 2.83

I guess this is now fixed.

Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

I don't think it is fixed, this is what I get with image 20130925.2 but still no reproducer. I took 2 photos, switched to Gallery then to the Application Lens and waited until the screen goes black.

pidstat:
09:12:45 UID PID cswch/s nvcswch/s Command
...
09:12:56 0 606 120.99 0.10 init
09:12:56 103 643 1.39 0.10 dbus-daemon
09:12:56 1000 680 95.54 0.00 surfaceflinger
...
09:12:56 32011 919 0.30 0.00 init
...
09:12:56 32011 1079 483.86 0.00 unity8
...
09:12:56 0 5099 323.56 0.00 kworker/u:0
09:12:56 0 5894 173.27 0.00 kworker/u:1
...
09:12:56 0 6939 19.11 0.79 kworker/0:1
...

health-check:
CPU usage:
  PID Process USR% SYS% TOTAL% Duration
  1079 unity8 36.59 16.00 52.59 10.00 (high load)

Revision history for this message
Colin Ian King (colin-king) wrote :

Following Jean-Baptiste's notes above, I can reproduce it by taking 2 photos and then switching to the Gallery and leaving it to blank the screen. Immediately it blanks the context switch rate jumped from ~230 cs/s to ~1200 cs/s.

Context Switches:
  PID Process Voluntary Involuntary Total
                             Ctxt Sw/Sec Ctxt Sw/Sec Ctxt Sw/Sec
  1267 unity8 326.05 71.53 397.58 (quite high)
   971 unity8 315.04 0.18 315.22 (quite high)
  1235 unity8 1.77 0.00 1.77 (low)

System calls traced:
  PID Process Syscall Count Rate/Sec
   971 unity8 futex 9006 146.4693
  1267 unity8 futex 7162 116.4793
  1267 unity8 ioctl 6902 112.2508

Revision history for this message
Michael Zanetti (mzanetti) wrote :

hmm... I can do whatever I want... not happening for me. I'll keep on trying.

Revision history for this message
Colin Ian King (colin-king) wrote :

Michael, I reproduced this from a clean boot, so maybe that's worth pursuing to see if you can reproduce it.

Revision history for this message
Michael Zanetti (mzanetti) wrote :

I have seen this issue over the weekend while dogfooding the phone. Still no reliable way to reproduce for me, but some more observations:

* Confirming the futex observations. strace prints WAKE_ON_FUTEX messages repeatedly while in this state.

* Attaching gdb showed that there is one thread called "dconf worker" with a stack depth of over 2000. Probably some corrupt stack but I think worth investigating if the dconf integration might be the cause for this.

One difference I noticed is that once in this state, waking up the screen does not fix it any more. Only restarting unity8 gets it quiet again.

Revision history for this message
Michael Zanetti (mzanetti) wrote :

Confirming that every time I end up in this state, the dconf worker thread stack looks broken while it looks perfectly sane when not in this state.

Revision history for this message
Michael Zanetti (mzanetti) wrote :

Some more findings:

This is the strace snipped that gets repeated all over:

read(3, "\2\0\0\0\0\0\0\0", 16) = 8
futex(0x194b7ac, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x194b7a8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x194b790, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x194b5e4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x194b7ac, FUTEX_WAIT_PRIVATE, 1280555, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x194b790, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {9601, 188387021}) = 0
clock_gettime(CLOCK_MONOTONIC, {9601, 189027949}) = 0
poll([{fd=3, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=36, events=POLLIN}, {fd=39, events=POLLIN}, {fd=48, events=POLLIN}, {fd=49, events=POLLIN}], 7, -1) = 1 ([{fd=3, revents=POLLIN}])

One of the threads when attaching gdb:

Thread 15 (Thread 0x40f15460 (LWP 2067)):
#0 0x418a8fe4 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1 0x418a6432 in do_futex_wait () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2 0x418a64a6 in sem_wait@@GLIBC_2.4 () from /lib/arm-linux-gnueabihf/libpthread.so.0
#3 0x413b7ab6 in v8::internal::LinuxSemaphore::Wait (this=0x44b00468) at ../3rdparty/v8/src/platform-linux.cc:857
#4 0x4133f292 in v8::internal::RuntimeProfiler::WaitForSomeIsolateToEnterJS () at ../3rdparty/v8/src/runtime-profiler.cc:420
#5 0x4133f388 in v8::internal::RuntimeProfilerRateLimiter::SuspendIfNecessary (this=this@entry=0x196a2d4) at ../3rdparty/v8/src/runtime-profiler.cc:467
#6 0x413b8928 in v8::internal::SignalSender::Run (this=0x196a2b0) at ../3rdparty/v8/src/platform-linux.cc:1136
#7 0x413b7a98 in v8::internal::ThreadEntry (arg=0x196a2b0) at ../3rdparty/v8/src/platform-linux.cc:733
#8 0x418a1e64 in start_thread () from /lib/arm-linux-gnueabihf/libpthread.so.0
#9 0x410257c8 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#10 0x410257c8 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

This thread seems to do the futex_waits. According to https://groups.google.com/forum/#!topic/nodejs/XPO7aVOVrds this seems to happen when v8 starts garbage collecting. Other applications using v8 seem to show the exact same behavior. Same strace output and same CPU hogging behavior.

And I still can't reproduce this. While it happens every once in a while I still can't trigger this intentionally.

(After installing tons of dbgsym packages I don't see the corrupt stack trace from the "dconf worker" any more. So the previous comments might be wrong information)

Revision history for this message
Michael Zanetti (mzanetti) wrote :

As Qt switched from v8 to v4 (replacing the code that triggered this) and we haven't seen this in a long time, we can close this now.

Changed in unity8 (Ubuntu):
status: Confirmed → Incomplete
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.