Comment 9 for bug 233734

Krzysztof Adamski (kadamski) wrote : Re: 100% cpu usage with nvidia-drivers

I have a similar issue with the latest Moovida (from bzr) and the latest Pigment (from svn). I did some research on it, however, and I'm not sure if it's actually a Moovida/Pigment BUG. My setup is:
- Fedora 11
- Moovida 1.0.3 (from git)
- Pigment ?.?.? (from svn)
- Mesa
- Xorg
- intel-driver
- Twisted
- GTK

First of all, I checked whether it's a Pigment bug by running the python-pigment examples. They run fine, so it's not. The second thing I did was check what Moovida is actually doing with strace, and I found this (repeating again and again):

15519 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=8, events=POLLIN}, {fd=14, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=32, events=POLLIN}, {fd=25, events=POLLIN}, {fd=29, events=POLLIN}, {fd=34, events=POLLIN}, {fd=36, events=POLLIN}, {fd=12, events=POLLIN}, {fd=3, events=POLLIN}, {fd=38, events=POLLIN}], 16, 25) = 1 ([{fd=8, revents=POLLIN}])
15519 gettimeofday({1246013664, 941148}, NULL) = 0
15519 read(11, 0x9606d98, 4096) = -1 EAGAIN (Resource temporarily unavailable)
15519 gettimeofday({1246013664, 941223}, NULL) = 0

15519 poll([{fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=11, events=POLLIN}, {fd=8, events=POLLIN}, {fd=14, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=32, events=POLLIN}, {fd=25, events=POLLIN}, {fd=29, events=POLLIN}, {fd=34, events=POLLIN}, {fd=36, events=POLLIN}, {fd=12, events=POLLIN}, {fd=3, events=POLLIN}, {fd=38, events=POLLIN}], 16, 25) = 1 ([{fd=8, revents=POLLIN}])
15519 gettimeofday({1246013664, 941401}, NULL) = 0
15519 read(11, 0x9606d98, 4096) = -1 EAGAIN (Resource temporarily unavailable)
15519 gettimeofday({1246013664, 941569}, NULL) = 0

Note a couple of things:
- it looks like a poll loop running all the time, consuming 100% CPU
- poll is returning 1 FD ready (fd=8)
- this fd is NEVER READ (?!)
- after this, fd=11 is read and the next poll happens immediately (check the gettimeofday values)
- the next poll returns immediately with fd=8 since it wasn't read before; this continues forever and eats all the CPU (see the sketch after this list)
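
To make the failure mode concrete, here is a minimal sketch (my own illustration, not Moovida code) showing that a pipe with one unread byte makes poll() return immediately no matter what timeout you pass, so a loop that never drains the fd spins:

    import os, select

    # write one byte into a pipe and never read it back
    r, w = os.pipe()
    os.write(w, "\0")

    p = select.poll()
    p.register(r, select.POLLIN)

    for i in range(3):
        # despite the 25 ms timeout, this returns [(r, POLLIN)]
        # instantly every time -- the same spinning as in the strace
        print p.poll(25)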

So I did some more research to find out what those FDs are:
$ ls -l /proc/15519/fd/8 /proc/15519/fd/11
total 0
lrwx------. 1 k k 64 06-26 13:02 11 -> socket:[244343]
lr-x------. 1 k k 64 06-26 13:02 8 -> pipe:[244347]

and from the trace file:

15519 execve("/usr/bin/python", ["python", "moovida/elisa-core/bin/moovida", "-c", "moovida.conf"], [/* 50 vars */]) = 0
[...]
15519 socket(PF_FILE, SOCK_STREAM, 0) = 11
15519 connect(11, {sa_family=AF_FILE, path=@"/tmp/.X11-unix/X0"...}, 20) = 0
15519 getpeername(11, {sa_family=AF_FILE, path=@"/tmp/.X11-unix/X0"...}, [20]) = 0
[...]
15519 fcntl64(11, F_GETFL) = 0x2 (flags O_RDWR)
15519 fcntl64(11, F_SETFL, O_RDWR|O_NONBLOCK) = 0
15519 fcntl64(11, F_SETFD, FD_CLOEXEC) = 0
[...]
15519 rt_sigaction(SIGCHLD, {0x2c51a90, [], 0}, {SIG_DFL, [], 0}, 8) = 0
[...]
15519 pipe([8, 9]) = 0
[...]
15519 pipe([16, 17]) = 0
15519 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb8004728) = 15523
15519 close(17) = 0
15523 close(16 <unfinished ...>
15519 read(16, <unfinished ...>
(15523 is a short-running process making a lot of stat calls on GStreamer plugin files)
[...]
15523 write(17, "\1\0\0\0"..., 4) = 4
15523 close(17) = 0
15523 exit_group(0) = ?
15519 <... read resumed> "\1\0\0\0"..., 4) = 4
15519 --- SIGCHLD (Child exited) @ 0 (0) ---
15519 write(9, "\0"..., 1) = 1
15519 sigreturn() = ? (mask now [])
15519 close(16) = 0
15519 waitpid(15523, NULL, 0) = 15523
(no more writes to 9)
(no READS from 8!)

Here's what we can conclude from this:
- fd=11 is the X11 socket, which is set to O_NONBLOCK; this is why it is read (returning -1) after each poll (note this shouldn't be necessary, since fd=11 is in the poll set so we know when to read it, but it doesn't do much harm)
- next, we can see that a SIGCHLD handler is registered
- fd=8 is one end of a pipe (fd=9 is the other end)
- there is a clone syscall creating a new process which scans the GStreamer plugins (such a process is created when "import gst" is called)
- when that process finishes, SIGCHLD is sent (of course), and we can see that before sigreturn (which ends the signal handler) there is a write to fd=9 (the other end of the pipe whose read end keeps showing up in our polls)
- fd=8 is NEVER read until the end of the program, and this is why our poll returns immediately, wasting all the CPU (a sketch of this pattern follows)
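
For reference, this is the classic "self-pipe trick": a signal handler may only do async-signal-safe work, so it just writes one byte to a pipe, and the event loop is expected to drain the read end after poll() reports it. A minimal sketch of the pattern (again my own illustration, not code from Twisted or GTK):

    import os, signal

    read_end, write_end = os.pipe()   # read_end plays the role of fd=8

    def on_sigchld(signum, frame):
        # async-signal-safe part: just wake up the event loop
        os.write(write_end, "\0")

    signal.signal(signal.SIGCHLD, on_sigchld)

    # a correct event loop must drain the pipe once poll() reports it:
    #     os.read(read_end, 4096)
    # in our trace this read never happens, so poll() fires forever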

OK, so we know WHY this is happening, but I wanted to know WHERE the bug actually is. I spent quite a lot of time finding out two facts:
- the poll syscall comes from the Twisted reactor
- the same is true for the signal handler

So it looked like a Twisted bug... but is it? I tried to find that out too, so first I disabled the SIGCHLD handler registration in the Twisted code:
   from /usr/lib/python2.6/site-packages/twisted/internet/base.py:
     if platformType == 'posix':
         signal.signal(signal.SIGCHLD, self._handleSigchld)

When these two lines were disabled, the CPU was idle... so it fixes the bug. ALMOST. It's more a workaround than a fix, so I kept digging, and I was very surprised by the results:
- I found out that the Twisted handler is not called when the "import gst" process terminates... so where is this "write" coming from then? I have no idea :/
- since I couldn't find where the "write" is coming from, I started looking for the "pipe()" call that creates fd=8 and fd=9... and I found that it comes from GTK. It happens somewhere at the beginning of gobject.MainLoop().run(), called from the run() method of the Gtk2Reactor class in Twisted
- Gtk2Reactor is generating the subsequent poll() calls which consume 100% CPU (a minimal reproduction sketch follows)
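
If anyone wants to try reproducing this outside Moovida, here is my guess at the minimal ingredients (an assumption on my part, and since the bug only shows up on SOME systems it may well stay idle on yours): a Python-level SIGCHLD handler installed next to a gobject main loop, plus a short-lived child process:

    import os, signal, gobject

    def handler(signum, frame):
        pass    # stand-in for Twisted's _handleSigchld

    signal.signal(signal.SIGCHLD, handler)

    pid = os.fork()
    if pid == 0:
        os._exit(0)     # child exits immediately, delivering SIGCHLD

    loop = gobject.MainLoop()
    gobject.timeout_add(2000, loop.quit)
    loop.run()          # watch CPU usage while this runs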

CONCLUSION:
1. The source of the BUG seems to be somewhere between three components:
  - Moovida
  - Twisted
  - GTK2
2. The python-pigment "timing" example runs fine, without any problems, and it's very similar in architecture to what Moovida does.
3. The bug happens only on SOME (not all) systems.

Since I'm not sure where this BUG is coming from, and I have no experience with Twisted, which seems to be heavily involved here, I'm hoping you will be able to give me some clues about what is wrong. Is Moovida using Twisted in some strange way, or is this bug completely outside of Moovida?

I'll continue my research tomorrow and update you if I get any results. Thanks in advance for any suggestions. I can provide the trace file if anyone is interested, but it is huge (57 MB) so I won't attach it here.