gstreamer fails intermittently

Bug #1466273 reported by Michi Henning on 2015-06-18
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
gstreamer1.0 (Ubuntu)
Critical
Unassigned
thumbnailer (Ubuntu)
Undecided
Unassigned

Bug Description

gstreamer occasionally fails to extract a thumbnail from a video file. It's clearly a race condition because, almost all of the time, it succeeds. To reproduce:

Check out the thumbnailer/devel branch at r217 and build it. Then:

cd build/src/vs-thumb

./vs-thumb fd://0 ./x.jpg <../../../tests/media/testvideo.ogg

Point a browser at the generated x.jpg file and check that the image was extracted correctly.

Now run

while ./vs-thumb fd://0 ./x.jpg <../../../tests/media/testvideo.ogg ; do :; done

Within maybe a hundred iterations or so, I get:

Error creating thumbnail: ThumbnailExtractor: change_state(): reading async messages: GStreamer error: negotiation problem.

Changed in thumbnailer:
importance: Undecided → High
affects: gstreamer (Ubuntu) → gstreamer0.10 (Ubuntu)
Changed in gstreamer0.10 (Ubuntu):
importance: Undecided → High
Michi Henning (michihenning) wrote :

I'm also seeing this on my Nexus 4. Except here, gstreamer hangs and makes the process unkillable.

Here is a bit of trace from our code:

thumbnailer-service: [06:53:02.902] timeout, state = 2
thumbnailer-service: [06:53:02.902] killing process 12806
thumbnailer-service: [06:53:02.902] calling waitpid for process 12806
thumbnailer-service: [06:53:31.900] waitpid for process 12806 returned

This happens when a timer fires. If gstreamer doesn't complete within 10 seconds, we kill the process. (Each video thumbnail is extracted in a separate process, so we can recover from gstreamer crashes and hangs.)

However, take a look at the time stamps. We sent a SIGKILL at 53.02 and immediately called waitpid(). But waitpid() does not return until 29 seconds later. This strongly suggests a kernel problem. (At least as far as I know, the only way for SIGKILL to not kill a process is if the process is asleep in the kernel at uninterruptible priority.)

Changed in gstreamer0.10 (Ubuntu):
importance: High → Critical
no longer affects: gstreamer0.10 (Ubuntu)
Changed in gstreamer1.0 (Ubuntu):
importance: Undecided → Critical
Changed in gstreamer (Ubuntu):
importance: Undecided → Critical
Michi Henning (michihenning) wrote :

In an attempt to diagnose this more, I ran the vs-thumb sub-processes with strace and still managed to reproduce the problem.

Here is the code snippet that runs when we don't hear back from a sub-process within ten seconds. (Please forgive the filthy code; this is the result of endless hacking around trying to narrow down the problem.)

    qDebug() << "timeout, state =" << process_.state();
    if (process_.state() != QProcess::NotRunning)
    {
        qDebug() << "killing process" << process_.pid();
        int rc = kill(process_.processId(), SIGKILL);
        assert(rc == 0);
        //process_.kill();
    }
    int status;
    qDebug() << "calling waitpid for process" << process_.pid();
    int rc = waitpid(process_.pid(), &status, 0);
    qDebug() << "waitpid for process" << process_.pid() << "returned";
    qDebug() << "waitpid() returned" << rc;
    error_ = exe_path_.toStdString() + " (pid " + to_string(process_.pid()) + ") did not return after " +
             to_string(timeout_ms_) + " milliseconds";

So, we call kill(), immediately followed by waitpid(). Here is the trace I see when the problem happens:

thumbnailer-service: [02:28:39.385] timeout, state = 2
thumbnailer-service: [02:28:39.385] killing process 16965
thumbnailer-service: [02:28:39.386] calling waitpid for process 16965
Changing to state PAUSED
Changing to state PAUSED
thumbnailer-service: [02:29:09.395] waitpid for process 16965 returned
thumbnailer-service: [02:29:09.396] waitpid() returned 16965

(Ignore the two "Changing state" messages; these are from other vs-thumb processes that are still running.)

Note the time stamps: we sent the kill at 28:39, but waitpid doesn't return until 30 seconds later.

While things are hanging like this, the process sits as a zombie in the process table.

I've attached the strace log. The last line shows that the process was killed (also 30 seconds late), at the time waitpid() returned.

Michi Henning (michihenning) wrote :

One other item of info: originally, we were using QProcess::kill() and QProcess::waitForFinished(). In that case, the hang happens inside waitForFinished(). Attaching with gdb showed that, ultimately, waitForFinished() ended up blocked in select().

James Henstridge (jamesh) wrote :

gbAnd ultimately the select() call in waitForFinished() was blocking on a pipe that would be written to when the SIGCHLD signal handler was called and waitid() returned the subprocess's pid. So this is functionally the same as the direct waitpid() call.

Michi Henning (michihenning) wrote :
Download full text (3.2 KiB)

Here is a reproducible test case.

$ system-image-cli -i
current build number: 229
device name: mako
channel: ubuntu-touch/devel-proposed/ubuntu
last update: 2015-06-18 07:16:41
version version: 229
version ubuntu: 20150618
version device: 20150529.1
version custom: 20150618

The code was built from this branch at revision 238:

https://code.launchpad.net/~michihenning/thumbnailer/no-download-for-failure-cache-hit

The debian packages are:

http://s-jenkins.ubuntu-ci:8080/job/thumbnailer-devel-wily-armhf-ci/lastSuccessfulBuild/artifact/work/output/thumbnailer-common_1.4bzr238pkg0wily145-0ubuntu1_all.deb
http://s-jenkins.ubuntu-ci:8080/job/thumbnailer-devel-wily-armhf-ci/lastSuccessfulBuild/artifact/work/output/thumbnailer-service_1.4bzr238pkg0wily145-0ubuntu1_armhf.deb
http://s-jenkins.ubuntu-ci:8080/job/thumbnailer-devel-wily-armhf-ci/lastSuccessfulBuild/artifact/work/output/qtdeclarative5-ubuntu-thumbnailer0.1_1.4bzr238pkg0wily145-0ubuntu1_armhf.deb

To reproduce the problem, install the debs and untar the attached archive into an empty directory somewhere.

Run top in another terminal.

Now, from within the video directory, run the following script:

#!/bin/sh
for i in video*.mp4
do
    /usr/lib/arm-linux-gnueabihf/thumbnailer/vs-thumb fd://0 $i.jpg <$i &
done
wait

This fires as many vs-thumb instances concurrently as there are video files (see below).

It's possible that the script will terminate without error. (The fewer videos there are, the more likely it is to succeed.) If it succeeds, just try again.

Eventually, the script will not terminate, and top will show one or more hung vs-thumb processes that are consuming 100% CPU each.

Here comes the good bit. (You may have to try this several times but, eventually, you'll see the problem.)

From another terminal, run

kill -9 <hung-vs-thumb-pid>

Quite often, it'll kill the process as it should. But, every now and then, it doesn't. Instead, the hung vs-thumb keeps hanging around, still consuming 100% CPU until, pretty much exactly 30 seconds later, it exits.

So, the kill -9 doesn't kill until 30 seconds later in some cases.

While experimenting with all this, you will most likely also see this:

Error creating thumbnail: ThumbnailExtractor: change_state(): reading async messages: GStreamer error: negotiation problem.

This error is unrelated and can be reproduced on the desktop by just running a single vs-thumb at a time in a loop. On my Core i7, I get this once every two-hundred iterations or so.

You will probably also see

(vs-thumb:26741): GLib-ERROR **: /build/buildd/glib2.0-2.45.1/./glib/gmem.c:357: overflow allocating 4294967274*8 bytes

That happens when there are too many vs-thumb processes. It doesn't seem to be happening with only four concurrent gstreamer processes. You can show this by moving two of the videos out of the directory. However, the allocation problem is *NOT* related to the signal issue. The signal issue can be reproduced even when this allocation failure doesn't occur, with only four concurrent vs-thumb processes.

The source code for vs-thumb is in the thumbnailer project, in src/vs-thumb. There isn't a lot of code there (around 400 lines), so it's quite eas...

Read more...

Michi Henning (michihenning) wrote :

One other important piece of data...

While I was tinkering with this, I had a test failure on Jenkins in the Arm build. I was watching the console log at the time and noticed that compilation was unusually slow. Presumably, the build machine was loaded more than usual. Then I got the following test failure. Please read through all of what is below, BECAUSE IT'S IMPORTANT!

1: [ RUN ] ThumbnailerTest.slow_vs_thumb
1: thumbnailer-service: [04:03:09.103] Critical: The schema com.canonical.Unity.Thumbnailer is missing
1: thumbnailer-service: [04:03:09.103] Critical: The schema com.canonical.Unity.Thumbnailer is missing
1: thumbnailer-service: [04:03:19.886] timeout, state = 2
1: thumbnailer-service: [04:03:19.886] killing process 11654
1: thumbnailer-service: [04:03:19.886] calling waitpid for process 11654
1: thumbnailer-service: [04:03:19.887] waitpid for process 11654 returned 11654
1: /tmp/buildd/thumbnailer-1.4bzr237pkg0wily144/tests/slow-vs-thumb/slow-vs-thumb_test.cpp:41: Failure
1: Value of: spy.wait(15000)
1: Actual: false
1: Expected: true
1: thumbnailer-service: [04:03:23.882] Warning: QProcess: Destroyed while process ("/tmp/buildd/thumbnailer-1.4bzr237pkg0wily144/tests/slow-vs-thumb/slow/vs-thumb") is still running.
1: [ FAILED ] ThumbnailerTest.slow_vs_thumb (44810 ms)
1: [----------] 1 test from ThumbnailerTest (44810 ms total)

Note the time stamps: This time, waitpid returned straight away. However, the test did not finish for another 30 seconds. Reason: The QProcess destructor also waits for completion of the child, with a 30-second timeout. The QProcess destructor thinks the process is still running and doesn't get notification of the child death either and, after 30 seconds, times out.

However, THERE IS NO GSTREAMER AT ALL IN THIS TEST!

The test uses a fake vs-thumb. Instead, vs-thumb is a shell script: exec /bin/sleep 10

We are using the fake vs-thumb just so we can test that our timeout code works as intended.

This suggests that the SIGKILL problem is not caused by gstreamer per se, but might be load related. Possibly, we are seeing this because all the vs-thumb processes are putting heavy CPU load on the system, possibly interfering with signal delivery somehow?

I've attached the build log. You can find the slow_vs_thumb test in test/slow-vs-thumb. (It's currently disabled. Use revision 237 to get the enabled test.)

James Henstridge (jamesh) wrote :

I gave michi's branch and test videos a try. For the six videos being thumbnailed in parallel, I see two successful thumbnails, two instances that look like they died due to invalid memory allocations:

(vs-thumb:15299): GLib-ERROR **: /build/buildd/glib2.0-2.45.1/./glib/gmem.c:357: overflow allocating 4294967274*8 bytes

(vs-thumb:15298): GLib-ERROR **: /build/buildd/glib2.0-2.45.1/./glib/gmem.c:357: overflow allocating 4294967274*8 bytes

And two that have stuck around eating a little more than one CPU each. I was able to kill -9 them though.

no longer affects: gstreamer (Ubuntu)
Colin Ian King (colin-king) wrote :

It may be that the process is swapped out, so the delivery of the SIGKILL takes a while for it to be swapped back in and to hence get the signal.

To test this hypothesis:

a) one could disable swap and see if the process can be delivered the SIGKILL and how quickly it responds to that. However, turning swap off may cause the OOM killer to change the way things behave and hence is not a viable test pattern.

b) run smemstat (from ppa:colin-king/white) and see how much memory of the given blocked process is swapped out compared to the memory resident. If it is mostly swapped out, it could indicate why it is taken a while to get swapped back in and to respond to the SIGKILL.

c) one could add a mlockall() to the slow responding process or even mlock() on the range of text pages that handle the signal and do the ppoll so it's not swapped out. Note that one needs to allow the process to have the CAP_IPC_LOCK capability or one could tweak the RLIMIT_MEMLOCK soft limit using ulimit -l

Michi Henning (michihenning) wrote :

Here is what smemstat reports for one of the spinning processes:

25375 0.0 B 11.0 M 12.5 M 19.1 M phablet /usr/lib/arm-linux-gnueabihf/thumbnailer/vs-thumb

That's while the process is eating around 110% CPU. So, swap doesn't seem to be the issue.

After running kill -9 25375, top continues to show the process as running, still using 110% CPU, and the wait in the script doesn't return. However, running smemstat again after sending the SIGKILL no longer produces any output. It looks like smemstat thinks the process isn't there anymore (even though top continues to show it).

Then, 30 seconds after having sent the kill, the process finally terminates.

Michi Henning (michihenning) wrote :

As an experiment, I added a separate thread to vs-thumb that runs a timer. When the timer fires (after 10 seconds), the thread calls _exit().

When I have one of these stuck vs-thumb processes that is spinning, I see the timer firing. However, _exit() does not terminate the process immediately. I'm seeing the same 30-second delay that I get after SIGKILL.

Note that this doesn't happen all the time. Sometimes, the _exit() call works (just as sometimes, the SIGKILL works). But, around 20% of the time, I get this 30-second delay before the process finally exits.

Michi Henning (michihenning) wrote :

This bug is rather serious: we have unkillable processes on mako. It's been over a month since I posted the reproducible test case above; is no-one interested in fixing this?

Changed in thumbnailer:
status: New → Triaged
Changed in thumbnailer (Ubuntu):
status: New → Triaged
importance: Undecided → High
no longer affects: thumbnailer
Changed in thumbnailer (Ubuntu):
importance: High → Undecided
Michi Henning (michihenning) wrote :

I give up. Marking as invalid for thumbnailer.

Changed in thumbnailer (Ubuntu):
status: Triaged → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers