Mir

server crashes when shutting down in smoke tests

Bug #1560900 reported by Alan Griffiths on 2016-03-23
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mir
Fix Released
High
Alan Griffiths
mir (Ubuntu)
Undecided
Unassigned

Bug Description

The "smoke tests" run each of the demo clients for 3 seconds and ensure they and the server all exit cleanly. But we've seen an instance where this fails:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/240/consoleFull

05:09:56 Starting
05:09:56 libevdev error in fix_invalid_absinfo: BUG: Device "mtk-tpd" has invalid ABS_MT_TRACKING_ID rangeConnected
05:09:56 Mir chose pixel format 1
05:09:56 Surface created
05:09:59 Signal 15 received. Good night.
05:10:08 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10047 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
05:10:08 I: [FAILED] mir_demo_client_scroll

This looks like the server crashed between signalling the client and shutting down.

All 21 of the other smoke tests pass:

05:11:48 I: The following clients failed to execute successfully:
05:11:48 I: mir_demo_client_scroll
05:11:48 I: Smoke testing complete with returncode -1

Related branches

Alan Griffiths (alan-griffiths) wrote :

The reason I think the server core dumps is that one would expect to see "mirserver: Stopping" one second after "Signal 15 received. Good night".

Alan Griffiths (alan-griffiths) wrote :

Also seen for other MPs:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/232/consoleFull

18:02:02 Signal 15 received. Good night.
18:02:11 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10036 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
18:02:11 I: [FAILED] mir_demo_client_scroll

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/228/console

09:01:13 Signal 15 received. Good night.
09:01:22 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10055 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
09:01:22 I: [FAILED] mir_demo_client_scroll

Alan Griffiths (alan-griffiths) wrote :

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/218/console

18:37:36 Signal 15 received. Good night.
18:37:45 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10024 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
18:37:45 I: [FAILED] mir_demo_client_scroll

Alan Griffiths (alan-griffiths) wrote :

Hmm, mir_demo_client_scroll is "special" in that it exits without shutting down the connection.

We had a fix for that killing the server in -r 3385. Could this be a second bug in that area? Or the old one resurfacing?

Changed in mir:
status: New → In Progress
assignee: nobody → Alan Griffiths (alan-griffiths)
importance: Undecided → High
milestone: none → 0.22.0
Alan Griffiths (alan-griffiths) wrote :

/All/ I needed was a 14 year old laptop running stress to reproduce the crash. Here's a stacktrace

Alan Griffiths (alan-griffiths) wrote :

It appears that the session mediator is destroyed before the "complete" functor on LL422-429 is invoked. The session mediator report shows "connection dropped without disconnect" just before getting the segfault. (It also reports this in all the successful runs.)

The fix in -r 3385 ought to be fixing that - ~SessionMediator() calls "shell->close_session(session)" which (indirectly) calls ms::SessionManager::close_session() and that calls "scene_session->force_requests_to_complete()" after which there should be no more callbacks and deletion should proceed.

Or not... BufferQueue::give_buffer_to_client() pulls the completion from pending before releasing a lock. That means that force_requests_to_complete() can get the lock and run before the completion is called.

Alan Griffiths (alan-griffiths) wrote :

OK, as a quick hack this failure is fixed by:

=== modified file 'src/server/compositor/buffer_queue.cpp'
--- src/server/compositor/buffer_queue.cpp 2016-03-22 02:29:50 +0000
+++ src/server/compositor/buffer_queue.cpp 2016-03-23 18:13:36 +0000
@@ -554,7 +554,7 @@

     buffers_owned_by_client.push_back(buffer);

- lock.unlock();
+// lock.unlock();
     try
     {
         give_to_client_cb(buffer);

But I need to have a closer look at BufferQueue and its locking - as the lock is also released by the code a few lines above:

        snapshot_released.wait(lock,
            [&]{ return !contains(buffer, pending_snapshots); });

Alan Griffiths (alan-griffiths) wrote :

Note that the test scenaro is broken by -r 3423. Either that needs reverting, or the test needs reinstating elsewhere. (The latter is probably more elegant.)

Daniel van Vugt (vanvugt) wrote :

I kind of support reverting r3423. Although it's lots of extra code, it's nice having real unique demo code and not having every demo use the same logic.

Alberto Aguirre (albaguirre) wrote :

I disagree on reverting r3423. The logic is outdated and wrong in various places. There's no reason we need that under examples.

But reinstating somewhere else would be ok - perhaps under tests/impolite-clients

Alberto Aguirre (albaguirre) wrote :

@Alan,

Here's another instance of the issue that did not involve scroll:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/311/device_type=krillin/console
"
15:17:20 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10426 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
15:17:20 I: [FAILED] mir_demo_client_egltriangle
"

That's the first failure I've seen that with a client that exits cleanly. It might have the same root cause as lp:1560900 but I'd really like to reproduce to be sure.

OK, instead of reverting r3423 we landed mir_test_client_impolite_shutdown and updated the "smoke" test script to run all the mir_test_client_* executables packaged in mir-test-tools.

As described above I've been able to reproduce the more common mir_test_client_impolite_shutdown (nee mir_demo_client_scroll) failure reliably (within a minute).

I've left the same setup running with mir_demo_client_egltriangle for hours without seeing a failure (or collecting a crash dump).

The linked branch fixes the mir_test_client_impolite_shutdown failure, I'm not going to look further for a cause of the mir_demo_client_egltriangle failure unless we see it again.

Mir CI Bot (mir-ci-bot) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.22.0

Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mir - 0.22.1+16.04.20160516.2-0ubuntu2

---------------
mir (0.22.1+16.04.20160516.2-0ubuntu2) yakkety; urgency=medium

  [ Dimitri John Ledkov ]
  * Fix FTBFS error: call of overloaded ‘abs(float)’ is ambiguous, by
    including cmath c++ header.

 -- Łukasz 'sil2100' Zemczak <email address hidden> Thu, 19 May 2016 21:58:43 +0200

Changed in mir (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments