Mir

server crashes when shutting down in smoke tests

Bug #1560900 reported by Alan Griffiths
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mir
Fix Released
High
Alan Griffiths
mir (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

The "smoke tests" run each of the demo clients for 3 seconds and ensure they and the server all exit cleanly. But we've seen an instance where this fails:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/240/consoleFull

05:09:56 Starting
05:09:56 libevdev error in fix_invalid_absinfo: BUG: Device "mtk-tpd" has invalid ABS_MT_TRACKING_ID rangeConnected
05:09:56 Mir chose pixel format 1
05:09:56 Surface created
05:09:59 Signal 15 received. Good night.
05:10:08 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10047 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
05:10:08 I: [FAILED] mir_demo_client_scroll

This looks like the server crashed between signalling the client and shutting down.

All 21 of the other smoke tests pass:

05:11:48 I: The following clients failed to execute successfully:
05:11:48 I: mir_demo_client_scroll
05:11:48 I: Smoke testing complete with returncode -1

Tags: testsfail

Related branches

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

The reason I think the server core dumps is that one would expect to see "mirserver: Stopping" one second after "Signal 15 received. Good night".

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Also seen for other MPs:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/232/consoleFull

18:02:02 Signal 15 received. Good night.
18:02:11 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10036 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
18:02:11 I: [FAILED] mir_demo_client_scroll

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/228/console

09:01:13 Signal 15 received. Good night.
09:01:22 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10055 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
09:01:22 I: [FAILED] mir_demo_client_scroll

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/device_type=krillin/218/console

18:37:36 Signal 15 received. Good night.
18:37:45 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10024 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
18:37:45 I: [FAILED] mir_demo_client_scroll

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Hmm, mir_demo_client_scroll is "special" in that it exits without shutting down the connection.

We had a fix for that killing the server in -r 3385. Could this be a second bug in that area? Or the old one resurfacing?

Changed in mir:
status: New → In Progress
assignee: nobody → Alan Griffiths (alan-griffiths)
importance: Undecided → High
milestone: none → 0.22.0
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

/All/ I needed was a 14 year old laptop running stress to reproduce the crash. Here's a stacktrace

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

It appears that the session mediator is destroyed before the "complete" functor on LL422-429 is invoked. The session mediator report shows "connection dropped without disconnect" just before getting the segfault. (It also reports this in all the successful runs.)

The fix in -r 3385 ought to be fixing that - ~SessionMediator() calls "shell->close_session(session)" which (indirectly) calls ms::SessionManager::close_session() and that calls "scene_session->force_requests_to_complete()" after which there should be no more callbacks and deletion should proceed.

Or not... BufferQueue::give_buffer_to_client() pulls the completion from pending before releasing a lock. That means that force_requests_to_complete() can get the lock and run before the completion is called.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

OK, as a quick hack this failure is fixed by:

=== modified file 'src/server/compositor/buffer_queue.cpp'
--- src/server/compositor/buffer_queue.cpp 2016-03-22 02:29:50 +0000
+++ src/server/compositor/buffer_queue.cpp 2016-03-23 18:13:36 +0000
@@ -554,7 +554,7 @@

     buffers_owned_by_client.push_back(buffer);

- lock.unlock();
+// lock.unlock();
     try
     {
         give_to_client_cb(buffer);

But I need to have a closer look at BufferQueue and its locking - as the lock is also released by the code a few lines above:

        snapshot_released.wait(lock,
            [&]{ return !contains(buffer, pending_snapshots); });

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Note that the test scenaro is broken by -r 3423. Either that needs reverting, or the test needs reinstating elsewhere. (The latter is probably more elegant.)

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I kind of support reverting r3423. Although it's lots of extra code, it's nice having real unique demo code and not having every demo use the same logic.

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

I disagree on reverting r3423. The logic is outdated and wrong in various places. There's no reason we need that under examples.

But reinstating somewhere else would be ok - perhaps under tests/impolite-clients

Revision history for this message
Alberto Aguirre (albaguirre) wrote :

@Alan,

Here's another instance of the issue that did not involve scroll:

https://mir-jenkins.ubuntu.com/job/device-runtests-mir/311/device_type=krillin/console
"
15:17:20 /home/phablet/mir-testrunner/mir-smoke-test-runner.sh: line 23: 10426 Segmentation fault (core dumped) ${root}mir_demo_server --test-timeout=$timeout --test-client $root$client
15:17:20 I: [FAILED] mir_demo_client_egltriangle
"

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

That's the first failure I've seen that with a client that exits cleanly. It might have the same root cause as lp:1560900 but I'd really like to reproduce to be sure.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

OK, instead of reverting r3423 we landed mir_test_client_impolite_shutdown and updated the "smoke" test script to run all the mir_test_client_* executables packaged in mir-test-tools.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

As described above I've been able to reproduce the more common mir_test_client_impolite_shutdown (nee mir_demo_client_scroll) failure reliably (within a minute).

I've left the same setup running with mir_demo_client_egltriangle for hours without seeing a failure (or collecting a crash dump).

The linked branch fixes the mir_test_client_impolite_shutdown failure, I'm not going to look further for a cause of the mir_demo_client_egltriangle failure unless we see it again.

Revision history for this message
Mir CI Bot (mir-ci-bot) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.22.0

Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mir - 0.22.1+16.04.20160516.2-0ubuntu2

---------------
mir (0.22.1+16.04.20160516.2-0ubuntu2) yakkety; urgency=medium

  [ Dimitri John Ledkov ]
  * Fix FTBFS error: call of overloaded ‘abs(float)’ is ambiguous, by
    including cmath c++ header.

 -- Łukasz 'sil2100' Zemczak <email address hidden> Thu, 19 May 2016 21:58:43 +0200

Changed in mir (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.