qemu target may fail to sync data to the image

Bug #1073899 reported by Paul Sokolovsky
This bug affects 1 person
Affects: LAVA Dispatcher
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When trying the linaro_test_shell action with the qemu target, there appears to be a high probability that logs, results, etc. won't be available for pickup from the underlying filesystem image after the test run. This appears to be because the following happens in quick succession:

1. Log/results files are written in qemu guest.
2. "sync" is performed in qemu guest.
3. qemu is terminated.

So it seems that step 2, the sync, doesn't immediately flush changes to the fs image on the host, and the (abrupt?) termination of qemu (closing of the pipe) doesn't flush them either.

mem-stress-a7 : PASS
<LAVA_TEST_RUNNER>: 0_passfail exited with: 0
<LAVA_TEST_RUNNER>: exiting<LAVA_DISPATCHER>2012-11-01 12:51:53 PM INFO: lava_test_shell seems to have completed
<LAVA_DISPATCHER>2012-11-01 12:51:53 PM INFO: attempting a filesystem sync before power_off

sync
sync
linaro-test [rc=0]# & 0_passfail-1351770706
&& 0_passfail-1351770706
<LAVA_DISPATCHER>2012-11-01 12:51:57 PM ERROR: error processing results for: 0_passfail-1351770706
Traceback (most recent call last):
  File "/home/pfalcon/devel/linaro/lava/lava-dispatcher/lava_dispatcher/lava_test_shell.py", line 208, in get_bundle
    testruns.append(_get_test_run(results_dir, d, hwctx, swctx))
  File "/home/pfalcon/devel/linaro/lava/lava-dispatcher/lava_dispatcher/lava_test_shell.py", line 164, in _get_test_run
    testdef = json.loads(testdef)
  File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
{'test_runs': [], 'format': 'Dashboard Bundle Format 1.3'}
<LAVA_DISPATCHER>2012-11-01 12:51:57 PM INFO: [ACTION-E] lava_test_shell is finished successfully.

Paul Sokolovsky (pfalcon) wrote :

Doing sleep(3) before closing the qemu pipe works around this issue for me, though it is clearly just a workaround, and there may be cases where a 3 s sleep is not enough.
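A minimal sketch of the workaround described above, not the actual lava-dispatcher patch. The function name and the `send_line`/`close` callables are hypothetical stand-ins for whatever the dispatcher uses to talk to the qemu console; only the order of operations (sync, fixed delay, close) comes from the report.

```python
import time


def sync_then_close(send_line, close, delay=3.0, sleep=time.sleep):
    """Ask the guest to sync, wait a fixed delay, then close the qemu pipe.

    send_line and close are hypothetical callables supplied by the caller.
    The delay is a guess: nothing guarantees the guest has finished
    flushing by the time it expires.
    """
    send_line("sync")
    sleep(delay)  # fixed grace period before the pipe goes away
    close()       # closing the pipe terminates qemu


if __name__ == "__main__":
    events = []
    sync_then_close(events.append, lambda: events.append("<closed>"), delay=0.1)
    print(events)
```

Because the delay is fixed, this only narrows the window in which data can be lost; it does not confirm that the guest ever executed the sync.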

Paul Sokolovsky (pfalcon) wrote :

qemu support in lava-dispatcher uses "-drive if=%s,cache=writeback". I thought the problem might be due to cache=writeback and tried cache=writethrough and cache=none, but the problem persists. (Maybe nano image 12.03 doesn't implement sync properly?)

Paul Sokolovsky (pfalcon) wrote :

For completeness I also tried cache=directsync, with the same result.

So, I'm going to prepare a patch for the sleep() workaround.
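For reference, the cache modes tried in the comments above can be laid out as follows. This is only an illustration: the helper name, the image path, and the "sd" interface are assumptions, and the real dispatcher builds its option string from "-drive if=%s,cache=writeback" in its own way.

```python
# Cache modes mentioned in this report, in the order they were tried.
CACHE_MODES_TRIED = ["writeback", "writethrough", "none", "directsync"]


def drive_args(image_path, interface, cache):
    """Build a qemu -drive argument pair (hypothetical helper)."""
    return ["-drive", "file=%s,if=%s,cache=%s" % (image_path, interface, cache)]


if __name__ == "__main__":
    for mode in CACHE_MODES_TRIED:
        # "rootfs.img" and "sd" are placeholders, not values from the report.
        print(" ".join(drive_args("rootfs.img", "sd", mode)))
```

Since all four modes gave the same result, the host-side cache setting is probably not where the data is being lost.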

Michael Hudson-Doyle (mwhudson) wrote : Re: [Bug 1073899] Re: qemu target may fail to sync data to the image

Paul Sokolovsky <email address hidden> writes:

> For completeness I also tried cache=directsync, with the same result.
>
> So, I'm going to prepare patch for sleep() workaround.

We recently added a 'halt' to fastmodel.py to work around this same
issue. Maybe that will work for qemu.py too? Although sleep will
probably work just as well.

Cheers,
mwh

Andy Doan (doanac) wrote :

The real issue, at least when I've seen it, is this:

The lava-test-runner gets launched before the root shell prompt is active. The test it runs is really fast, so we get ready to exit before there's a console. Then the qemu.py code calls sync/halt, but they don't get executed because the console isn't ready to act on them yet. We then shut down and the changes get lost.

We need to change the upstart entry for this job to wait on the root console to be ready.
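The race described above can be illustrated from the dispatcher side. Note that this is not the upstart change proposed here; it is a rough sketch of the same idea (do not send sync/halt until a prompt has been seen). The function names and the `read_chunk`/`send_line` callables are hypothetical, and the prompt pattern is taken from the "linaro-test [rc=0]# " prompt visible in the log in the bug description.

```python
import re
import time

# Prompt as it appears in the log above; the exact pattern is an assumption.
PROMPT_RE = re.compile(r"linaro-test \[rc=\d+\]# ")


def wait_for_prompt(read_chunk, timeout=10.0, clock=time.monotonic):
    """Accumulate console output until the prompt shows up or time runs out.

    read_chunk is a hypothetical callable returning the next piece of
    console output as a string (assumed to block briefly when idle).
    """
    deadline = clock() + timeout
    buf = ""
    while clock() < deadline:
        buf += read_chunk()
        if PROMPT_RE.search(buf):
            return True
    return False


def clean_shutdown(read_chunk, send_line, timeout=10.0):
    """Send sync and halt only once the console is known to be listening."""
    if not wait_for_prompt(read_chunk, timeout):
        return False  # no prompt seen: commands would be lost anyway
    send_line("sync")
    send_line("halt")
    return True
```

Returning False when no prompt appears lets the caller report that the shutdown was not clean, instead of silently losing the results.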

Antonio Terceiro (terceiro) wrote :

Is there a reason why we start those tests from an upstart job instead
of waiting for the shell prompt and then launch the test from there?

Michael Hudson-Doyle (mwhudson) wrote :

Antonio Terceiro <email address hidden> writes:

> Is there a reason why we start those tests from an upstart job instead
> of waiting for the shell prompt and then launch the test from there?

It felt like it would be more reliable and depend less on serial. But
it turns out we send commands to the prompt anyway and so depend on
serial...

Cheers,
mwh

Andy Doan (doanac) wrote :

Still, depending on serial for attempting to shut down cleanly is better than depending on it to do anything at all.

Michael Hudson-Doyle (mwhudson) wrote :

So this is just as simple as the linked branch, right?

Paul Sokolovsky (pfalcon) wrote :

Confirming the linked branch fixes the issue for me.

Changed in lava-dispatcher:
status: New → In Progress
Alan Bennett (akbennett) wrote :

Fixing this bug does not fit in our development plans. Moving forward, all LAVA bugs will be disposed/scrubbed/triaged weekly.

Changed in lava-dispatcher:
status: In Progress → Won't Fix