Checkbox loses results on system freeze

Bug #814801 reported by Marc Legris
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Checkbox
Fix Released
High
Brendan Donegan
Oneiric
Fix Released
Undecided
Unassigned
checkbox (Ubuntu)
Oneiric
Won't Fix
Undecided
Unassigned
Precise
Fix Released
Undecided
Unassigned

Bug Description

When testing Natty, I've noticed many times that when a test such as the wifi slider locks up the system, checkbox does not resume the run where it left off. Instead, it restarts at the beginning of the test run after the system is rebooted, which wastes a lot of testing time.

[Impact]
Since job information may be kept in memory and/or disk cache, it's possible for checkbox to start executing a test which crashes the system before the data is written to disk. When this happens, it often means that all test data up to the point of failure is lost, and the tester is forced to start the run from scratch, wasting time.

The fix for this bug improves reliability by requesting to flush all relevant caches prior to test execution. As mentioned in comment #10, this greatly reduces the chance that test data will be lost.

Since we depend on the goodwill of users to run Ubuntu Friendly tests, ensuring that their work doesn't go to waste if the system crashes is a good rationale for requesting that this fix be rolled into the stable release.

[Development fix]
Checkbox revision 1090 implements a fix by adding a safe_close method that takes care of flushing file descriptors and syncing the disk cache, and gets called before starting a job run.
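
For readers unfamiliar with the technique, here is a minimal Python 3 sketch of what a helper along these lines might look like. It is not the code from revision 1090: the signature, the sync_all parameter and the save_state wrapper are assumptions made for illustration. Only the general idea (flush the file buffer, fsync the descriptor, then close) comes from this report.

```python
import os


def safe_close(fileobj, sync_all=False):
    """Flush and fsync a file object, then close it (illustrative helper)."""
    fileobj.flush()               # push Python's userspace buffer to the kernel
    os.fsync(fileobj.fileno())    # ask the kernel to write this file to disk
    fileobj.close()
    # Hypothetical extra: optionally flush all dirty kernel buffers as well.
    # os.sync() is only available on Unix-like systems.
    if sync_all and hasattr(os, "sync"):
        os.sync()


def save_state(path, data):
    """Write bytes to path and make sure they reach the disk before returning."""
    fileobj = open(path, "wb")
    try:
        fileobj.write(data)
    finally:
        safe_close(fileobj)
```

Calling something like save_state() before each job starts means that a freeze during the job loses at most the job that was running, not the results gathered before it.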

[Stable fix]
This is fixed in http://bazaar.launchpad.net/~checkbox-dev/checkbox/trunk/revision/1090. Also, the linked branch (https://code.launchpad.net/~roadmr/ubuntu/oneiric/checkbox/0.12.8.1) contains this and other fixes and merges cleanly against Ubuntu checkbox 0.12.8.

[Test Case]
- Start "system testing" (checkbox-gtk version 0.12.8) on an 11.10 system using Unity 3d.
- Press "Next" and provide password when prompted.
- Leave all tests and suites selected and press Next.
- When the tests start, just press "alt-X" to invoke the "Next" button, skipping all the manual tests.
- Wait for a long-running test to start. cpu/clocktest is a good candidate. Monitoring .cache/checkbox/checkbox.log for the test names that are running is one way to do it, because bug 868995 may cause checkbox to hide, which makes it difficult to tell what's happening.
- As soon as the test starts, yank power to the system.
- Reboot the system, and try to run system testing again.

Expected behavior:
- Checkbox should offer to continue the interrupted run
- If accepted, it should go back to the last test that was running (e.g. cpu/clocktest) and continue from there.
Actual behavior:
- Checkbox may not offer to continue an interrupted run, showing that it has no record of one having been started.
- Even if offering to continue the run, checkbox may not start where it was interrupted, losing some test results that have to be repeated.
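
The expected behavior can be illustrated with a small, self-contained Python 3 sketch. This is not how checkbox stores its state. The JSON checkpoint file, the function names and the write-to-temporary-file-then-rename step are all assumptions added for illustration. The point is only that the run state is made durable before each job starts, so a restart can pick up at the interrupted job.

```python
import json
import os


def write_checkpoint(path, state):
    """Durably record run state: write a temp file, fsync it, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or None if there is no usable checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        return None


def run_jobs(jobs, runner, checkpoint_path):
    """Run jobs in order, resuming from the checkpoint if one exists."""
    state = load_checkpoint(checkpoint_path) or {"next": 0, "results": {}}
    for index in range(state["next"], len(jobs)):
        state["next"] = index
        # Persist *before* running the job, since the job may freeze the system.
        write_checkpoint(checkpoint_path, state)
        state["results"][jobs[index]] = runner(jobs[index])
    state["next"] = len(jobs)
    write_checkpoint(checkpoint_path, state)
    return state["results"]
```

If the system dies while runner() is executing job N, the checkpoint on disk still says "next is N" and holds the results of jobs 0 to N-1, so the next call to run_jobs() starts again at job N.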

[Regression potential]
The code is rather simple, but it could cause some perceived slowness during test execution, as checkbox will now insist on flushing everything to disk before each test. Also it's possible that this extra disk activity may trigger latent disk failures that can potentially be blamed on checkbox, but the level of extra activity is comparatively small and should pose no actual problems.
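
To get a rough feel for the slowdown on a given machine, a comparison like the following Python 3 sketch could be run. It is a hypothetical micro-benchmark, not part of checkbox, and the record count and payload size are arbitrary. Results will vary widely with the disk and filesystem, so no expected numbers are given.

```python
import os
import tempfile
import time


def time_writes(count, payload, use_fsync):
    """Return the seconds taken to write count records, optionally fsyncing each."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            for _ in range(count):
                f.write(payload)
                if use_fsync:
                    f.flush()             # userspace buffer -> kernel
                    os.fsync(f.fileno())  # kernel cache -> disk
            elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return elapsed


if __name__ == "__main__":
    buffered = time_writes(200, b"x" * 4096, use_fsync=False)
    synced = time_writes(200, b"x" * 4096, use_fsync=True)
    print(f"buffered: {buffered:.4f}s, fsync per write: {synced:.4f}s")
```

Since checkbox would only sync once per test, not once per record, the added cost per run should be a small multiple of a single fsync.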

Revision history for this message
Marc Legris (maaarc-deactivatedaccount-deactivatedaccount-deactivatedaccount-deactivatedaccount-deactivatedaccount) wrote :

Maybe checkbox should upload test results at the end of each test, or output the results of each test to an XML file for uploading later on.

Changed in checkbox-certification:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Daniel Manrique (roadmr) wrote :

Marc (Tardif) suggested adding a sync call after every successful test, to ensure that test results get written safely before going into another, potentially system-crashing test.

If you launch checkbox manually on these systems, as we usually do when certifying, it would help if you could:

1- Launch the checkbox run
2- When the system locks up, reboot
3- Back up the whole .checkbox directory (so we can check the log files and messages for any corruption or inconsistencies)
4- Restart the run
5- Look at the log file to see what checkbox thinks happened with the old results.

The .checkbox directory from the failed/crashed attempt might reveal where the problem lies.

Revision history for this message
Jeff Lane  (bladernr) wrote :

I know you're busy this week, but if you get a chance some time could you take a look at this one? Marked as incomplete as it's waiting on some stuff per Daniel's request.

Changed in checkbox-certification:
assignee: nobody → Marc Legris (maaarc)
status: Confirmed → Incomplete
Revision history for this message
Marc Legris (maaarc-deactivatedaccount-deactivatedaccount-deactivatedaccount-deactivatedaccount-deactivatedaccount) wrote :

The system has been reimaged. I'll have to wait until I see it again to get the logs.

Changed in checkbox-certification:
status: Incomplete → Triaged
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Syncing after each test is probably a good idea anyway, so I don't think any further info is needed. I'll assign this to myself as I'm looking at it.

Changed in checkbox-certification:
assignee: Marc Legris (maaarc) → Brendan Donegan (brendan-donegan)
status: Triaged → In Progress
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

So, it looks like the persist.save() function is only called on prompt_job, which is fired before the test runs. I think putting the call in report_job might just do the trick, but it's a little difficult debugging it since report-job is fired so many times. I'll keep investigating this avenue.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

It seems like I can successfully get the bpickle file to be written fully after every test. Looks like some more needs to be done to make the job store get written properly.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I think I found the function I was looking for in Python, which is os.fsync. Calling it after flushing the file buffer and before the file is closed (or after each write) should result in better reliability.

Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Just noticed this was in checkbox-certification for some reason and that it was private. Changed that.

affects: checkbox-certification → checkbox
visibility: private → public
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Ok, I've tested this and checkbox seems pretty robust now! I don't think I've covered all the bases, so this will need a good bit more testing, but what I've done so far is:

1.) Test before changes. Run checkbox and yank power at some point (e.g. while graphics tests are running). Power on the system and run checkbox. Notice that we're back to square one (even if we press 'Yes' to the recover prompt)

2.) Test after changes. Run checkbox, yank power etc. Power on and notice that we're at the test where we yanked the power.

3.) Profit!

Changed in checkbox:
milestone: none → 0.13
Ara Pulido (ara)
Changed in checkbox (Ubuntu Oneiric):
milestone: none → oneiric-updates
Changed in checkbox:
milestone: 0.13 → 0.12.9
Changed in checkbox (Ubuntu Oneiric):
status: New → Triaged
Changed in checkbox:
status: In Progress → Fix Committed
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

Marking this fix released for checkbox as it's in the trunk and backported to 0.12 series. Will update the O and P tasks when the SRU and Precise are released respectively.

Changed in checkbox:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
status: Fix Committed → Fix Released
Changed in checkbox (Ubuntu Precise):
status: Triaged → Fix Released
Daniel Manrique (roadmr)
description: updated
Daniel Manrique (roadmr)
description: updated
Daniel Manrique (roadmr)
description: updated
Daniel Manrique (roadmr)
no longer affects: checkbox (Ubuntu)
Revision history for this message
Chris Halse Rogers (raof) wrote : Please test proposed package

Hello Marc, or anyone else affected,

Accepted checkbox into oneiric-proposed. The package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in checkbox (Ubuntu Oneiric):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Brendan Donegan (brendan-donegan) wrote :

I tested this by running Checkbox and then removing the power from my laptop at the camera/video test. Upon restarting, Checkbox asked me if I wanted to resume from where I left off, and I selected 'Yes'. The tests resumed at the camera/video test.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Rolf Leggewie (r0lf) wrote :

oneiric has seen the end of its life and is no longer receiving any updates. Marking the oneiric task for this ticket as "Won't Fix".

Changed in checkbox (Ubuntu Oneiric):
status: Fix Committed → Won't Fix