checkbox-certification server crashes with "Definition of job... has changed" error

Bug #1302576 reported by Rod Smith
This bug affects 1 person
Affects                          Status         Importance   Assigned to        Milestone
Next Generation Checkbox (CLI)   Fix Released   Medium       Zygmunt Krynicki
PlainBox (Toolkit)               Fix Released   Medium       Zygmunt Krynicki
checkbox-certification           Invalid        Undecided    Unassigned

Bug Description

I've had two checkbox-certification-server runs crash on me today, both with similar errors:

Local job 2013.com.canonical.certification::__info__ produced job <JobDefinition id:'2013.com.canonical.certification::dmi_attachment' plugin:'attachment'> that collides with an existing job 2013.com.canonical.certification::dmi_attachment (from /usr/lib/plainbox-providers-1/checkbox/jobs/info.txt:20-24), the new job was discarded
Executable 'checkbox' invoked with Namespace(c3_url='https://certification.canonical.com/submissions/submit/', check_config=False, command=<checkbox_ng.commands.certification.CertificationCommand object at 0x7f344b26d048>, debug_console=False, debug_interrupt=False, exclude_pattern_list=[], include_pattern_list=[], log_level=None, not_interactive=False, pdb=False, providers=None, secure_id=None, trace=[], whitelist=[]) has crashed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/plainbox/impl/clitools.py", line 513, in dispatch_and_catch_exceptions
    return self.dispatch_command(ns)
  File "/usr/lib/python3/dist-packages/plainbox/impl/clitools.py", line 509, in dispatch_command
    return ns.command.invoked(ns)
  File "/usr/lib/python3/dist-packages/checkbox_ng/commands/certification.py", line 132, in invoked
    self.settings, ns).run()
  File "/usr/lib/python3/dist-packages/checkbox_ng/commands/cli.py", line 490, in run
    job_list, previous_session_file)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/manager.py", line 255, in load_session
    state = SessionResumeHelper(job_list).resume(data, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 154, in resume
    return self._resume_json(json_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 176, in _resume_json
    self.job_list).resume_json(json_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 232, in resume_json
    return self._build_SessionState(session_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 258, in _build_SessionState
    self._restore_SessionState_jobs_and_results(session, session_repr)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 293, in _restore_SessionState_jobs_and_results
    self._process_job(session, jobs_repr, results_repr, job_id)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 357, in _process_job
    _("Definition of job {!r} has changed").format(job_id))
plainbox.impl.session.resume.IncompatibleJobError: Definition of job '2013.com.canonical.certification::dmi_attachment' has changed
Traceback (most recent call last):
  File "/usr/bin/canonical-certification-server", line 9, in <module>
    load_entry_point('checkbox-ng==0.3.dev', 'console_scripts', 'canonical-certification-server')()
  File "/usr/lib/python3/dist-packages/checkbox_ng/main.py", line 192, in cert_server
    CertificationNGTool().main(['certification-server'] + args))
  File "/usr/lib/python3/dist-packages/plainbox/impl/clitools.py", line 291, in main
    return self.dispatch_and_catch_exceptions(ns)
  File "/usr/lib/python3/dist-packages/plainbox/impl/clitools.py", line 513, in dispatch_and_catch_exceptions
    return self.dispatch_command(ns)
  File "/usr/lib/python3/dist-packages/plainbox/impl/clitools.py", line 509, in dispatch_command
    return ns.command.invoked(ns)
  File "/usr/lib/python3/dist-packages/checkbox_ng/commands/certification.py", line 132, in invoked
    self.settings, ns).run()
  File "/usr/lib/python3/dist-packages/checkbox_ng/commands/cli.py", line 490, in run
    job_list, previous_session_file)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/manager.py", line 255, in load_session
    state = SessionResumeHelper(job_list).resume(data, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 154, in resume
    return self._resume_json(json_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 176, in _resume_json
    self.job_list).resume_json(json_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 232, in resume_json
    return self._build_SessionState(session_repr, early_cb)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 258, in _build_SessionState
    self._restore_SessionState_jobs_and_results(session, session_repr)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 293, in _restore_SessionState_jobs_and_results
    self._process_job(session, jobs_repr, results_repr, job_id)
  File "/usr/lib/python3/dist-packages/plainbox/impl/session/resume.py", line 357, in _process_job
    _("Definition of job {!r} has changed").format(job_id))
plainbox.impl.session.resume.IncompatibleJobError: Definition of job '2013.com.canonical.certification::dmi_attachment' has changed

I'm attaching the logs from the ~/.cache directory to this bug report. Both were using the latest (April 3) version of checkbox-ng from the dev PPA.

Rod Smith (rodsmith) wrote :
tags: added: blocks-hwcert-server
Daniel Manrique (roadmr) wrote :

Interesting.

Could you please paste the output of:

COLUMNS=200 dpkg --list "*plainbox*" "*checkbox*"

just to confirm the package versions you have.

Changed in checkbox-certification:
status: New → Incomplete
Rod Smith (rodsmith) wrote :

$ COLUMNS=200 dpkg --list "*plainbox*" "*checkbox*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================================-===========================-===========================-=============================================================================================
un checkbox <none> <none> (no description available)
ii checkbox-ng 0.3~dev+bzr2871+pkg1~ubuntu all PlainBox based test runner
ii plainbox-insecure-policy 0.6~dev+bzr2871+pkg2~ubuntu all policykit policy required to use plainbox (insecure version)
ii plainbox-provider-certification-server 0.1~dev+bzr2871+pkg4~ubuntu all Server Certification
ii plainbox-provider-checkbox 0.4~dev+bzr2872+pkg1~ubuntu amd64 CheckBox provider for PlainBox
ii plainbox-provider-resource-generic 0.3~dev+bzr2874+pkg3~ubuntu amd64 CheckBox generic resource jobs provider
un plainbox-secure-policy <none> <none> (no description available)
ii python3-checkbox-ng 0.3~dev+bzr2871+pkg1~ubuntu all PlainBox based test runner (Python 3 library)
ii python3-checkbox-support 0.2~dev+bzr2871+pkg1~ubuntu all collection of Python modules used by PlainBox providers
ii python3-plainbox 0.6~dev+bzr2871+pkg2~ubuntu all toolkit for software and hardware testing (python3 module)

Daniel Manrique (roadmr) wrote :

Hi Rod,

Package versions look OK and I'm unable to reproduce this locally. Your trace mentions stuff about restoring a session, and I know that we recently changed that very job definition (dmi_attachment) since it contained a typo. Thus my theory is that you're trying to restore a session (saying "y" to resume testing) and since the session's version of the job differs from the one in the data file, you get this error.

Could you please remove the .cache/plainbox directory (or just rename it if you have important test data there; in that case I can later tell you how to "recreate" the typo so the jobs match and you don't have to start from zero), and then run this again?

Let me know.
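
For anyone who would rather rename the directory than delete it, here is a minimal Python sketch. It assumes the saved sessions live under ~/.cache/plainbox, as mentioned above; the function name and the backup naming scheme are made up for illustration and are not part of PlainBox.

```python
import time
from pathlib import Path
from typing import Optional


def set_aside_session_dir(cache_dir: Path) -> Optional[Path]:
    """Rename cache_dir to a timestamped backup next to it.

    Returns the backup path, or None if cache_dir does not exist.
    The ".bak-<timestamp>" suffix is an arbitrary choice for this sketch.
    """
    if not cache_dir.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = cache_dir.with_name("{}.bak-{}".format(cache_dir.name, stamp))
    # Renaming keeps the old session data intact for later inspection.
    cache_dir.rename(backup)
    return backup
```

Calling set_aside_session_dir(Path.home() / ".cache" / "plainbox") before the next run should leave no saved session to resume while keeping the old data around.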

Zygmunt Krynicki (zyga) wrote :

Daniel is right. Basically, this is a feature; the bug is that we don't have any user interface that tells the user what to do next.

Daniel Manrique (roadmr) wrote :

Marked invalid for checkbox-certification as it was a bit of a "stale data" case. Moved to plainbox since it's a core issue, and triaged since we know what happens, although we're not sure how to handle it at the moment. Scheduled for "future" with importance: medium since as things stabilize, the chance of jobs mutating in the middle of a run (with a suspend/resume in between) will be reduced.

affects: plainbox-provider-checkbox → plainbox
Changed in checkbox-certification:
status: Incomplete → Invalid
Changed in plainbox:
status: New → Confirmed
status: Confirmed → Triaged
milestone: none → future
importance: Undecided → Medium
Rod Smith (rodsmith) wrote :

The output I posted in the main bug report and the attachment with the C220 data are from a re-run after an initial failure. The C240 attachment is from an original failure WITHOUT re-running anything. Sorry for not mentioning that earlier. The upshot is that this is not simply a failure caused by re-running the program with stale leftover data.

Daniel Manrique (roadmr) wrote :

Hi Rod,

That's very strange then. Can you reproduce this on the C240? Could you try both without and with removal of the .cache/plainbox dir? Is the backtrace *exactly* the same? (The C240 may be encountering a different problem.)

Rod Smith (rodsmith) wrote :

I tried again. This time I left old files in place on the C220 and deleted them all, as you suggested, on the C240. Both runs crashed, and both produced identical screen output, at least in the final few lines. (I was running with "screen," so I couldn't scroll back.) I'm attaching the contents of the ~/.cache directory tree for both runs, as well as the visible screen output.

Rod Smith (rodsmith) wrote :

Oh, one more thing: This bug is recent. I ran 14.04 certifications on this hardware as recently as March 14 and the runs then succeeded.

Rod Smith (rodsmith) wrote :

FWIW, I was able to finish the tests after a software update yesterday, although one system exhibited a new problem (bug 1304436).

With all due respect, Zygmunt, this is *NOT* a feature. The certification *CRASHED* with no results.html or submission.xml file, so a certification would be impossible had the problem persisted.

Zygmunt Krynicki (zyga)
Changed in plainbox:
milestone: future → 0.6
assignee: nobody → Zygmunt Krynicki (zkrynicki)
status: Triaged → In Progress
Changed in checkbox-ng:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Zygmunt Krynicki (zkrynicki)
Changed in plainbox:
milestone: 0.6 → 0.7
Changed in checkbox-ng:
milestone: none → 0.7
Changed in plainbox:
milestone: 0.7 → 0.8
Changed in plainbox:
milestone: 0.8 → 0.9
Changed in checkbox-ng:
milestone: 0.7 → 0.9
Changed in plainbox:
milestone: 0.9 → 0.11
Daniel Manrique (roadmr) wrote :

Based on prior comments:

- If a job definition changed between the time a session was saved and the time it is restored, we'll get plainbox.impl.session.resume.IncompatibleJobError.

- This exception is currently not handled.

- We need to handle it in some way.
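
To make the first point concrete, here is a rough Python sketch of how such a check could work. It is an assumption for illustration, not the actual code in plainbox/impl/session/resume.py: the exception class is a local stand-in, job_checksum() and check_job() are hypothetical helpers, and the job fields in the usage note are placeholders.

```python
import hashlib
import json


class IncompatibleJobError(Exception):
    """Local stand-in for plainbox.impl.session.resume.IncompatibleJobError."""


def job_checksum(definition: dict) -> str:
    """Hash a job definition so that key order does not matter."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_job(job_id: str, saved_checksum: str, current_jobs: dict) -> None:
    """Raise if the job known now differs from the one saved in the session."""
    if job_checksum(current_jobs[job_id]) != saved_checksum:
        raise IncompatibleJobError(
            "Definition of job {!r} has changed".format(job_id))
```

With a session that saved job_checksum({"plugin": "attachment", "command": "old command text"}), resuming against a provider whose definition now reads "new command text" would raise the error, which matches the stale-session theory above.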

Options to handle it:

1- Ask the user (use the new job definition, use the old one from the session, discard).
2- Just spit out a warning and do one of those three options. I'd be inclined to either use the new job definition or remove the definition altogether.

Whatever the decision, part of the problem is that a changed job definition may have further implications. What if its dependencies changed? What will happen to jobs depending on it? If we remove it, will it make other jobs suddenly unrunnable due to non-satisfiable dependencies?

So another option to handle this is to indeed "crash", as in, flat out refuse to restore the session. Something like:

- Unable to restore the session due to $REASON. What to do?
  1- Start a new session
  2- Just exit (the user should solve this manually somehow).

For $REASON, I think this message is perfectly clear: "Definition of job '2013.com.canonical.certification::dmi_attachment' has changed", so I'd be OK to keep that.
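
A minimal Python sketch of that "refuse and ask" flow might look like the following. Everything here is hypothetical: SessionResumeError is a local stand-in for whatever the resume code raises, and resume, start_new and ask are hooks invented for the example rather than existing PlainBox or checkbox-ng APIs.

```python
class SessionResumeError(Exception):
    """Local stand-in for a resume failure such as IncompatibleJobError."""


def resume_or_ask(resume, start_new, ask):
    """Try to resume; on failure, explain why and let the user choose.

    resume and start_new are zero-argument callables; ask(prompt) returns
    the user's answer as a string. All three are hypothetical hooks.
    """
    try:
        return resume()
    except SessionResumeError as exc:
        answer = ask(
            "Unable to restore the session due to: {}\n"
            "  1- Start a new session\n"
            "  2- Just exit\n"
            "Choice: ".format(exc))
        if answer.strip() == "1":
            return start_new()
        # Anything else is treated as "just exit".
        raise SystemExit(1)
```

Passing the built-in input as ask would give the CLI behaviour; a GUI front end could supply its own dialog through the same hook, which is one possible answer to the checkbox-gui question.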

And then things continue getting complicated: how do we handle this in checkbox-gui?

Anyway, I'll shift this to "future" pending some discussion on how to solve this.

Changed in plainbox:
milestone: 0.11 → future
Changed in checkbox-ng:
milestone: 0.9 → future
Rod Smith (rodsmith) wrote :

I'm not convinced that the original problem was caused by stale data; as I noted, it occurred even when I deleted the ~/.cache directory tree. That said, the original problem also disappeared prior to 14.04 GA, so that's a moot point. Because everything was changing so rapidly back then, and because the systems that exhibited the problem are now in OIL, it would be next to impossible to reproduce the original problem today.

As to dealing with the problem of a changed job definition, my initial reaction is that it would be appropriate to invalidate the original run and re-run everything from scratch (deleting the partial results from the first run). My reasoning is that a job definition should only have changed if a software update had occurred, so continuing runs the risk of having an inconsistent set of software used in test results -- that is, Test A might have version 1.2.3 of Package X, whereas Test B might be run using version 1.2.4 of Package X (where both tests obviously rely on Package X in some important way). Such an inconsistency strikes me as being potentially very confusing, should consistency be important when comparing test results (either manually or, at least in theory, by some other test). Displaying a warning to the effect that old results will be deleted and tests started again seems reasonable, as does giving the option to abort the run and leave whatever temporary results are available intact. That would be helpful if those initial results held important data for debugging or whatever.

Zygmunt Krynicki (zyga) wrote :

Hey Roderick.

I agree that we could implement a different behavior for the "job definition has changed" exception. When I wrote that code, I noted that we could do a lot more here, but at the time, simply detecting the situation and rejecting further processing was the safest thing we could do. It prevents us from chasing "impossible" results created by running half of the session with one version of some provider and the other half with another, somehow incompatible, version.

If we want to, we can continue discussing what the user interface should be there, or whether that exception should be less strict (or at least what that would mean).

Zygmunt Krynicki (zyga) wrote :

So this bug (crash) has been fixed for a long time. We now capture SessionResumeError and give users a few options on how to continue.

Changed in plainbox:
status: In Progress → Fix Released
Changed in checkbox-ng:
status: In Progress → Fix Released
Zygmunt Krynicki (zyga)
Changed in plainbox:
milestone: future → none
Changed in checkbox-ng:
milestone: future → none