PCJ race between process-job-source.py and celery can generate OOPS

Bug #1314569 reported by Colin Watson on 2014-04-30
This bug affects 2 people
Affects: Launchpad itself
Importance: Critical
Assigned to: Unassigned

Bug Description

I got OOPS-7d0f700be19191e98139cdab67a81ea7, which is:

  InvalidTransition: Transition from Running to Running is invalid.

Traceback (most recent call last):
  Module lazr.jobrunner.jobrunner, line 194, in runJobHandleError
    self.runJob(job, fallback)
  Module lp.services.job.runner, line 289, in runJob
    super(BaseJobRunner, self).runJob(IRunnableJob(job), fallback)
  Module lazr.jobrunner.jobrunner, line 159, in runJob
    job.start(manage_transaction=True)
  Module lp.services.job.model.job, line 169, in start
    self._set_status(JobStatus.RUNNING)
  Module lp.services.job.model.job, line 120, in _set_status
    raise InvalidTransition(self._status, status)
InvalidTransition: Transition from Running to Running is invalid.

    <oops-message-0>: {'target_archive_id': 1, 'package_copy_job_type': 'Copy packages between archives.', 'job_id': 23532039, 'target_distroseries_id': 108, 'package_copy_job_id': 279234, 'source_archive_id': 1}

This was because the job had been picked up by celery at almost exactly the same time:

[2014-04-30 09:23:13,769: DEBUG3/PoolWorker-3] new transaction
[2014-04-30 09:23:13,881: INFO/PoolWorker-3] Running <PlainPackageCopyJob to copy package gnome-settings-daemon from ubuntu/primary to ubuntu/primary, UPDATES pocket, in ubuntu precise, including binaries> (ID 23532039) in status Waiting

2014-04-30 09:23:13 DEBUG Trying to acquire lease for job in state Waiting
2014-04-30 09:23:13 INFO Running <PlainPackageCopyJob to copy package gnome-settings-daemon from ubuntu/primary to ubuntu/primary, UPDATES pocket, in ubuntu precise, including binaries> (ID 23532039) in status Running
2014-04-30 09:23:14 INFO Job resulted in OOPS: OOPS-7d0f700be19191e98139cdab67a81ea7

So this is harmless in that the copy happened anyway, but it is Critical by Launchpad bug policy, since a routine race like this shouldn't generate an OOPS.
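To illustrate where the error message comes from, here is a minimal sketch of a status guard that rejects a second start on a job that is already running. It is not Launchpad's code: the transition table, the `Job` class, and the set of statuses are assumptions made for the example; only the names `InvalidTransition`, `_set_status`, `start`, and the message text are taken from the traceback above.

```python
from enum import Enum


class JobStatus(Enum):
    WAITING = "Waiting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Assumed transition table, for illustration only.
VALID_TRANSITIONS = {
    JobStatus.WAITING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.WAITING},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: {JobStatus.WAITING},
}


class InvalidTransition(Exception):
    def __init__(self, current, requested):
        super().__init__(
            "Transition from %s to %s is invalid."
            % (current.value, requested.value)
        )


class Job:
    """Hypothetical stand-in for a job row with a guarded status."""

    def __init__(self):
        self.status = JobStatus.WAITING

    def start(self):
        self._set_status(JobStatus.RUNNING)

    def _set_status(self, status):
        # Refuse any transition not listed for the current status.
        if status not in VALID_TRANSITIONS[self.status]:
            raise InvalidTransition(self.status, status)
        self.status = status


job = Job()
job.start()  # first runner: Waiting -> Running
try:
    job.start()  # second runner sees Running and tries to start again
except InvalidTransition as error:
    print(error)
```

In this sketch the second `start()` fails in the same way as the OOPS: the job is already Running by the time the second runner tries to move it to Running.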

I thought the point of acquiring a lease for the job was that it couldn't be picked up by another job runner. Does celery not honour that?

Colin Watson (cjwatson) wrote :

I think the problem may be in lazr.jobrunner. RunJob.run does indeed call job.acquireLease(), but unlike JobRunner.runAll it doesn't commit the transaction at that point, so other processes won't see the lease.
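The visibility problem described above can be reproduced in miniature. The following is a sketch only, using SQLite from the Python standard library rather than Launchpad's actual database layer; the `job` table, the `lease_holder` column, and the helper function are hypothetical names chosen for the example. It shows that a lease written inside an open transaction is invisible to a second connection until the first one commits.

```python
import os
import sqlite3
import tempfile


def lease_seen_by_other_runner(commit_after_lease):
    """Return the lease holder a second connection sees for the job."""
    path = os.path.join(tempfile.mkdtemp(), "jobs.db")

    # Runner A sets up a job with no lease and commits that state.
    runner_a = sqlite3.connect(path)
    runner_a.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, lease_holder TEXT)"
    )
    runner_a.execute("INSERT INTO job (id, lease_holder) VALUES (1, NULL)")
    runner_a.commit()

    # Runner A acquires the lease; the UPDATE opens a transaction.
    runner_a.execute("UPDATE job SET lease_holder = 'runner-a' WHERE id = 1")
    if commit_after_lease:
        runner_a.commit()

    # Runner B checks the lease from its own connection.
    runner_b = sqlite3.connect(path)
    (holder,) = runner_b.execute(
        "SELECT lease_holder FROM job WHERE id = 1"
    ).fetchone()
    runner_b.close()

    runner_a.rollback()  # discard the lease if it was never committed
    runner_a.close()
    return holder


print(lease_seen_by_other_runner(commit_after_lease=False))
print(lease_seen_by_other_runner(commit_after_lease=True))
```

Without the commit, the second connection still sees no lease holder and would consider the job free to pick up; with the commit, it sees the lease. If the diagnosis above is right, committing immediately after acquiring the lease would close the window in the same way.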

William Grant (wgrant) on 2015-07-29
Changed in launchpad:
status: New → Triaged