Unclear failure in TacTestSetup

Bug #889048 reported by Jeroen T. Vermeulen
This bug affects 1 person
Affects            Status    Importance  Assigned to  Milestone
Launchpad itself   Triaged   High        Unassigned

Bug Description

A buildmanager test is giving me this strange error:

    TacException: Could not kill stale process /var/tmp/buildd/build-slave.pid.

When I check, /var/tmp/buildd no longer exists and the “stale process” is no longer running.

At the source of this error, in canonical.launchpad.daemons.tachandler, TacTestSetup.setUp checks whether the daemon it's trying to start might already be running. To check that, it looks for an existing pidfile for the daemon.

If the daemon already appears to be running, then setUp takes roughly the following steps:
 1. Kill it.
 2. Wait for a child process to die.
 3. If pidfile still exists, fail.
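
The three steps above can be sketched roughly as follows. This is an illustration only, not the actual TacTestSetup code: the names read_pid and kill_stale_daemon are hypothetical, and a single SIGTERM stands in for the real two-stage kill.

```python
import os
import signal
import time


def read_pid(pidfile):
    """Return the pid stored in pidfile, or None if it is missing or unreadable."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def kill_stale_daemon(pidfile, timeout=5.0, poll_interval=0.05):
    """Hypothetical helper: kill the process named in pidfile.

    Raises RuntimeError if the pidfile is still there afterwards.
    """
    pid = read_pid(pidfile)
    if pid is None:
        return  # No pidfile, so the daemon does not appear to be running.

    # Step 1: kill it (assumption: a plain SIGTERM is enough for this sketch).
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # The process is already gone.

    # Step 2: wait for that specific child process to die.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            reaped, _status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            break  # Not our child, or it has already been reaped.
        if reaped == pid:
            break
        time.sleep(poll_interval)

    # Step 3: if the pidfile still exists, fail.
    if os.path.exists(pidfile):
        raise RuntimeError("Could not kill stale process %s." % pidfile)
```

Note that step 3 only checks the pidfile, not the process. If the killed process never removes its own pidfile, or something else recreates it, this check fails even though the kill itself succeeded.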

That last failure is happening to my branch, but it's a Heisenbug: a bit of debug code inserted between steps 2 and 3 makes it go away.

What I think is happening is this:
 1. Some *different* child process dies first.
 2. The kill function (lp.osutils.two_stage_kill) polls until it receives notification that a child has died.
 3. two_stage_kill gets notification, but about the wrong child process.
 4. Returns early.
 5. TacTestSetup.setUp checks for the pidfile to be gone.
 6. It's not. Kaboom.
 7. Killed daemon cleans up its pidfile.
 8. Killed daemon dies.
 9. Developer receives error.
10. Puzzled frown.
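
To illustrate the suspected failure mode: waiting for "any child" can report a different child than the one that was killed, whereas waiting on a specific pid cannot. The sketch below uses hypothetical helper names (spawn_sleeper, wait_for_any_child, wait_for_child) and does not reflect how two_stage_kill is actually written.

```python
import os
import subprocess
import sys


def spawn_sleeper(seconds):
    """Start a child process that just sleeps for the given number of seconds."""
    code = "import time; time.sleep(%r)" % seconds
    return subprocess.Popen([sys.executable, "-c", code])


def wait_for_any_child():
    """Block until *some* child exits and return its pid.

    With pid -1, os.waitpid reports whichever child dies first, which is
    not necessarily the one the caller cares about.
    """
    pid, _status = os.waitpid(-1, 0)
    return pid


def wait_for_child(pid):
    """Block until the child with exactly this pid exits and return its pid."""
    reaped, _status = os.waitpid(pid, 0)
    return reaped
```

If a short-lived sibling exits first, wait_for_any_child() returns the sibling's pid while the daemon is still shutting down; wait_for_child(daemon_pid) only returns once the daemon itself has exited.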

Jeroen T. Vermeulen (jtv) wrote :

Nope, it doesn't seem to be that. We do wait for the right pid.

Jeroen T. Vermeulen (jtv) wrote :

Still, adding some sleep after a successful kill, or even after a successful os.waitpid, prevents the failure.

It's as if os.waitpid(pid, os.WNOHANG) is returning success before the child has been able to clean up the pidfile.
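
For reference, a minimal sketch of what os.waitpid(pid, os.WNOHANG) reports: it returns (0, 0) while the child is still running, and (pid, status) only once the child has terminated. The helper name poll_child is hypothetical.

```python
import os


def poll_child(pid):
    """Return None while the child is still running, else its wait status."""
    reaped, status = os.waitpid(pid, os.WNOHANG)
    if reaped == 0:
        return None  # WNOHANG: the child has not exited yet.
    return status
```

Since a non-None result means the child has already terminated, any cleanup that child does on its way out should have finished by then. A pidfile that is still present at that point suggests it may belong to, or have been rewritten by, some other process.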

Jeroen T. Vermeulen (jtv) wrote :

Found it! There were two instances of the test fixture.

Changed in launchpad:
status: Triaged → Invalid
Martin Pool (mbp) wrote :

I hit this too. I'm not sure why you closed this, because the failure message is in fact unclear. There are several ways you can hit it.

I put up https://code.launchpad.net/~mbp/rabbitfixture/rabbit-startup/+merge/80285 which gives a somewhat better message. It is not yet deployed in lp.

Robert suggests dumping the fixture details which will give more information and be more robust.

Changed in launchpad:
assignee: nobody → Martin Pool (mbp)
status: Invalid → Fix Committed
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → In Progress
Martin Pool (mbp)
Changed in launchpad:
assignee: Martin Pool (mbp) → nobody
status: In Progress → Triaged