TestRunBacklogFailedContinuesDiffApp is racy

Bug #1483079 reported by Michael Hudson-Doyle on 2015-08-10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-push (Ubuntu)

Bug Description

TestRunBacklogFailedContinuesDiffApp fails with Go 1.5 like this:

    // Everything up to here was just set-up.
    // What we're checking for is that, if a helper launch fails, the
    // next one in the backlog is picked up.
    c.Assert(takeNext(ch, c).Input.App, Equals, app1)
... obtained *click.AppId = &click.AppId{Package:"com.example.test", Application:"test-app-2", Version:"", Click:true, original:"com.example.test_test-app-2"} ("com.example.test_test-app-2")
... expected *click.AppId = &click.AppId{Package:"com.example.test", Application:"test-app-1", Version:"", Click:true, original:"com.example.test_test-app-1"} ("com.example.test_test-app-1")

This is a race. It's a bit confusing -- and I don't really understand this code -- but it's something like this:

TestRunBacklogFailedContinuesDiffApp calls s.pool.Start(). This starts pool.loop() in it's own goroutine (the "loop goroutine").

The test causes the "in, ok := <-pool.chIn" case of the select in loop to be entered twice: the first time the loop goroutine calls handleOne and the second time it just puts the input in backlog.

The test then calls "go s.fakeLauncher.done("0")". This winds up calling pool.OneDone in a new goroutine. OneDone does two things relevant to this bug: it calls "pool.chDone <- args.Input.App" and it calls "pool.chOut <- res". The problem is that the loop goroutine is sitting in a select on pool.chDone and pool.chIn, and if it gets to run before OneDone does its second send, it will push the failure for app2 into chOut before OneDone pushes the result for app1. Hence the race above!

It's possible to make this fail on 1.4 by putting a small sleep after the send to chDone.

Related branches

Michael Hudson-Doyle (mwhudson) wrote :

There is a similar ish race in TestRunNthAppToBacklog: this time, the channel read on the other end of s.pool.Run("fake", input3) races with s.fakeLauncher.done("0"). If the latter completes first, no job gets onto the backlog and the test fails.

Changed in ubuntu-push:
assignee: nobody → Samuele Pedroni (pedronis)
Samuele Pedroni (pedronis) wrote :

I have a branch that should fix those,

as I expected I had to hook a bit in the code but not too bad,

otoh TestRunBacklogFailedContinuesDiffApp was actually doing a bit of harmless non-sense, not sure why it was doing ...done("2") there for example

affects: ubuntu-push → ubuntu-push (Ubuntu)
Changed in ubuntu-push (Ubuntu):
assignee: Samuele Pedroni (pedronis) → nobody
Changed in ubuntu-push (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers