intermittent unit failure: periodicWorkerSuite.TestWaitWithJitter

Bug #1745332 reported by Anastasia on 2018-01-25
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
juju
Medium
John A Meinel
2.3
Medium
John A Meinel

Bug Description

As per, http://ci.jujucharms.com/blue/organizations/jenkins/github-merge-juju/detail/github-merge-juju/797/pipeline:

FAIL: periodicworker_test.go:74: periodicWorkerSuite.TestWaitWithJitter

periodicworker_test.go:105:

    tPeriod.CheckCalls(c, []jtesting.StubCall{{

        FuncName: "nextPeriod",

        Args: []interface{}{testing.ShortWait, float64(0.2)},

    }})

/home/ubuntu/juju-core_2.3.3/src/github.com/juju/testing/stub.go:248:

    return c.Check(funcNames, jc.DeepEquals, expected)

... obtained []string = []string(nil)

... expected []string = []string{"nextPeriod"}

... mismatch at top level: length mismatch, 0 vs 1; obtained []string(nil); expected []string{"nextPeriod"}

OOPS: 16 passed, 1 FAILED

--- FAIL: TestPackage (3.17s)

FAIL

FAIL github.com/juju/juju/worker 4.194s

John A Meinel (jameinel) wrote :
summary: - intermittent unit failure:
+ intermittent unit failure: periodicWorkerSuite.TestWaitWithJitter
John A Meinel (jameinel) wrote :

I believe the test is flaky because there isn't a synchronization between when the func is called and when we call NewPeriod.

Specifically, the core loop looks like:

for {
  select {
  ...
  case <-timer.CountDown():
    ...
    call(...)
  }
  timer.Reset(nextPeriod(period, w.jitter))
}

But the test synchronization is:

myFunc() {
  called <- struct{}{}
  return
}

Which means that the tests waits for the call() to be triggered, but there is no guarantee that the goroutine will get all the way to timer.Reset(nextPeriod) before we wake up noticing that call() has been triggered.

We could
a) compute newPeriod before we go into the call
b) wait to evaluate calls until we get called another time (we must have called timer.Reset(newPeriod) before we get called.

John A Meinel (jameinel) wrote :

You can get it to fail every time with this change:
diff --git a/worker/periodicworker.go b/worker/periodicworker.go
index 27d0f94..5ef55ae 100644
--- a/worker/periodicworker.go
+++ b/worker/periodicworker.go
@@ -108,6 +108,7 @@ func (w *periodicWorker) run(call PeriodicWorkerCall, period time.Duration) erro
                                return err
                        }
                }
+ time.Sleep(time.Millisecond)
                timer.Reset(nextPeriod(period, w.jitter))
        }
 }

(sleep for 1ms after call() before calling nextPeriod).

We need to fix the test to not assume nextPeriod will always be called *immediately* after.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
milestone: none → 2.4-beta1
John A Meinel (jameinel) wrote :

The other oddity in the test is this case:
 select {
 case <-funcHasRun:
  c.Fatalf("After the kill we don't expect anymore calls to the function")
 case <-time.After(defaultFireOnceWait):
 }

defaultPeriod = time.Second
defaultFireOnceWait = defaultPeriod / 2

It is tied to "defaultPeriod" but the actual length of time we are waiting is "testing.ShortWait". While it is true that defaultFireOnceWait is significantly longer than testing.ShortWait there is nothing that ties the two times together.
It would make more sense to do something lie:
testing.ShortWait * 2
or something else that clearly means "if it was going to fire, it should have fired by now". It doesn't hurt that ShortWait*2 is also a lot faster than 500ms.

John A Meinel (jameinel) wrote :

merged from 2.3 into develop from https://github.com/juju/juju/pull/8384

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers