WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec race on unclean teardown

Bug #1756685 reported by John A Meinel on 2018-03-18
Affects: juju
Status: Fix Released
Assigned to: Ian Booth

Bug Description

It seems that if this test fails, it puts the whole package into a racy state:

(It's possible the race caused the failure, but it looks more like the failure triggers cleanup paths that race.)

FAIL: worker_test.go:716: WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec

    c.Fatal("timed out sending units change")
... Error: timed out sending units change

    c.Fatal("timed out sending units change")
    c.Errorf("timed out waiting for worker to stop")
... Error: timed out waiting for worker to stop

    c.Fatal("timed out sending units change")
    c.Check(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"workertest: worker not stopping", cause:error(nil), previous:error(nil), file:"", line:59} ("workertest: worker not stopping")
... error stack: workertest: worker not stopping

Write at 0x00c420110af8 by goroutine 138:
  (*WorkerSuite).SetUpTest()
      /workspace/src/ +0x94f
      /snap/go/1473/src/runtime/asm_amd64.s:573 +0x3a
      /snap/go/1473/src/reflect/value.go:308 +0xc0
  (*suiteRunner).runFixture.func1()
      /workspace/src/ +0x177
  (*suiteRunner).forkCall.func1()
      /workspace/src/ +0x89

Previous write at 0x00c420110af8 by goroutine 140:
      /snap/go/1473/src/runtime/race_amd64.s:269 +0xb
      /snap/go/1473/src/sync/mutex.go:182 +0x54
  (*Stub).addCall()
      /workspace/src/ +0x2bc
  (*Stub).MethodCall()
      /workspace/src/ +0x88
  (*mockContainerBroker).DeleteUnit()
      /workspace/src/ +0x113
  (*applicationWorker).loop()
      /workspace/src/ +0xbdf
  (*applicationWorker).(
      /workspace/src/ +0x41
      /workspace/src/ +0x66
      /workspace/src/ +0x8e

Goroutine 138 (running) created at:
  (*suiteRunner).forkCall()
      /workspace/src/ +0x419
  (*suiteRunner).runFunc()
      /workspace/src/ +0x7e
  (*suiteRunner).runFixture()
      /workspace/src/ +0x7e
  (*suiteRunner).runFixtureWithPanic()
      /workspace/src/ +0xa7
  (*suiteRunner).forkTest.func1()
      /workspace/src/ +0x207
  (*suiteRunner).forkCall.func1()
      /workspace/src/ +0x89

Goroutine 140 (running) created at:
      /workspace/src/ +0x273
      /workspace/src/ +0x46d
  (*provisioner).loop()
      /workspace/src/ +0x61c
  (*provisioner).(
      /workspace/src/ +0x41
      /workspace/src/ +0x66
      /workspace/src/ +0x8e

John A Meinel (jameinel) on 2018-03-18
description: updated
John A Meinel (jameinel) on 2018-03-18
Changed in juju:
assignee: John A Meinel (jameinel) → nobody
John A Meinel (jameinel) wrote :

Digging through it, this is what I see:

 :121 has a bare channel send that could easily block forever:

     w, ok := appWorkers[appId]
     if ok {
      // Before stopping the application worker, inform it that
      // the app is gone so it has a chance to clean up.
      // The worker will act on the removed prior to processing the
      // Stop() request.
      p.appRemoved <- struct{}{}
      if err := worker.Stop(w); err != nil {
       logger.Errorf("stopping application worker for %v: %v", appId, err)
      }
      delete(appWorkers, appId)
     }

We are also sending on a channel that is listened to by *all* application workers. There is no guarantee that the worker responsible for appId will be the one that actually handles it, in application_worker.go:134:
  case <-aw.appRemoved:

So if you ever have 2 applications running, stopping one of them will randomly kill either one, since receiving that signal means that all of the receiving worker's units are forcibly destroyed.

What *should* happen if we are tearing down at the same time? It's entirely plausible that a goroutine is executing "for _, appId := range apps" at exactly the same moment that the application worker notices aw.catacomb.Dying().
And if applicationWorker.loop() notices catacomb.Dying() before the caasunitprovisioner worker decides it needs to delete that app, then the caasunitprovisioner worker will block indefinitely on the send.

*Something like*

Delete p.appRemoved (it doesn't belong on the provisioner) and just look directly at the application worker details.
Either that, or get rid of appRemoved entirely and only ever trigger cleanup via Worker.Stop().
--- a/worker/caasunitprovisioner/application_worker.go
+++ b/worker/caasunitprovisioner/application_worker.go
@@ -48,7 +48,7 @@ func newApplicationWorker(
        applicationUpdater ApplicationUpdater,
        unitGetter UnitGetter,
        unitUpdater UnitUpdater,
-) (worker.Worker, error) {
+) (*applicationWorker, error) {
        w := &applicationWorker{
                application: application,
                jujuManagedUnits: jujuManagedUnits,
diff --git a/worker/caasunitprovisioner/worker.go b/worker/caasunitprovisioner/worker.go
index 15f076dbf3..b661245259 100644
--- a/worker/caasunitprovisioner/worker.go
+++ b/worker/caasunitprovisioner/worker.go
@@ -73,8 +73,6 @@ func NewWorker(config Config) (worker.Worker, error) {
 type provisioner struct {
        catacomb catacomb.Catacomb
        config Config
- appRemoved chan struct{}

 // Kill is part of the worker.Worker interface.
@@ -98,8 +96,7 @@ func (p *provisioner) loop() error {

        // The channel is unbuffered to that we block until
        // requests are processed.
- p.appRemoved = make(chan struct{})
- appWorkers := make(map[string]worker.Worker)
+ appWorkers := make(map[string]*applicationWorker)
        for {
                select {
                case <-p.catacomb.Dying():
@@ -124,7 +121,16 @@ func (p *provisioner) loop() error {
                                                // the app is gone so it has a chance to clean up.
                                                // The worker will act on the r...


John A Meinel (jameinel) wrote :

Can you look over my potential patch, @wallyworld?

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld) on 2018-03-19
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released