WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec race on unclean teardown

Bug #1756685 reported by John A Meinel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Ian Booth

Bug Description

It seems if the test fails, then it sets the whole package into race issues:
http://10.125.0.203:8080/job/RunUnittests-race-amd64/194/testReport/junit/github/com_juju_juju_worker_caasunitprovisioner/TestPackage/

github.com/juju/juju/worker/caasunitprovisioner/worker_test.go
WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec

(its possible the race caused the failure, but it looks more like the failure triggers cleanup paths that race.)

FAIL: worker_test.go:716: WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec

worker_test.go:738:
    c.Fatal("timed out sending units change")
... Error: timed out sending units change

worker_test.go:738:
    c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:58:
    c.Errorf("timed out waiting for worker to stop")
... Error: timed out waiting for worker to stop

worker_test.go:738:
    c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:83:
    c.Check(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"workertest: worker not stopping", cause:error(nil), previous:error(nil), file:"github.com/juju/juju/worker/workertest/check.go", line:59} ("workertest: worker not stopping")
... error stack:
 github.com/juju/juju/worker/workertest/check.go:59: workertest: worker not stopping

==================
WARNING: DATA RACE
Write at 0x00c420110af8 by goroutine 138:
  github.com/juju/juju/worker/caasunitprovisioner_test.(*WorkerSuite).SetUpTest()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker_test.go:113 +0x94f
  runtime.call32()
      /snap/go/1473/src/runtime/asm_amd64.s:573 +0x3a
  reflect.Value.Call()
      /snap/go/1473/src/reflect/value.go:308 +0xc0
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /workspace/src/gopkg.in/check.v1/check.go:721 +0x177
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /workspace/src/gopkg.in/check.v1/check.go:666 +0x89

Previous write at 0x00c420110af8 by goroutine 140:
  sync/atomic.AddInt32()
      /snap/go/1473/src/runtime/race_amd64.s:269 +0xb
  sync.(*Mutex).Unlock()
      /snap/go/1473/src/sync/mutex.go:182 +0x54
  github.com/juju/testing.(*Stub).addCall()
      /workspace/src/github.com/juju/testing/stub.go:145 +0x2bc
  github.com/juju/testing.(*Stub).MethodCall()
      /workspace/src/github.com/juju/testing/stub.go:174 +0x88
  github.com/juju/juju/worker/caasunitprovisioner_test.(*mockContainerBroker).DeleteUnit()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/mock_test.go:95 +0x113
  github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).loop()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:225 +0xbdf
  github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:68 +0x41
  github.com/juju/juju/worker/catacomb.runSafely()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
  github.com/juju/juju/worker/catacomb.Invoke.func3()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e

Goroutine 138 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /workspace/src/gopkg.in/check.v1/check.go:663 +0x419
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /workspace/src/gopkg.in/check.v1/check.go:673 +0x7e
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /workspace/src/gopkg.in/check.v1/check.go:717 +0x7e
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /workspace/src/gopkg.in/check.v1/check.go:735 +0xa7
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /workspace/src/gopkg.in/check.v1/check.go:760 +0x207
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /workspace/src/gopkg.in/check.v1/check.go:666 +0x89

Goroutine 140 (running) created at:
  github.com/juju/juju/worker/catacomb.Invoke()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:113 +0x273
  github.com/juju/juju/worker/caasunitprovisioner.newApplicationWorker()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:66 +0x46d
  github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).loop()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:148 +0x61c
  github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:68 +0x41
  github.com/juju/juju/worker/catacomb.runSafely()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
  github.com/juju/juju/worker/catacomb.Invoke.func3()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e
==================

John A Meinel (jameinel)
description: updated
John A Meinel (jameinel)
Changed in juju:
assignee: John A Meinel (jameinel) → nobody
Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (5.1 KiB)

Digging through it, this is what I see:

worker/caasunitprovisioner/worker.go:
 :121 bare channel send that could easily block forever
     w, ok := appWorkers[appId]
     if ok {
      // Before stopping the application worker, inform it that
      // the app is gone so it has a chance to clean up.
      // The worker will act on the removed prior to processing the
      // Stop() request.
      p.appRemoved <- struct{}{}
      if err := worker.Stop(w); err != nil {
       logger.Errorf("stopping application worker for %v: %v", appId, err)
      }
      delete(appWorkers, appId)
     }

We are also sending on a channel, that is being listened to by *all* application workers. There is no guarantee that the worker responsible for appId will be the one that actually handles it in 'application_worker.go:134'
  case <-aw.appRemoved:

So if you ever have 2 applications running, stopping one of them will randomly kill one of them, since receiving that signal means that all units are forcibly destroyed.

What *should* happen if we are trying to tear down at the same time? It's entirely plausible that a goroutine would be hitting for _, appId := range apps at exactly the same time that application_worker notices that aw.catacomb.Dying().
And if the applicationWorker.loop() notices catacomb.Dying before the caasunitprovisioner worker decides it needs to delete that app, then caasunitprovisioner worker will block indefinitely.

*Something like*

delete p.appRemoved, it doesn't belong on the provisioner, and just look directly at the application worker details.
Either that, or get rid of appRemoved and *only* ever trigger on Worker.Stop()
--- a/worker/caasunitprovisioner/application_worker.go
+++ b/worker/caasunitprovisioner/application_worker.go
@@ -48,7 +48,7 @@ func newApplicationWorker(
        applicationUpdater ApplicationUpdater,
        unitGetter UnitGetter,
        unitUpdater UnitUpdater,
-) (worker.Worker, error) {
+) (*applicationWorker, error) {
        w := &applicationWorker{
                application: application,
                jujuManagedUnits: jujuManagedUnits,
diff --git a/worker/caasunitprovisioner/worker.go b/worker/caasunitprovisioner/worker.go
index 15f076dbf3..b661245259 100644
--- a/worker/caasunitprovisioner/worker.go
+++ b/worker/caasunitprovisioner/worker.go
@@ -73,8 +73,6 @@ func NewWorker(config Config) (worker.Worker, error) {
 type provisioner struct {
        catacomb catacomb.Catacomb
        config Config
-
- appRemoved chan struct{}
 }

 // Kill is part of the worker.Worker interface.
@@ -98,8 +96,7 @@ func (p *provisioner) loop() error {

        // The channel is unbuffered to that we block until
        // requests are processed.
- p.appRemoved = make(chan struct{})
- appWorkers := make(map[string]worker.Worker)
+ appWorkers := make(map[string]*applicationWorker)
        for {
                select {
                case <-p.catacomb.Dying():
@@ -124,7 +121,16 @@ func (p *provisioner) loop() error {
                                                // the app is gone so it has a chance to clean up.
                                                // The worker will act on the r...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

Can you look over my potential patch, @wallyworld?

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.