WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec race on unclean teardown

Bug #1756685 reported by John A Meinel on 2018-03-18
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
High
Ian Booth

Bug Description

It seems if the test fails, then it sets the whole package into race issues:
http://10.125.0.203:8080/job/RunUnittests-race-amd64/194/testReport/junit/github/com_juju_juju_worker_caasunitprovisioner/TestPackage/

github.com/juju/juju/worker/caasunitprovisioner/worker_test.go
WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec

(its possible the race caused the failure, but it looks more like the failure triggers cleanup paths that race.)

FAIL: worker_test.go:716: WorkerSuite.TestRemoveUnitStopsWatchingContainerSpec

worker_test.go:738:
    c.Fatal("timed out sending units change")
... Error: timed out sending units change

worker_test.go:738:
    c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:58:
    c.Errorf("timed out waiting for worker to stop")
... Error: timed out waiting for worker to stop

worker_test.go:738:
    c.Fatal("timed out sending units change")
/workspace/src/github.com/juju/juju/worker/workertest/check.go:83:
    c.Check(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"workertest: worker not stopping", cause:error(nil), previous:error(nil), file:"github.com/juju/juju/worker/workertest/check.go", line:59} ("workertest: worker not stopping")
... error stack:
 github.com/juju/juju/worker/workertest/check.go:59: workertest: worker not stopping

==================
WARNING: DATA RACE
Write at 0x00c420110af8 by goroutine 138:
  github.com/juju/juju/worker/caasunitprovisioner_test.(*WorkerSuite).SetUpTest()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker_test.go:113 +0x94f
  runtime.call32()
      /snap/go/1473/src/runtime/asm_amd64.s:573 +0x3a
  reflect.Value.Call()
      /snap/go/1473/src/reflect/value.go:308 +0xc0
  gopkg.in/check%2ev1.(*suiteRunner).runFixture.func1()
      /workspace/src/gopkg.in/check.v1/check.go:721 +0x177
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /workspace/src/gopkg.in/check.v1/check.go:666 +0x89

Previous write at 0x00c420110af8 by goroutine 140:
  sync/atomic.AddInt32()
      /snap/go/1473/src/runtime/race_amd64.s:269 +0xb
  sync.(*Mutex).Unlock()
      /snap/go/1473/src/sync/mutex.go:182 +0x54
  github.com/juju/testing.(*Stub).addCall()
      /workspace/src/github.com/juju/testing/stub.go:145 +0x2bc
  github.com/juju/testing.(*Stub).MethodCall()
      /workspace/src/github.com/juju/testing/stub.go:174 +0x88
  github.com/juju/juju/worker/caasunitprovisioner_test.(*mockContainerBroker).DeleteUnit()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/mock_test.go:95 +0x113
  github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).loop()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:225 +0xbdf
  github.com/juju/juju/worker/caasunitprovisioner.(*applicationWorker).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:68 +0x41
  github.com/juju/juju/worker/catacomb.runSafely()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
  github.com/juju/juju/worker/catacomb.Invoke.func3()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e

Goroutine 138 (running) created at:
  gopkg.in/check%2ev1.(*suiteRunner).forkCall()
      /workspace/src/gopkg.in/check.v1/check.go:663 +0x419
  gopkg.in/check%2ev1.(*suiteRunner).runFunc()
      /workspace/src/gopkg.in/check.v1/check.go:673 +0x7e
  gopkg.in/check%2ev1.(*suiteRunner).runFixture()
      /workspace/src/gopkg.in/check.v1/check.go:717 +0x7e
  gopkg.in/check%2ev1.(*suiteRunner).runFixtureWithPanic()
      /workspace/src/gopkg.in/check.v1/check.go:735 +0xa7
  gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1()
      /workspace/src/gopkg.in/check.v1/check.go:760 +0x207
  gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
      /workspace/src/gopkg.in/check.v1/check.go:666 +0x89

Goroutine 140 (running) created at:
  github.com/juju/juju/worker/catacomb.Invoke()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:113 +0x273
  github.com/juju/juju/worker/caasunitprovisioner.newApplicationWorker()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/application_worker.go:66 +0x46d
  github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).loop()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:148 +0x61c
  github.com/juju/juju/worker/caasunitprovisioner.(*provisioner).(github.com/juju/juju/worker/caasunitprovisioner.loop)-fm()
      /workspace/src/github.com/juju/juju/worker/caasunitprovisioner/worker.go:68 +0x41
  github.com/juju/juju/worker/catacomb.runSafely()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:289 +0x66
  github.com/juju/juju/worker/catacomb.Invoke.func3()
      /workspace/src/github.com/juju/juju/worker/catacomb/catacomb.go:116 +0x8e
==================

John A Meinel (jameinel) on 2018-03-18
description: updated
John A Meinel (jameinel) on 2018-03-18
Changed in juju:
assignee: John A Meinel (jameinel) → nobody
John A Meinel (jameinel) wrote :
Download full text (5.1 KiB)

Digging through it, this is what I see:

worker/caasunitprovisioner/worker.go:
 :121 bare channel send that could easily block forever
     w, ok := appWorkers[appId]
     if ok {
      // Before stopping the application worker, inform it that
      // the app is gone so it has a chance to clean up.
      // The worker will act on the removed prior to processing the
      // Stop() request.
      p.appRemoved <- struct{}{}
      if err := worker.Stop(w); err != nil {
       logger.Errorf("stopping application worker for %v: %v", appId, err)
      }
      delete(appWorkers, appId)
     }

We are also sending on a channel, that is being listened to by *all* application workers. There is no guarantee that the worker responsible for appId will be the one that actually handles it in 'application_worker.go:134'
  case <-aw.appRemoved:

So if you ever have 2 applications running, stopping one of them will randomly kill one of them, since receiving that signal means that all units are forcibly destroyed.

What *should* happen if we are trying to tear down at the same time? It's entirely plausible that a goroutine would be hitting for _, appId := range apps at exactly the same time that application_worker notices that aw.catacomb.Dying().
And if the applicationWorker.loop() notices catacomb.Dying before the caasunitprovisioner worker decides it needs to delete that app, then caasunitprovisioner worker will block indefinitely.

*Something like*

delete p.appRemoved, it doesn't belong on the provisioner, and just look directly at the application worker details.
Either that, or get rid of appRemoved and *only* ever trigger on Worker.Stop()
--- a/worker/caasunitprovisioner/application_worker.go
+++ b/worker/caasunitprovisioner/application_worker.go
@@ -48,7 +48,7 @@ func newApplicationWorker(
        applicationUpdater ApplicationUpdater,
        unitGetter UnitGetter,
        unitUpdater UnitUpdater,
-) (worker.Worker, error) {
+) (*applicationWorker, error) {
        w := &applicationWorker{
                application: application,
                jujuManagedUnits: jujuManagedUnits,
diff --git a/worker/caasunitprovisioner/worker.go b/worker/caasunitprovisioner/worker.go
index 15f076dbf3..b661245259 100644
--- a/worker/caasunitprovisioner/worker.go
+++ b/worker/caasunitprovisioner/worker.go
@@ -73,8 +73,6 @@ func NewWorker(config Config) (worker.Worker, error) {
 type provisioner struct {
        catacomb catacomb.Catacomb
        config Config
-
- appRemoved chan struct{}
 }

 // Kill is part of the worker.Worker interface.
@@ -98,8 +96,7 @@ func (p *provisioner) loop() error {

        // The channel is unbuffered to that we block until
        // requests are processed.
- p.appRemoved = make(chan struct{})
- appWorkers := make(map[string]worker.Worker)
+ appWorkers := make(map[string]*applicationWorker)
        for {
                select {
                case <-p.catacomb.Dying():
@@ -124,7 +121,16 @@ func (p *provisioner) loop() error {
                                                // the app is gone so it has a chance to clean up.
                                                // The worker will act on the r...

Read more...

John A Meinel (jameinel) wrote :

Can you look over my potential patch, @wallyworld?

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Ian Booth (wallyworld) on 2018-03-19
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers