Uniter can hang while trying to restart

Bug #1843376 reported by John A Meinel
This bug affects 3 people
Affects         Status        Importance  Assigned to       Milestone
Canonical Juju  Fix Released  High        Yang Kelvin Liu
2.6             Fix Released  High        Yang Kelvin Liu

Bug Description

I'll do some manual investigation, but we've seen evidence in the field of the Uniter not restarting properly. Tim did most of the legwork, identifying that worker/leadership/tracker.go has an issue with a blocking call to t.claimer.BlockUntilLeadershipReleased. The tracker turns that into a non-blocking call by wrapping it in a goroutine and turning the result into a send on the t.claimLease channel, which is properly selected on in the main Tracker.loop function.
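
Roughly, the pattern looks like this (a simplified sketch, not the actual tracker.go code; the field names, the blockUntilReleased parameter, and the explicit package/import are just stand-ins to make it self-contained):

package leadershipsketch

import "github.com/juju/errors"

type tracker struct {
    claimLease chan error    // carries the result of the blocking call
    dying      chan struct{} // closed when the tracker is shutting down
}

// setMinion stands in for the real code that wraps the blocking
// t.claimer.BlockUntilLeadershipReleased call in a goroutine.
func (t *tracker) setMinion(blockUntilReleased func() error) {
    t.claimLease = make(chan error)
    go func() {
        // The teardown defer shown below blocks on <-t.claimLease,
        // i.e. it waits for this send (or a close of the channel).
        t.claimLease <- blockUntilReleased()
    }()
}

func (t *tracker) loop() error {
    for {
        select {
        case <-t.dying:
            return errors.New("tracker stopping")
        case err := <-t.claimLease:
            if err != nil {
                return errors.Trace(err)
            }
            // Leadership was released: go back and claim it again.
        }
    }
}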

However, while being torn down, we have a defer with:
    defer func() {
        for _, ticketCh := range t.waitingLeader {
            close(ticketCh)
        }
        for _, ticketCh := range t.waitingMinion {
            close(ticketCh)
        }
        if t.claimLease != nil {
            // wait for the goroutine started
            // by setLeader to exit.
            <-t.claimLease
        }
    }()

That means we wait for claimLease to deliver a result or be closed. However, if we are tearing down because of tomb.Dying, we never actually return from BlockUntilLeadershipReleased, since the API server is still running just fine.

We might change it to simply not wait for claimLease, while making sure that we can still clean up goroutines during shutdown (sketched below).
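
One hedged reading of that first option, continuing the simplified sketch above (the abort parameter and the function name are illustrative, not real tracker.go fields): drop the <-t.claimLease wait from the defer, and have the goroutine abandon its send once the tracker is going away. The goroutine still lingers inside the blocking call until that call returns, but when it does return it exits cleanly instead of wedging on the send.

// setMinionNonBlocking is a hypothetical variant: the goroutine
// offers its result, but gives up if the tracker has gone away.
func (t *tracker) setMinionNonBlocking(blockUntilReleased func() error, abort <-chan struct{}) {
    t.claimLease = make(chan error)
    go func() {
        err := blockUntilReleased()
        select {
        case t.claimLease <- err:
            // The main loop took the result.
        case <-abort:
            // Shutting down: drop the result and exit instead of
            // blocking forever on the send.
        }
    }()
}
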
Alternatively, I did dig into whether we could thread a 'cancel' channel through to the rpc.Conn code, which also uses a bare channel receive:

func (conn *Conn) Call(req Request, params, response interface{}) error {
    call := &Call{
        Request:  req,
        Params:   params,
        Response: response,
        Done:     make(chan *Call, 1),
    }
    conn.send(call)
    result := <-call.Done
    return errors.Trace(result.Error)
}

Essentially, we could just turn that last receive from call.Done into a

    select {
    case result := <-call.Done:
        return errors.Trace(result.Error)
    case <-cancel:
        conn.cancel(call)
        return ErrCanceled
    }

Something like that. It's a little messy, as Call doesn't currently track its request id, but we could add that pretty easily.
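
Putting the pieces together, a cancellable variant of Call might look roughly like the following. This is only a sketch: the cancel parameter, the CallWithCancel name, the conn.cancel helper, and ErrCanceled don't exist in the rpc package today, and as noted above Call would need to learn its request id for conn.cancel to work.

// ErrCanceled and CallWithCancel are assumptions for illustration.
var ErrCanceled = errors.New("rpc: call canceled")

func (conn *Conn) CallWithCancel(req Request, params, response interface{}, cancel <-chan struct{}) error {
    call := &Call{
        Request:  req,
        Params:   params,
        Response: response,
        Done:     make(chan *Call, 1),
    }
    conn.send(call)
    select {
    case result := <-call.Done:
        return errors.Trace(result.Error)
    case <-cancel:
        // Hypothetical helper: tells the connection to forget this
        // request id so any late response is discarded.
        conn.cancel(call)
        return ErrCanceled
    }
}

One nice property of the existing code is that Done is buffered, so even after the caller has given up, a late response can still be delivered to the channel without blocking the sender.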

John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
Revision history for this message
John A Meinel (jameinel) wrote :

Just changing the code to not block during TearDown causes problems in the test suite, because it loses the synchronous behaviour. We use a hand-crafted mock/stub, issue a request and then a Kill, and expect that request to trigger 2 calls. But since our Kill() *should* kill things in flight, it wasn't a great test to begin with.

Revision history for this message
John A Meinel (jameinel) wrote :

There are 2 ways that the test TestOnLeaderFailure can fail:

1) We get to the BlockUntilLeadershipReleased call, and then we start shutting down. That causes BlockUntilLeadershipReleased to return ErrBlockCancelled (because in the test the leadershipClaimer isn't running over the API, so it does notice tomb.Dying()).

With my patch, we no longer block on the send to claimLease, which means we eventually close the claimLease channel as the goroutine exits.

In this case, you see 3 calls: Claim + Block + Claim.

The main loop then sees the closed claimLease channel *before* it sees tomb.Dying() itself. The loop had an unconditional rule of "if claimLease returns anything that isn't leadership.ErrBlockCancelled (and a closed channel returns a nil error), make another claim". That can be fixed by noticing that the channel was closed and treating it as block cancelled (there's a sketch of that check after point 2).

2) The other way it can fail is that the main loop *does* notice tomb.Dying while the setMinion goroutine is running. The goroutine does eventually call BlockUntilLeadershipReleased, but the main loop has already exited, and since it doesn't wait for that call to return or close the channel, the test moves on and never sees the request.

In this case you see 1 call, just to Claim.
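
As a sketch of the fix for the first failure mode (again using the simplified names from the description, not the real tracker fields; waitForRelease is a made-up helper), the comma-ok form of the receive lets the loop tell a closed claimLease apart from a genuine result:

func (t *tracker) waitForRelease() error {
    select {
    case <-t.dying:
        return errors.New("tracker stopping")
    case err, ok := <-t.claimLease:
        if !ok {
            // Closed by the exiting goroutine: treat it exactly like
            // leadership.ErrBlockCancelled instead of making another claim.
            return errors.New("lease block cancelled")
        }
        if err != nil {
            return errors.Trace(err)
        }
        // A genuine result: leadership was released, so claim again.
        return nil
    }
}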

Revision history for this message
John A Meinel (jameinel) wrote :

So that patch has landed; I'm still putting together a proper Wrench setup to test it. It seems to shut down correctly now, but it doesn't get restarted as we would expect.

I'll attach a goroutine report and an engine report.

Revision history for this message
John A Meinel (jameinel) wrote :

Having done the wrench work, I can see that the LeadershipTracker is stopping, but in such a way that it doesn't get restarted, and the Uniter depends on the LeadershipTracker running.

Revision history for this message
John A Meinel (jameinel) wrote :

To deal with the lifetimes of CAAS objects, commit 36b4353d50430c8ca31b7a66d8a853e6db5cc5cc introduced killing the Leadership Tracker when the Uniter dies. However, it does so with a Worker.Kill(), which stops the worker with no error, and the DependencyEngine treats that as "don't restart because I'm done". But the Uniter has a dependency on the tracker, so the Uniter cannot be restarted either.
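
As a heavily simplified model of that restart decision (this is not the real dependency engine code, just an illustration of the behaviour described above): a supervisor that only restarts workers which exit with an error will leave a nil-error exit alone, which is exactly what strands the Uniter here.

package restartsketch

import "time"

// supervise restarts start() whenever it fails, but treats a nil
// return as "finished cleanly, do not restart", which is the
// behaviour the no-error Kill() runs into.
func supervise(start func() error, backoff time.Duration) {
    for {
        if err := start(); err == nil {
            return // clean exit: the worker is never restarted
        }
        time.Sleep(backoff) // back off, then restart the worker
    }
}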

We should decouple restarting the workers.

If we just removed the catacomb.Init here:
https://github.com/juju/juju/blob/4530b4ddf5be45df371438fb7f20ef12b80a7d67/worker/uniter/uniter.go#L207

Then it would regress this PR:
https://github.com/juju/juju/pull/10087

That PR addressed the case where unit '3' of a CAAS application goes away but doesn't stop the Leadership Tracker, which meant unit '4' couldn't become the leader. This doesn't happen on IAAS because we run a separate agent per unit, and that agent dies when the unit is removed (vs CAAS, where we run all of the unit agents as dependencies from the application).

Revision history for this message
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/10637

should fix this properly for IAAS models, at least.

Changed in juju:
status: Triaged → In Progress
assignee: John A Meinel (jameinel) → Yang Kelvin Liu (kelvin.liu)
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released