Uniter can hang while trying to restart
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
Yang Kelvin Liu | ||
2.6 |
Fix Released
|
High
|
Yang Kelvin Liu |
Bug Description
I'll do some manual investigation, but we've seen evidence in the field of the Uniter not restarting properly. Tim did most of the legwork identifying that worker/
However, while being torn down, we have a defer with:
defer func() {
for _, ticketCh := range t.waitingLeader {
close(ticketCh)
}
for _, ticketCh := range t.waitingMinion {
close(ticketCh)
}
if t.claimLease != nil {
// wait for the goroutine started
// by setLeader to exit.
<-t.claimLease
}
}()
Which means we will wait for the claimLease to respond or be closed. However, if we are tearing down because of tomb.Dying we won't actually return from BlockUntilLeade
We might change it to just not wait for claimLease, but make sure that we can still cleanup goroutines during shutdown.
Alternatively, I did dig into whether we could thread through a 'cancel' channel to the rpc.Conn code. Which also uses a bare channel select:
func (conn *Conn) Call(req Request, params, response interface{}) error {
call := &Call{
Request: req,
Params: params,
Response: response,
Done: make(chan *Call, 1),
}
conn.send(call)
result := <-call.Done
return errors.
}
Essentially, we could just put that last call.Done into a
select {
case result := <-call.Done:
blah
case <-cancel:
conn.
return ErrCanceled
}
something like that.
Its a little messy as Call doesn't currently track its request id, but we could add it pretty easily.
Changed in juju: | |
assignee: | nobody → John A Meinel (jameinel) |
Changed in juju: | |
status: | In Progress → Fix Committed |
Changed in juju: | |
status: | Fix Committed → Fix Released |
Just changing the code to not block during TearDown causes problems in the test suite because it loses the synchronous nature. We use a hand-crafted mock/stub, and issue a request and then a Kill, and that request should trigger 2 calls. But since our Kill() *should* kill things in flight, it wasn't a great test to begin with.