Comment 2 for bug 1873482

Christian Muirhead (2-xtian) wrote :

We've been digging into this and can see some of the causes but haven't worked it out fully.

The raft timeouts in machine-0.log starting at line 471 (16:45:04) indicate that we can't apply commands to the raft state. That would prevent us from extending leases.

The calls to extend the lease would then fail with a timeout, and the is-responsible-flag worker died with that timeout error (line 499). That killed the migration fortress worker, and the other model workers that depend on it (essentially everything that should only run when no migration is in progress) started dying with the "fortress worker shutting down" error.
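To illustrate that cascade, here is a toy sketch of how one failed worker takes down everything that transitively depends on it. This is not Juju's dependency engine; the graph shape and the two example-model-worker names are hypothetical, and only the is-responsible-flag and migration fortress workers come from the logs.

```python
# Toy dependency graph: each worker maps to the workers it depends on.
# Only the first two entries are taken from the logs; the rest are hypothetical.
DEPENDS_ON = {
    "is-responsible-flag": [],
    "migration-fortress": ["is-responsible-flag"],
    "example-model-worker-a": ["migration-fortress"],  # hypothetical
    "example-model-worker-b": ["migration-fortress"],  # hypothetical
    "unrelated-worker": [],                            # hypothetical
}


def affected_by(failed, depends_on):
    """Return every worker that transitively depends on `failed`, sorted."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for worker, deps in depends_on.items():
            if current in deps and worker not in affected:
                affected.add(worker)
                frontier.append(worker)
    return sorted(affected)


print(affected_by("is-responsible-flag", DEPENDS_ON))
# ['example-model-worker-a', 'example-model-worker-b', 'migration-fortress']
```

The worker with no path back to is-responsible-flag is untouched, which matches what we'd expect for anything not gated on the fortress.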

From looking in /var/log/syslog we can see that there's a kernel message at 16:47:13 (line 2915) indicating that a disk was blocked for more than 120s. That puts the start of the stall at or before roughly 16:45:13, which would correspond to when the raft timeouts started happening (16:45:04).
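As a rough check on that correlation, here is a minimal sketch of the arithmetic. It assumes the kernel message is only logged once something has already been blocked for the full 120s, so the stall must have begun at or before the message time minus 120s. The function names are just for illustration; the timestamps are the ones from the logs above.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def latest_stall_start(message_time, threshold_s=120):
    """Latest time (HH:MM:SS) the stall can have begun, given the message time."""
    t = datetime.strptime(message_time, FMT)
    return (t - timedelta(seconds=threshold_s)).strftime(FMT)


def seconds_between(earlier, later):
    """Whole seconds from `earlier` to `later` (both HH:MM:SS, same day)."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return int(delta.total_seconds())


stall = latest_stall_start("16:47:13")  # kernel message time from syslog
print(stall)                               # 16:45:13
print(seconds_between("16:45:04", stall))  # 9
```

So the first raft timeout (16:45:04) falls within about 9 seconds of the latest possible start of the stall, which is consistent with the stall being the cause.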

From the timings for mongo commands following that message (line 2991) it looks like any DB queries were also blocked by that disk stall.

Then at 16:48:18 (line 64) in foundation.log we can see that a `juju wait` command failed - there's lots of output but the error is on line 6792:

ERROR:root:ERROR model "kubernetes" has been removed from the controller, run 'juju models' and switch to one of them.

Chasing through the code, the only way that error message can be generated (as far as I can tell) is if we get a bad response from mongodb when we try to get the model from the database, presumably as a result of the disk stall.