[2.7.6] Controller reporting model dropped when no action taken on model

Bug #1873482 reported by Michael Skalka
This bug affects 1 person
Affects: Canonical Juju
Status: Invalid
Importance: Low
Assigned to: Christian Muirhead
Milestone: (none)

Bug Description

During this test run (https://solutions.qa.canonical.com/#/qa/testRun/5135ff55-420e-4d56-bed8-1b3701ebdf2b), while waiting for a model to settle, the controller unexpectedly reported the model to be dropped.

Looking in the controller logs before that timestamp, the machine-0 workers start to error a minute or so earlier:

2020-04-16 16:35:53 INFO juju.worker.logger logger.go:118 logger worker started
2020-04-16 16:45:04 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:45:04 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:45:13 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:45:23 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:45:23 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:45:33 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:45:33 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:45:43 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:45:48 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "e081baef-037d-45e7-8a71-cc66adc9ccf6" "e081baef-037d-45e7-8a71-cc66adc9ccf6"} for "machine-0"
2020-04-16 16:45:48 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:45:55 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:45:55 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:46:03 ERROR juju.state status.go:558 failed to write status history: read tcp 127.0.0.1:43710->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:05 ERROR juju.state status.go:558 failed to write status history: read tcp 10.245.201.151:32974->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:09 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:46:09 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:46:12 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388" "08dd5d4f-a07e-428b-81f7-7bf6142b6d50"} for "machine-0"
2020-04-16 16:46:12 ERROR juju.worker.dependency engine.go:662 "is-primary-controller-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:46:14 ERROR juju.state status.go:558 failed to write status history: read tcp 127.0.0.1:43650->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:14 ERROR juju.state status.go:549 failed to update status history: read tcp 10.245.201.151:33018->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:15 ERROR juju.state status.go:549 failed to update status history: read tcp 127.0.0.1:43714->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:16 ERROR juju.state status.go:549 failed to update status history: read tcp 10.245.201.151:33228->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:16 ERROR juju.state status.go:549 failed to update status history: read tcp 127.0.0.1:43672->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:16 ERROR juju.state status.go:549 failed to update status history: read tcp 10.245.201.151:33220->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:17 ERROR juju.state status.go:549 failed to update status history: read tcp 127.0.0.1:43712->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:17 ERROR juju.state status.go:549 failed to update status history: read tcp 10.245.201.151:33224->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:18 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "582bddfa-0b76-41c0-8ccb-3f207775be84" "582bddfa-0b76-41c0-8ccb-3f207775be84"} for "machine-0"
2020-04-16 16:46:18 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388"} for "machine-0"
2020-04-16 16:46:18 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:46:18 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:46:19 ERROR juju.state status.go:558 failed to write status history: read tcp 127.0.0.1:43674->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:20 ERROR juju.state status.go:549 failed to update status history: read tcp 10.245.201.151:33012->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:23 ERROR juju.state status.go:558 failed to write status history: read tcp 127.0.0.1:43632->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:25 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:46:26 ERROR juju.state status.go:558 failed to write status history: read tcp 127.0.0.1:43648->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:29 ERROR juju.state status.go:558 failed to write status history: read tcp 10.245.201.151:33572->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:30 ERROR juju.state status.go:558 failed to write status history: read tcp 10.245.201.151:33044->10.245.201.151:37017: i/o timeout
2020-04-16 16:46:41 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "e081baef-037d-45e7-8a71-cc66adc9ccf6" "e081baef-037d-45e7-8a71-cc66adc9ccf6"} for "machine-0"
2020-04-16 16:46:41 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:46:41 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:46:48 ERROR juju.state status.go:549 failed to update status history: read tcp 127.0.0.1:43716->127.0.0.1:37017: i/o timeout
2020-04-16 16:46:59 WARNING juju.worker.globalclockupdater.raft worker.go:122 timed out updating clock, retrying in 1s
2020-04-16 16:46:59 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:47:05 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388" "08dd5d4f-a07e-428b-81f7-7bf6142b6d50"} for "machine-0"
2020-04-16 16:47:05 ERROR juju.worker.dependency engine.go:662 "is-primary-controller-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:47:11 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388" "95ce81b7-40a0-4dc7-84c5-5e78e8bae388"} for "machine-0"
2020-04-16 16:47:11 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:47:11 WARNING juju.worker.lease.raft manager.go:303 [08dd5d] retrying timed out while handling claim {"singular-controller" "582bddfa-0b76-41c0-8ccb-3f207775be84" "582bddfa-0b76-41c0-8ccb-3f207775be84"} for "machine-0"
2020-04-16 16:47:11 ERROR juju.worker.dependency engine.go:662 "is-responsible-flag" manifold worker returned unexpected error: lease operation timed out
2020-04-16 16:47:20 ERROR juju.worker.dependency engine.go:662 "raft-forwarder" manifold worker returned unexpected error: applying command: timed out enqueuing operation
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "migration-master" manifold worker returned unexpected error: fortress worker shutting down
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "unit-assigner" manifold worker returned unexpected error: fortress worker shutting down
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "metric-worker" manifold worker returned unexpected error: fortress worker shutting down
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "instance-poller" manifold worker returned unexpected error: fortress worker shutting down
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "storage-provisioner" manifold worker returned unexpected error: fortress worker shutting down
2020-04-16 16:47:21 ERROR juju.worker.dependency engine.go:662 "application-scaler" manifold worker returned unexpected error: fortress worker shutting down

... Long list of workers shutting down ...

This is followed by a number of i/o timeouts on the controller's connections to its own mongod (the 127.0.0.1:37017 and 10.245.201.151:37017 reads above).

Revision history for this message
Michael Skalka (mskalka) wrote :
description: updated
Revision history for this message
Christian Muirhead (2-xtian) wrote :

We've been digging into this and can see some of the causes but haven't worked it out fully.

The raft timeouts in machine-0.log starting at line 471 (16:45:04) indicate that we can't apply commands to the raft state. That would prevent us from extending leases.
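
For context, the "timed out enqueuing operation" text in those errors is the message hashicorp/raft (the library Juju's raft layer builds on) returns when a command can't even be enqueued within the apply timeout. A minimal sketch of that failure mode follows; it is not Juju's raft-forwarder code, and applyCommand and its arguments are illustrative names:

// Sketch only, not Juju's raft-forwarder: the hashicorp/raft pattern in
// which an Apply that cannot be enqueued in time reports ErrEnqueueTimeout,
// the same "timed out enqueuing operation" text seen in the log above.
package sketch

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// applyCommand tries to commit payload to the raft log within timeout.
// If the local node is too stalled to enqueue the command at all (for
// example during a disk stall), the future reports raft.ErrEnqueueTimeout.
func applyCommand(r *raft.Raft, payload []byte, timeout time.Duration) error {
	future := r.Apply(payload, timeout)
	if err := future.Error(); err != nil {
		if err == raft.ErrEnqueueTimeout {
			return fmt.Errorf("applying command: %v", err)
		}
		return err
	}
	return nil
}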

The calls to extend the lease would fail with a timeout and the is-responsible-flag worker died with the timeout error (line 499). That killed the migration fortress worker, and the other model workers that depend on it (essentially everything that should only run when there's no migration running) started dying with the "fortress worker shutting down" error.
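
To illustrate that cascade (this is a simplified stand-in, not Juju's dependency engine or fortress code): when an upstream flag worker dies with an error, everything that depends on it is told to shut down and reports why, which is what produces the wall of "fortress worker shutting down" errors above.

// Minimal, self-contained illustration (not Juju's actual dependency
// engine) of an upstream flag worker failing and bouncing its dependents.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errLeaseTimeout = errors.New("lease operation timed out")

// flagWorker stands in for "is-responsible-flag"; here it simply fails
// after a while, as if its lease extension had timed out.
func flagWorker(stopped chan<- error) {
	time.Sleep(100 * time.Millisecond)
	stopped <- errLeaseTimeout
}

// dependentWorker runs until the flag worker goes away, then exits with a
// "shutting down" error, analogous to the fortress-dependent model workers.
func dependentWorker(name string, flagGone <-chan struct{}, done chan<- error) {
	<-flagGone
	done <- fmt.Errorf("%q worker: fortress worker shutting down", name)
}

func main() {
	flagStopped := make(chan error, 1)
	flagGone := make(chan struct{})
	done := make(chan error, 2)

	go flagWorker(flagStopped)
	go dependentWorker("unit-assigner", flagGone, done)
	go dependentWorker("instance-poller", flagGone, done)

	// The "engine": when the flag worker dies, signal every dependent.
	fmt.Println("is-responsible-flag:", <-flagStopped)
	close(flagGone)

	for i := 0; i < 2; i++ {
		fmt.Println(<-done)
	}
}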

From looking in /var/log/syslog we can see that there's a kernel message at 16:47:13 (line 2915) indicating that a disk was blocked for more than 120s - that would correspond to when the raft timeouts started happening.

From the timings for mongo commands following that message (line 2991) it looks like any DB queries were also blocked by that disk stall.

Then at 16:48:18 (line 64) in foundation.log we can see that a `juju wait` command failed - there's lots of output but the error is on line 6792:

ERROR:root:ERROR model "kubernetes" has been removed from the controller, run 'juju models' and switch to one of them.

Chasing through the code, the only way that error message can be generated (as far as I can tell) is if we get a bad response from mongodb when we try to get the model from the database, presumably as a result of the disk stall.
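
A hedged sketch of the kind of lookup where this could go wrong - illustrative only; getModel and modelDoc are hypothetical and this is not Juju's state code. If the layer reading from mongodb collapses an i/o failure into the same result as a genuinely missing document, the caller ends up reporting the model as removed even though it still exists:

// Illustrative only: how a transient read failure can be misreported as
// "model not found" if errors are collapsed too eagerly.
package sketch

import (
	"github.com/juju/errors"
	"gopkg.in/mgo.v2"
)

type modelDoc struct {
	UUID string `bson:"_id"`
	Name string `bson:"name"`
}

func getModel(models *mgo.Collection, uuid string) (*modelDoc, error) {
	var doc modelDoc
	err := models.FindId(uuid).One(&doc)
	switch {
	case err == nil:
		return &doc, nil
	case err == mgo.ErrNotFound:
		// Genuinely missing: correct to report not-found.
		return nil, errors.NotFoundf("model %q", uuid)
	default:
		// An i/o timeout lands here; if this branch also returned
		// NotFoundf, the caller would see "model has been removed"
		// for a model that still exists.
		return nil, errors.Annotate(err, "reading model")
	}
}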

Changed in juju:
status: New → In Progress
assignee: nobody → Christian Muirhead (2-xtian)
status: In Progress → Triaged
Revision history for this message
Christian Muirhead (2-xtian) wrote :

We've narrowed down the region of code where this can happen, and even though we don't fully understand the cause, it's clear this is not a regression in 2.7.6, so we don't think this bug should block the release.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

To be clearer about the sequence of events:

- The test deployed the kubernetes bundle and then ran `juju wait` (which polls status) to wait for everything to converge.
- While things were being deployed, the controller machine's disk stalled for two minutes.
- That caused the database and raft leases to pause.
- Lease extends failed, so the is-responsible-flag worker bounced, which bounced the migration-fortress worker, which bounced the downstream model workers.
- The database (or our db library) responded with a not-found error to one of the juju wait status calls after the pause finished. (This is the juju bug.)
- The API responded with the ModelNotFound error to juju wait, which failed the test run.
- The test cleaned up, collecting the crashdump and tearing down the controller.

As far as we can tell, the ModelNotFound error was transient - if the test hadn't failed and torn down the controller, it seems like it would have recovered and continued with the deploy (assuming the disk stall was a one-off).
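
Purely as an illustration of how a client could ride out an error like this (this is not what `juju wait` actually does, and pollStatus is a placeholder, not a real Juju client call): a status poller could retry a not-found response a few times before treating it as fatal.

// Illustrative mitigation sketch: retry a status poll a few times before
// treating "model not found" as permanent, since the error was transient.
package sketch

import (
	"time"

	"github.com/juju/errors"
)

// pollStatus stands in for whatever call fetches the model's status.
type pollStatus func() error

func waitWithRetry(poll pollStatus, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = poll()
		if err == nil || !errors.IsNotFound(err) {
			return err // success, or a different (non-transient) failure
		}
		// Treat not-found as possibly transient and poll again.
		time.Sleep(delay)
	}
	return errors.Annotatef(err, "model still reported missing after %d attempts", attempts)
}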

Joshua Genet (genet022)
tags: removed: cdo-release-blocker
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Undecided → Low
tags: added: expirebugs-bot
Revision history for this message
Konstantinos Kaskavelis (kaskavel) wrote :

Closing this due to inactivity (low number of occurrences, and no hits for more than one year).

tags: added: solutions-qa-expired
Changed in juju:
status: Triaged → Invalid