Controller upgrade ends up with locked upgrade
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Jack Shaw | 2.9.27
Bug Description
Hello,
Upon upgrading from 2.9.11 to 2.9.12, I ended up with an unresponsive controller.
Here is a trace of the logs at the moment of the upgrade on the 3 controllers: https:/
We can see that controller-0 and controller-1 had their juju processes die as a result of the upgrade, while controller-2 was waiting on the mongo primary
/var/log/
Here is a sample of the mongo logs for the 3 units at the same time: https:/
We can see that controller-0's and controller-1's mongo ended up as "SECONDARY" and controller-2's as "PRIMARY".
For some reason, the upgrade was stuck waiting for something on mongo, as seen in controller-2's juju log.
Don't hesitate to reach out if you need more logs.
Regards,
tags: added: canonical-is-upgrades
Changed in juju:
assignee: nobody → Jack Shaw (jack-shaw)
importance: Medium → High
milestone: none → 2.9.27
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Juju uses this check:

isPrimary, err := w.pool.IsPrimary(w.tag.Id())
...
if isPrimary {
    w.runUpgrade()
} else {
    w.watchUpgrade()
}
...
func (w *upgradeDB) watchUpgrade() {
    w.logger.Infof("waiting for database upgrade on mongodb primary")
    ...
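For context, here is a minimal standalone sketch of what a primary check like this can look like when asked of a mongod directly, using the isMaster command via the gopkg.in/mgo.v2 driver (Juju itself uses a fork of mgo). The reply field names are MongoDB's; everything else here is illustrative, not Juju's actual implementation:

package main

import (
    "fmt"
    "log"

    mgo "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// isLocalPrimary asks the mongod this session is bound to whether it
// currently believes it is the replica-set primary.
func isLocalPrimary(session *mgo.Session) (bool, error) {
    var res struct {
        IsMaster bool   `bson:"ismaster"` // true only on the primary
        Primary  string `bson:"primary"`  // address of the primary, if known
    }
    if err := session.Run(bson.D{{Name: "isMaster", Value: 1}}, &res); err != nil {
        return false, err
    }
    return res.IsMaster, nil
}

func main() {
    // connect=direct stops the driver from following the replica set to
    // the primary; we want this specific member's own view.
    session, err := mgo.Dial("localhost:37017?connect=direct")
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    primary, err := isLocalPrimary(session)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("local mongod is primary:", primary)
}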
Since we see that message in the log, machine-2 did not believe that the local mongo was the primary for the replica set.
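That is consistent with the watchUpgrade branch above: a controller that does not see its local mongod as primary just waits for the primary to perform the database upgrade. As a toy model of the failure mode (hypothetical names, not Juju's code), every such controller is effectively blocked in a select like this:

package main

import (
    "errors"
    "fmt"
    "time"
)

// waitForUpgrade models the waiting side: block until the primary signals
// that the database upgrade finished, or give up after a timeout. If no
// controller believes its local mongod is the primary (or the one that
// does has died), the done channel is never closed and everyone waits.
func waitForUpgrade(done <-chan struct{}, timeout time.Duration) error {
    select {
    case <-done:
        return nil
    case <-time.After(timeout):
        return errors.New("timed out waiting for database upgrade on mongodb primary")
    }
}

func main() {
    done := make(chan struct{})
    // Nobody ever closes done here, mirroring the stuck upgrade.
    fmt.Println(waitForUpgrade(done, 2*time.Second))
}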
The mongo log has:
Sep  1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY
vs machine-2's log:
/var/log/juju/machine-2.log:2021-09-01 14:10:31 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary
But I don't know what machine .245 corresponds to.
Looking at machine-0, it sees the disconnect from Mongo but successfully reconnects:

/var/log/juju/machine-0.log:2021-09-01 14:09:53 WARNING juju.state.pool.txnwatcher txnwatcher.go:355 txn watcher sync error: watcher iteration error: EOF
/var/log/juju/machine-0.log:2021-09-01 14:10:02 INFO juju.state.pool.txnwatcher txnwatcher.go:341 txn sync watcher recovered after 1 retries
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]
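(The "recovered after 1 retries" message suggests a retry loop along these lines; this is only an illustrative sketch, not the actual txnwatcher code:)

package main

import (
    "errors"
    "log"
    "time"
)

// syncWithRetry retries a failing sync operation with a fixed backoff and
// logs a recovery message once it succeeds, mirroring the log lines above.
func syncWithRetry(sync func() error, maxRetries int, backoff time.Duration) error {
    for attempt := 0; ; attempt++ {
        err := sync()
        if err == nil {
            if attempt > 0 {
                log.Printf("txn sync watcher recovered after %d retries", attempt)
            }
            return nil
        }
        log.Printf("txn watcher sync error: %v", err)
        if attempt >= maxRetries {
            return errors.New("txn watcher: giving up after retries")
        }
        time.Sleep(backoff)
    }
}

func main() {
    calls := 0
    // Fail once, then succeed, producing the same shape of log output.
    err := syncWithRetry(func() error {
        calls++
        if calls == 1 {
            return errors.New("watcher iteration error: EOF")
        }
        return nil
    }, 3, 100*time.Millisecond)
    if err != nil {
        log.Fatal(err)
    }
}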
However, it also restarts *much* later, but without the upgraded Juju version:

/var/log/juju/machine-0.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]
Machine-1 seems to see something similar:
/var/log/juju/machine-1.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-1.log:2021-09-01 14:09:38 INFO juju.worker.httpserver worker.go:180 shutting down HTTP server
It seems a little odd that mongo is restarting / electing a new
primary while we are trying to do the upgrade. Looking back at the
Mongo logs, we see:
Sep  1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.25.62.246:37017
Sep  1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.25.62.246:37017 - HostUnreachable: Connection refused
Sep  1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 10.25.62.246:37017 due to failed operation on a connection
Sep  1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Error in heartbeat request to 10.25.62.246:37017; HostUnreachable: Connection refused
Sep  1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY
So just before .245 got elected, controller-0's mongod was failing to connect to .246. And note that mongo keeps failing to connect to .246.
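When diagnosing this kind of election churn, it helps to dump each member's state the same way these syslog lines report it. A sketch using the replSetGetStatus command via mgo (the reply field names are MongoDB's; the rest is illustrative, not a Juju tool):

package main

import (
    "fmt"
    "log"

    mgo "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

func main() {
    session, err := mgo.Dial("localhost:37017?connect=direct")
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // replSetGetStatus reports each member's view of the set: PRIMARY,
    // SECONDARY, or an unreachable state, matching the log lines above.
    var status struct {
        Members []struct {
            Name     string  `bson:"name"`
            StateStr string  `bson:"stateStr"`
            Health   float64 `bson:"health"` // 1 = reachable, 0 = not
        } `bson:"members"`
    }
    if err := session.Run(bson.D{{Name: "replSetGetStatus", Value: 1}}, &status); err != nil {
        log.Fatal(err)
    }
    for _, m := range status.Members {
        fmt.Printf("%-22s %-10s health=%v\n", m.Name, m.StateStr, m.Health)
    }
}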
Approximate steps of what should happen:
- User requests a new version
- All 3 co...