Controller upgrade ends up with locked upgrade

Bug #1942447 reported by Benjamin Allot
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Jack Shaw
Milestone: 2.9.27

Bug Description

Hello,

Upon upgrading from 2.9.11 to 2.9.12, I ended up with an unresponsive controller.

Here is a trace of the logs from the three controllers at the moment of the upgrade: https://pastebin.canonical.com/p/rsjDn5H83q/

We can see that controller-0 and controller-1 had their juju processes die as a result of the upgrade, while controller-2 was waiting on the mongo primary:

/var/log/juju/machine-2.log:2021-09-01 14:10:31 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

Here is a sample of the mongo logs for the three units at the same time: https://pastebin.canonical.com/p/ScBrF5qbHG/

We can see that controller-0's and controller-1's mongo ended up as "SECONDARY" and controller-2's as "PRIMARY".

For some reason, the upgrade was stuck waiting for something on mongo, as seen in controller-2's juju log.

Don't hesitate to reach out if you need more logs.

Regards,

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1942447] [NEW] Controller upgrade ends up with locked upgrade

juju uses this check:
        isPrimary, err := w.pool.IsPrimary(w.tag.Id())
...
        if isPrimary {
                w.runUpgrade()
        } else {
                w.watchUpgrade()
        }
...
func (w *upgradeDB) watchUpgrade() {
        w.logger.Infof("waiting for database upgrade on mongodb primary")

Since we are seeing that in the log, that means that machine-2 did not
believe that the local mongo was the primary for the replica set.

The mongo log has:

Sep 1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY

vs machine-2's log:

/var/log/juju/machine-2.log:2021-09-01 14:10:31 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

But I don't know what machine .245 corresponds to.

Looking at machine-0 it sees the disconnect from Mongo, but
successfully reconnects:

/var/log/juju/machine-0.log:2021-09-01 14:09:53 WARNING juju.state.pool.txnwatcher txnwatcher.go:355 txn watcher sync error: watcher iteration error: EOF
/var/log/juju/machine-0.log:2021-09-01 14:10:02 INFO juju.state.pool.txnwatcher txnwatcher.go:341 txn sync watcher recovered after 1 retries
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]

However, it also restarts *much* later but without the upgraded Juju version:

/var/log/juju/machine-0.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]

Machine-1 seems to see something similar:

/var/log/juju/machine-1.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-1.log:2021-09-01 14:09:38 INFO juju.worker.httpserver worker.go:180 shutting down HTTP server

It seems a little odd that mongo is restarting / electing a new
primary while we are trying to do the upgrade. Looking back at the
Mongo logs, we see:

Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.25.62.246:37017
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.25.62.246:37017 - HostUnreachable: Connection refused
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 10.25.62.246:37017 due to failed operation on a connection
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Error in heartbeat request to 10.25.62.246:37017; HostUnreachable: Connection refused
Sep 1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY

So just before .245 got elected, it was failing to connect to .246.

And note that mongo keeps failing to connect to .246.

Approximate steps of what should happen:

   - User requests a new version
   - All 3 co...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

It would seem that machine-2 didn't see that it was the primary when
restarting, but also that machine-0 and machine-1 failed to restart at all.

Revision history for this message
John A Meinel (jameinel) wrote :

It would be good to understand why mongo thinks .246 is unavailable.

Changed in juju:
importance: Undecided → Medium
status: New → Incomplete
Tom Haddon (mthaddon)
tags: added: canonical-is-upgrades
Revision history for this message
John A Meinel (jameinel) wrote :

We can certainly manually go in and fix this if it is directly impacting you (it is a matter of making sure all units come up with the same 2.9.12 and are ready to go). But we would want to understand why Mongo is unable to come up with full HA first.

Revision history for this message
Benjamin Allot (ballot) wrote :

Sure, is there something I can do in case this happens again?

Do you need any other logs?

Just to be clear, the controllers restarted because I restarted them manually (hence the 45 minutes).
They would have stayed down otherwise.

Revision history for this message
Benjamin Allot (ballot) wrote :

All the mongo logs around the relevant time are already included.

Do not hesitate to tell me if you need others.

.246 is ubuntu/2 (the third controller unit).

Here are the syslogs around the time of the issue (not filtered to mongo only): https://pastebin.canonical.com/p/BhNj8KYrp4/

Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

We should do a pre-check for the upgrade that verifies the Mongo replica set is in a healthy state before we start trying to initiate the upgrade.
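
A minimal sketch of what such a pre-check could look like, assuming a health rule of "exactly one PRIMARY, everything else SECONDARY". This is not Juju's actual code: it uses the stock MongoDB Go driver rather than Juju's internal mgo-based state layer, and the function name replicaSetHealthy, the port, and the omitted auth options are illustrative only.

package main

import (
        "context"
        "fmt"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
)

// replicaSetHealthy returns true only if exactly one member is PRIMARY and
// every other member is SECONDARY, i.e. the replica set looks safe to upgrade.
func replicaSetHealthy(ctx context.Context, client *mongo.Client) (bool, error) {
        var status struct {
                Members []struct {
                        Name     string `bson:"name"`
                        StateStr string `bson:"stateStr"`
                } `bson:"members"`
        }
        // replSetGetStatus has to be run against the admin database.
        res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "replSetGetStatus", Value: 1}})
        if err := res.Decode(&status); err != nil {
                return false, err
        }
        primaries := 0
        for _, m := range status.Members {
                switch m.StateStr {
                case "PRIMARY":
                        primaries++
                case "SECONDARY":
                        // healthy follower, nothing to do
                default:
                        // RECOVERING, STARTUP2, "(not reachable/healthy)", etc.
                        return false, fmt.Errorf("member %s is in state %s", m.Name, m.StateStr)
                }
        }
        return primaries == 1, nil
}

func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        // Port 37017 matches the controller's mongod; auth/TLS options omitted for brevity.
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:37017"))
        if err != nil {
                panic(err)
        }
        defer client.Disconnect(ctx)
        ok, err := replicaSetHealthy(ctx, client)
        fmt.Println("replica set healthy:", ok, "err:", err)
}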

Changed in juju:
status: New → Triaged
tags: added: mongodb upgrade-controller
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Subscribing field-crit: I have a set of controllers in the same state, and it's in a customer's environment.

Changed in juju:
assignee: nobody → Jack Shaw (jack-shaw)
importance: Medium → High
milestone: none → 2.9.27
Revision history for this message
Jack Shaw (jack-shaw) wrote :

This looks to be the same issue as the one I recently reported here:
https://bugs.launchpad.net/juju/+bug/1963924

A fix is on the way

Revision history for this message
John A Meinel (jameinel) wrote :

@Peter, so there is a workaround, which is to restart the controller agent on the machine that has the mongo primary. I think that takes it out of Field Crit, but we can certainly finish up Jack's work. (Note that even with his patch, it won't get your field environment running, as it can't upgrade to 2.9.27 until it comes back up from the earlier one.)
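
To apply that workaround you first need to know which member is currently the primary. A rough sketch of one way to find it, again using the stock MongoDB Go driver with illustrative connection details (not Juju tooling); the reported host:port tells you which controller machine's agent to restart:

package main

import (
        "context"
        "fmt"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        // Port 37017 is the controller mongod; auth/TLS options omitted for brevity.
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:37017"))
        if err != nil {
                panic(err)
        }
        defer client.Disconnect(ctx)

        // isMaster reports this node's address and the current primary's host:port.
        var result struct {
                Primary string `bson:"primary"`
                Me      string `bson:"me"`
        }
        res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "isMaster", Value: 1}})
        if err := res.Decode(&result); err != nil {
                panic(err)
        }
        fmt.Printf("this node: %s, current primary: %s\n", result.Me, result.Primary)
}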

Jack Shaw (jack-shaw)
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released