Controller upgrade ends up with locked upgrade

Bug #1942447 reported by Benjamin Allot
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Jack Shaw
Milestone: 2.9.27

Bug Description

Hello,

Upon upgrading from 2.9.11 to 2.9.12, I ended up with an unresponsive controller.

Here is a trace of the logs from the three controllers at the moment of the upgrade: https://pastebin.canonical.com/p/rsjDn5H83q/

We can see that controller-0 and controller-1 had their juju processes die as a result of the upgrade, while controller-2 was waiting on the mongo primary:

/var/log/juju/machine-2.log:2021-09-01 14:10:31 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

Here is a sample of the mongo logs for the three units at the same time: https://pastebin.canonical.com/p/ScBrF5qbHG/

We can see that controller-0's and controller-1's mongo ended up as "SECONDARY" and controller-2's as "PRIMARY".

For some reason, the upgrade was stuck waiting for something on mongo, as seen in controller-2's juju log.

Don't hesitate to reach out if you need more logs.

Regards,

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1942447] [NEW] Controller upgrade ends up with locked upgrade

juju uses this check:
        isPrimary, err := w.pool.IsPrimary(w.tag.Id())
...
        if isPrimary {
                w.runUpgrade()
        } else {
                w.watchUpgrade()
        }
...
func (w *upgradeDB) watchUpgrade() {
        w.logger.Infof("waiting for database upgrade on mongodb primary")

Since we are seeing that in the log, that means that machine-2 did not
believe that the local mongo was the primary for the replica set.

The mongo log has:

Sep 1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY

vs machine-2's log:

/var/log/juju/machine-2.log:2021-09-01 14:10:31 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

But I don't know what machine .245 corresponds to.

Looking at machine-0 it sees the disconnect from Mongo, but
successfully reconnects:

/var/log/juju/machine-0.log:2021-09-01 14:09:53 WARNING juju.state.pool.txnwatcher txnwatcher.go:355 txn watcher sync error: watcher iteration error: EOF
/var/log/juju/machine-0.log:2021-09-01 14:10:02 INFO juju.state.pool.txnwatcher txnwatcher.go:341 txn sync watcher recovered after 1 retries
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]

However, it also restarts *much* later but without the upgraded Juju version:

/var/log/juju/machine-0.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-0.log:2021-09-01 14:57:15 INFO juju.cmd supercommand.go:56 running jujud [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]

Machine-1 seems to see something similar:

/var/log/juju/machine-1.log:2021-09-01 14:09:31 INFO juju.worker.upgrader upgrader.go:267 upgrade requested from 2.9.11 to 2.9.12
...
/var/log/juju/machine-1.log:2021-09-01 14:09:38 INFO juju.worker.httpserver worker.go:180 shutting down HTTP server

It seems a little odd that mongo is restarting / electing a new
primary while we are trying to do the upgrade. Looking back at the
Mongo logs, we see:

Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Connecting to 10.25.62.246:37017
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.25.62.246:37017 - HostUnreachable: Connection refused
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 10.25.62.246:37017 due to failed operation on a connection
Sep 1 14:09:55 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Error in heartbeat request to 10.25.62.246:37017; HostUnreachable: Connection refused
Sep 1 14:10:00 juju-511925-controller-0 mongod.37017[11128]: [ReplicationExecutor] Member 10.25.62.245:37017 is now in state PRIMARY

So just before .245 got elected, it was failing to connect to .246.

And note that mongo keeps failing to connect to .246.

Approximate steps of what should happen:

   - User requests a new version
   - All 3 co...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

It would seem that machine-2 didn't see that it was the primary when
restarting, but also that machine-0 and machine-1 failed to restart at all.

Revision history for this message
John A Meinel (jameinel) wrote :

It would be good to understand why mongo thinks .246 is unavailable.

Changed in juju:
importance: Undecided → Medium
status: New → Incomplete
Tom Haddon (mthaddon)
tags: added: canonical-is-upgrades
Revision history for this message
John A Meinel (jameinel) wrote :

We can certainly manually go in and fix this if it is directly impacting you (it is a matter of making sure all units come up with the same 2.9.12 and are ready to go). But we would want to understand why Mongo is unable to come up with full HA first.

Revision history for this message
Benjamin Allot (ballot) wrote :

Sure, is there something I can do in case this happens again?

Do you need any other logs?

Just to be clear, the controllers restarted because I restarted them manually (hence the 45 minutes).
They would have stayed down otherwise.

Revision history for this message
Benjamin Allot (ballot) wrote :

All the mongo logs around the relevant time are already included.

Do not hesitate to tell me if you need others.

.246 is ubuntu/2 (the third controller unit).

Here are the syslogs around the time of the issue (not filtered to mongo only): https://pastebin.canonical.com/p/BhNj8KYrp4/

Changed in juju:
status: Incomplete → New
Revision history for this message
John A Meinel (jameinel) wrote :

We should do a pre-check for the upgrade that verifies the Mongo replica set is in a healthy state before we start trying to initiate the upgrade.
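
A minimal sketch of what such a pre-check could look like, assuming a health rule of "exactly one PRIMARY, everything else SECONDARY". This is not Juju's actual code: it uses the stock MongoDB Go driver rather than Juju's internal mgo-based state layer, and the function name replicaSetHealthy, the port, and the omitted auth options are illustrative only.

package main

import (
        "context"
        "fmt"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
)

// replicaSetHealthy returns true only if exactly one member is PRIMARY and
// every other member is SECONDARY, i.e. the replica set looks safe to upgrade.
func replicaSetHealthy(ctx context.Context, client *mongo.Client) (bool, error) {
        var status struct {
                Members []struct {
                        Name     string `bson:"name"`
                        StateStr string `bson:"stateStr"`
                } `bson:"members"`
        }
        // replSetGetStatus has to be run against the admin database.
        res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "replSetGetStatus", Value: 1}})
        if err := res.Decode(&status); err != nil {
                return false, err
        }
        primaries := 0
        for _, m := range status.Members {
                switch m.StateStr {
                case "PRIMARY":
                        primaries++
                case "SECONDARY":
                        // healthy follower, nothing to do
                default:
                        // RECOVERING, STARTUP2, "(not reachable/healthy)", etc.
                        return false, fmt.Errorf("member %s is in state %s", m.Name, m.StateStr)
                }
        }
        return primaries == 1, nil
}

func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        // Port 37017 matches the controller's mongod; auth/TLS options omitted for brevity.
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:37017"))
        if err != nil {
                panic(err)
        }
        defer client.Disconnect(ctx)
        ok, err := replicaSetHealthy(ctx, client)
        fmt.Println("replica set healthy:", ok, "err:", err)
}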

Changed in juju:
status: New → Triaged
tags: added: mongodb upgrade-controller
Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Subscribing field-crit: I have a set of controllers in the same state, and it's in a customer's environment.

Changed in juju:
assignee: nobody → Jack Shaw (jack-shaw)
importance: Medium → High
milestone: none → 2.9.27
Revision history for this message
Jack Shaw (jack-shaw) wrote :

This looks to be the same issue as the one I recently reported here:
https://bugs.launchpad.net/juju/+bug/1963924

A fix is on the way

Revision history for this message
John A Meinel (jameinel) wrote :

@Peter, so there is a workaround, which is to restart the controller agent on the machine that has the mongo primary. I think that takes it out of Field Crit, but we can certainly finish up Jack's work. (Note that even with his patch, it won't get your field environment running, as it can't upgrade to 2.9.27 until it comes back up from the earlier one.)
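
To apply that workaround you first need to know which member is currently the primary. A rough sketch of one way to find it, again using the stock MongoDB Go driver with illustrative connection details (not Juju tooling); the reported host:port tells you which controller machine's agent to restart:

package main

import (
        "context"
        "fmt"
        "time"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        // Port 37017 is the controller mongod; auth/TLS options omitted for brevity.
        client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:37017"))
        if err != nil {
                panic(err)
        }
        defer client.Disconnect(ctx)

        // isMaster reports this node's address and the current primary's host:port.
        var result struct {
                Primary string `bson:"primary"`
                Me      string `bson:"me"`
        }
        res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "isMaster", Value: 1}})
        if err := res.Decode(&result); err != nil {
                panic(err)
        }
        fmt.Printf("this node: %s, current primary: %s\n", result.Me, result.Primary)
}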

Jack Shaw (jack-shaw)
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released