juju fails to upgrade HA controllers on (at least) lxd controllers

Bug #1963924 reported by Jack Shaw
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Undecided
Assigned to: Jack Shaw

Bug Description

Upgrading Juju from 2.9.25 to 2.9.26 fails. Observed only on lxd controllers, but it may also affect other machine controllers.

Steps to reproduce:
1) Bootstrap an lxd controller and enable HA. (It is currently unknown whether this affects other providers.)
2) Upgrade your controllers (optionally with --build-agent)

Expected result: controllers upgrade to the new version and I am able to proceed with normal operations
Observed result: `juju status` and other juju commands hang forever. Upgrade is not performed

Note: I have been able to reproduce this inside a Multipass VM and on (both x86- and ARM-based) AWS Ubuntu instances, but not using lxd on my Ubuntu desktop.

For further clarity in the logs, I made a small patch to the Juju code base before building juju and the agent for the upgrade, as follows:

```
diff --git a/mongo/mongo.go b/mongo/mongo.go
index 146d8664e9..f00ee47d32 100644
--- a/mongo/mongo.go
+++ b/mongo/mongo.go
@@ -260,6 +260,7 @@ func isMaster(session *mgo.Session, obj WithAddresses) (bool, error) {
        addrs := obj.Addresses()

        masterHostPort, err := replicaset.MasterHostPort(session)
+ logger.Infof("Master HostPort: %s", masterHostPort)

        // If the replica set has not been configured, then we
        // can have only one master and the caller must
@@ -272,11 +273,15 @@ func isMaster(session *mgo.Session, obj WithAddresses) (bool, error) {
        }

        masterAddr, _, err := net.SplitHostPort(masterHostPort)
+ logger.Infof("Master Address: %s", masterAddr)
        if err != nil {
                return false, err
        }

+ logger.Infof("Checking master host against %d addrs", len(addrs))
        for _, addr := range addrs {
+ logger.Infof("Checking against %s", addr)
+ logger.Infof("Result: %t", addr.Value == masterAddr)
                if addr.Value == masterAddr {
                        return true, nil
                }
diff --git a/worker/upgradedatabase/worker.go b/worker/upgradedatabase/worker.go
index 35e27fec21..5c9d5732b0 100644
--- a/worker/upgradedatabase/worker.go
+++ b/worker/upgradedatabase/worker.go
@@ -184,6 +184,7 @@ func (w *upgradeDB) run() error {

        // If we are the primary we need to run the upgrade steps.
        // Otherwise we watch state and unlock once the primary has run the steps.
+ w.logger.Infof("Is Primary? %t", isPrimary)
        if isPrimary {
                w.runUpgrade()
        } else {
```

Relevant information (particularly the IP addresses of the controllers):
```
$ juju status -m controller
Model Controller Cloud/Region Version SLA Timestamp
controller c localhost/localhost 2.9.25 unsupported 14:00:13Z

Machine State DNS Inst id Series AZ Message
0 started 10.73.49.98 juju-371cb8-0 focal Running
1 started 10.73.49.251 juju-371cb8-1 focal Running
2 started 10.73.49.189 juju-371cb8-2 focal Running
```

And the resulting logs on each controller machine:

Machine 0:
INFO juju.mongo mongo.go:263 Master HostPort: 10.73.49.189:37017
INFO juju.mongo mongo.go:276 Master Address: 10.73.49.189
INFO juju.mongo mongo.go:281 Checking master host against 3 addrs
INFO juju.mongo mongo.go:283 Checking against local-cloud:10.73.49.251@space:0
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:127.0.0.1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:::1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.worker.upgradedatabase worker.go:187 Is Primary? false
INFO juju.worker.upgradedatabase worker.go:264 waiting for database upgrade on mongodb primary
ERROR juju.worker.dependency engine.go:693 "api-caller" manifold worker returned unexpected error: [550624] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
ERROR juju.worker.dependency engine.go:693 "api-caller" manifold worker returned unexpected error: [550624] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
ERROR juju.worker.upgradedatabase worker.go:309 timed out waiting for primary database upgrade
ERROR juju.worker.dependency engine.go:693 "api-caller" manifold worker returned unexpected error: [550624] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused
ERROR juju.worker.dependency engine.go:693 "api-caller" manifold worker returned unexpected error: [550624] "machine-1" cannot open api: unable to connect to API: dial tcp 127.0.0.1:17070: connect: connection refused

Machine 1:

INFO juju.mongo mongo.go:263 Master HostPort: 10.73.49.189:37017
INFO juju.mongo mongo.go:276 Master Address: 10.73.49.189
INFO juju.mongo mongo.go:281 Checking master host against 3 addrs
INFO juju.mongo mongo.go:283 Checking against local-cloud:10.73.49.251@space:0
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:127.0.0.1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:::1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.worker.upgradedatabase worker.go:187 Is Primary? false
INFO juju.worker.upgradedatabase worker.go:264 waiting for database upgrade on mongodb primary

Machine 2:
INFO juju.mongo mongo.go:263 Master HostPort: 10.73.49.98:37017
INFO juju.mongo mongo.go:276 Master Address: 10.73.49.98
INFO juju.mongo mongo.go:281 Checking master host against 3 addrs
INFO juju.mongo mongo.go:283 Checking against local-cloud:10.73.49.189@space:0
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:127.0.0.1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.mongo mongo.go:283 Checking against local-machine:::1
INFO juju.mongo mongo.go:284 Result: false
INFO juju.worker.upgradedatabase worker.go:187 Is Primary? false
INFO juju.worker.upgradedatabase worker.go:264 waiting for database upgrade on mongodb primary

Clearly, none of the controllers believes it is the mongo primary, so they all wait for the primary to perform the database upgrade, which of course never happens, leading to a stalemate.

It seems this is caused by replicaset.MasterHostPort returning a different master on different machines. This is as far as I have managed to investigate, however.

Revision history for this message
Alan Baghumian (alanbach) wrote :

I just encountered this issue! My HA controllers (2) are KVM VMs and I was trying to upgrade from 2.9.22 to 2.9.25.

Did you manage to find a solution / fix?

Revision history for this message
Jack Shaw (jack-shaw) wrote (last edit ):

Hi Alan, I believe I have identified a fix for the issue; it is currently in progress. Keep an eye on this issue for further updates.

If you'd like, you can confirm for sure this is the same issue by finding the line:

INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

inside the logs (/var/log/juju/machine-(0|1).log) on both your controllers.

Changed in juju:
status: New → In Progress
assignee: nobody → Jack Shaw (jack-shaw)
Revision history for this message
Simon Richardson (simonrichardson) wrote :

A potential workaround for this would be to bounce the controller agents:

    lxc exec juju-<hash> bash
    sudo systemctl restart jujud-machine-<machine id>.service

Revision history for this message
Alan Baghumian (alanbach) wrote :

@jack-shaw

Yup! It is the same:

2022-03-09 01:05:34 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary
2022-03-09 01:12:51 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary
2022-03-09 01:33:46 INFO juju.worker.upgradedatabase worker.go:263 waiting for database upgrade on mongodb primary

@simonrichardson

I wish it was that easy, that was one of the first things I tried. Including rebooting both controllers.

Revision history for this message
Alan Baghumian (alanbach) wrote :

OK, I fixed my controllers!!

Here is what I did, hopefully this will save someone else as well:

- Turn off all the controllers except the primary. (This process will work on secondary instances as well, but if you have access to the primary, that is better.)
- SSH into the remaining controller and:

# Get Mongo credentials

$ export user=$(sudo grep tag /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2)
$ export password=$(sudo grep statepassword /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2)

# Get Mongo client installed

$ sudo apt-get install mongodb-clients && sudo apt-get clean

# Login to Mongo shell

$ mongo 127.0.0.1:37017/juju -u "${user}" -p "${password}" --sslAllowInvalidCertificates --ssl --authenticationDatabase admin

PRIMARY> rs.conf()
PRIMARY> rs.status()
PRIMARY> cfg = rs.conf(); cfg.members = [cfg.members[0]]; rs.reconfig(cfg, {force:1});
PRIMARY> quit();

**NOTE**: Inspect the outputs of rs.conf() and rs.status() and change the cfg.members[0] piece accordingly, depending on where your primary database sits in the member list.

- We are essentially keeping one instance in the replica set and removing the rest of the replicas.
- This should force mongodb to show PRIMARY (if it is not), and the API services should start working and the upgrade process should go through.
- Tail the log (sudo tail -f /var/log/juju/machine-0.log) to make sure things are settling down.
- Turn the offline controllers back on, they should catch up and sync with the primary.
- Watch $ juju status, agents should start coming back online from down.
- Tail the log (sudo tail -f /var/log/juju/machine-0.log) to make sure things are ERROR free.
- Log back into the mongo shell and execute rs.conf() and rs.status() to make sure the rest of the controllers have been added back (this happens automatically).

The whole point of having an HA environment is to protect us, not to shoot ourselves in the foot!!!

Jack Shaw (jack-shaw)
Changed in juju:
milestone: none → 2.9.27
Jack Shaw (jack-shaw)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released