juju-upgrade from 2.2.9 to 2.3.2 fails with state changing too quickly

Bug #1746265 reported by Sandor Zeestraten on 2018-01-30
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
juju
High
John A Meinel
2.2
Undecided
Unassigned
2.3
Critical
John A Meinel

Bug Description

# Cloud
MAAS 2.1.5

# Issue
Upgrading our juju environment (3x HA controllers) from 2.1.2 -> 2.2.9 -> 2.3.2 fails during the last upgrade which renders the controllers and juju unusable. The upgrade from 2.1.2 -> 2.2.9 went seemingly fine, however the upgrade from 2.2.9 -> 2.3.2 failed.

# How to reproduce
We manage to reproduce this in our testing environment, bootstrapping a clean 2.1.2 environment, deploying OpenStack, then doing the same upgrade steps (see steps below).

## With a Juju 2.1.2 client
* juju bootstrap devmaas --bootstrap-constraints tags=juju-controller # bootstrap 2.1.2 controller
* juju enable-ha --constraints tags=juju-controller
* juju deploy bundle.yaml # deploy custom openstack HA bundle
* juju upgrade-charm ceph-mon # upgrade all openstack-charmers charms to handle new juju version
* juju upgrade-charm ...

## Switch to a Juju 2.3.2 client
* juju upgrade-juju -m controller # upgrade controllers to 2.2.9
* juju upgrade-juju # upgrade default model containing the openstack deployment to 2.2.9

Both the juju controller and openstack model are upgraded to 2.2.9 at this point and work as expected.

* juju upgrade-juju -m controller # try to upgrade controllers to 2.3.2

This last step renders the controllers and juju unusable.

# Errors
## An excerpt from machine-0.log on controller 0
...
2018-01-30 11:15:22 WARNING juju.worker.upgradesteps worker.go:275 stopped waiting for other controllers: tomb: dying
2018-01-30 11:15:22 ERROR juju.worker.upgradesteps worker.go:379 upgrade from 2.2.9 to 2.3.2 for "machine-0" failed (giving up): tomb: dying
2018-01-30 11:15:23 ERROR juju.worker runner.go:392 exited "leadership": upgrading legacy lease documents: state changing too quickly; try again soon
2018-01-30 11:15:23 ERROR juju.worker runner.go:392 exited "singular": upgrading legacy lease documents: state changing too quickly; try again soon
...

## Juju controller not responding from client
12:36 $ juju status
ERROR could not fetch leaders: upgrading legacy lease documents: state changing too quickly; try again soon (excessive contention)

# Logs
See attached logs for a collection of the machine logs from the controllers. These logs are from a clean reproduction running through the steps from # How to reproduce

Sandor Zeestraten (szeestraten) wrote :
Anastasia (anastasia-macmood) wrote :

This could potentially go hand-in-hand with bug # 1742604.

Symptoms are different but difficulty to upgrade are tear-inducing.

tags: added: upgrade-juju
Anastasia (anastasia-macmood) wrote :

I am adding this to 2.3 as Critical and 2.4 as High.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Anastasia (anastasia-macmood) wrote :

Also added 2.2 as this is the version where the reporter was coming from. We are unlikely to release another 2.2 but we might need to provide a way forward via a script rather than an upgrade step...

Sandor Zeestraten (szeestraten) wrote :

Please let me know if you need more info from the reproduced environment to help find the root cause.

Tim Penhey (thumper) wrote :

@Sandor, do you have a controller with this reproduced right now?

If so, some interactive debugging would be incredibly useful.

Tim Penhey (thumper) wrote :

This sounds like the situation we have were one of the controllers doesn't realise that there is an upgrade to process. The other HA controllers have restarted and waiting for the other one to check-in before progressing.

To unstick it, the agent that hasn't realised needs to be restarted. Things generally progress pretty quickly after that.

We haven't yet determined what is causing one of the API servers to get stuck, so looking at a controller that is in that situation would be very helpful.

Sandor Zeestraten (szeestraten) wrote :

Tim, yes we do. Unfortunately I can't let you in without a rigmarole, but do let me know if there's something I can dump or run through some steps. If that doesn't help, then I can probably reproduce it again with more verbose logging.

I see you mention that there might be a condition where the controllers are waiting on each other. Do the machine logs give a hint at which agent should be restarted?

Sandor Zeestraten (szeestraten) wrote :

# Debugging output

On machine-0: db.leases.find().pretty()
https://paste.ubuntu.com/26502478/

On machine-0: db.txns.find({"o.c": "leases", "s": 5}).sort({"$natural":-1}).limit(1).pretty()
https://paste.ubuntu.com/26502504/

Tim Penhey (thumper) wrote :

Hmm... based on the information you were able to provide to me earlier today, it seems that somehow the database got messed up during the upgrade. Unfortunately the logs aren't helpful at showing exactly how this got out of sync.

The database has content that looks as if it has a transaction half applied. Now this shouldn't happen. Clearly.

So... how to get out of this situation...

The only way without redeploying is some database surgery. I would recommend doing a backup first, but since juju is in a half broken state, this isn't easy. What you would need to do is go and shut down each of the controller API servers. SSH into each of the controller machines and do the following:
  sudo service jujud-machine-x stop (where x is the machine id)

Then you'll want to do a mongo dump of the juju database (found in /var/lib/juju/db)

Then get into the database using something similar to https://pastebin.ubuntu.com/26502452/

conf=/var/lib/juju/agents/machine-*/agent.conf
user=`sudo grep tag $conf | cut -d' ' -f2`
password=`sudo grep statepassword $conf | cut -d' ' -f2`
/usr/lib/juju/mongo*/bin/mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"

Once inside there, you need to execute the following commands:

db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#nova-cloud-controller#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#cinder-hacluster#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#neutron-gateway#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#ceph-osd#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#ntp#"})

Before the agents are restarted, we are going to want to ensure that they rerun the upgrade steps. The agents determine this by looking in their agent.conf file to see what version they were last running. This file is found in /var/lib/juju/agents/machine-x directory (where x is the machine id). This is a YAML file, and the key you are looking for is "upgradedToVersion". It probably says "2.3.2", but to make it rerun the upgrade steps (which are idempotent) make it say "2.2.9".

Then restart the agents. They should start up, wait for each other, then run the upgrade steps, then progress normally.

Download full text (4.3 KiB)

Thanks for the guide Tim. I did the steps as following, but things didn't progress normally as we'd hope.

# Steps
1. Stop controller agents
systemctl stop jujud-machine-0.service
systemctl stop jujud-machine-1.service
systemctl stop jujud-machine-2.service

2. Dump mongodb on machine-0 (primary)
/usr/lib/juju/mongo*/bin/mongodump --host 127.0.0.1 --port 37017 --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username machine-0 --password <password>

3. Remove leases in mongodb
OK

4. Check upgradedToVersion in /var/lib/juju/agents/machine-x
N.B. It was set to "upgradedToVersion: 2.2.9" on all machines.

5. Restart controller agents
systemctl restart jujud-machine-0.service
systemctl restart jujud-machine-1.service
systemctl restart jujud-machine-2.service

6. Observe
Relevant logs (see attached) start around 2018-02-02 12:23

machine-0 seems to get stuck with a lot of the following messages:
2018-02-02 12:24:09 ERROR juju.worker runner.go:392 exited "peergrouper": worker "peergrouper" exited: cannot compute desired peer group: voting non-machine member replicaset.Member{Id:3, Address:"xx.xx.65.252:37017", Arbiter:(*bool)(0xc422786c21), BuildIndexes:(*bool)(0xc422786c31), Hidden:(*bool)(0xc422786c39), Priority:(*float64)(0xc422786c68), Tags:map[string]string{"juju-machine-id":"1"}, SlaveDelay:(*time.Duration)(0xc422786c78), Votes:(*int)(0xc422786cb0)} found in peer group

juju commands started responding again, though the controller model is not happy:
14:08 $ juju status -m controller --format yaml
model:
  name: controller
  controller: devmaas
  cloud: devmaas
  version: 2.3.2
machines:
  "0":
    juju-status:
      current: started
      since: 02 Feb 2018 13:29:05+01:00
      version: 2.3.2
    dns-name: xx.xx.65.251
    ip-addresses:
    - xx.xx.65.251
    instance-id: s6cahd
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:53:26+01:00
    series: xenial
    constraints: mem=3584M tags=juju-controller
    hardware: arch=amd64 cores=2 mem=4096M tags=virtual,juju-controller availability-zone=hcc-dev-rack1
    controller-member-status: has-vote
  "1":
    juju-status:
      current: error
      message: 'upgrade to 2.3.2 failed (giving up): aborted wait for other controllers:
        timed out after 4h0m0s'
      since: 30 Jan 2018 17:55:42+01:00
      version: 2.3.2
    dns-name: xx.xx.65.252
    ip-addresses:
    - xx.xx.65.252
    instance-id: ddpemq
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:58:26+01:00
    series: xenial
    constraints: tags=juju-controller
    hardware: arch=amd64 cores=2 mem=2048M tags=juju-controller,virtual availability-zone=hcc-dev-rack2
    controller-member-status: has-vote
  "2":
    juju-status:
      current: started
      since: 30 Jan 2018 09:02:33+01:00
      version: 2.3.2
    dns-name: xx.xx.65.253
    ip-addresses:
    - xx.xx.65.253
    instance-id: fx7dhn
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:58:22+01:00
    series: xenial
    constraints: tags=juju-controller
    hardware: arch=amd64 cores=2 mem=2048M tags=backend,juju-controll...

Read more...

I'll need some further instructions if you want to keep this environment alive. Otherwise I'll have to reproduce this at another time.

Tim Penhey (thumper) wrote :

I'm not sure we can get much more useful out of this environment.

How frequently was this occurring? One thing that did stand out is that it seemed that the mongo replicaset wasn't entirely happy.

From #juju-dev:

zeestrat> thumper: Regarding your last question in #1746265, it is a blocker as we haven't been able to get our controllers upgraded at all due to hitting this bug in our staging environment after multiple tries in clean environments. I can reproduce it again with some more logging if you'd like, but then I need to know which flags you want.

Was xx.xx.65.252 one of the IP addresses for machine 1? It seems odd that
the peergrouper would complain about a machine it doesn't know about.
Do these machine have multiple IP addresses? I wonder if it is a case of
somehow changing what IP addresses we are trying to assign. (In the 2.2.9
code, we're somehow selecting the first address, but 2.3.2 decided it
prefers the second address.)

How long does it take to reproduce this? Is it hours? Is it minutes?

On Wed, Feb 7, 2018 at 12:58 AM, Sandor Zeestraten <email address hidden>
wrote:

> >From #juju-dev:
>
> zeestrat> thumper: Regarding your last question in #1746265, it is a
> blocker as we haven't been able to get our controllers upgraded at all
> due to hitting this bug in our staging environment after multiple tries
> in clean environments. I can reproduce it again with some more logging
> if you'd like, but then I need to know which flags you want.
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1746265
>
> Title:
> juju-upgrade from 2.2.9 to 2.3.2 fails with state changing too quickly
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1746265/+subscriptions
>

Hi John, the controllers don't have multiple IP addresses, just the ones listed in #11.

Reproduce takes about 3 hours mostly due to the OpenStack deployment.

John A Meinel (jameinel) on 2018-02-15
Changed in juju:
milestone: none → 2.4-beta1
assignee: nobody → John A Meinel (jameinel)
John A Meinel (jameinel) wrote :

merged from 2.3 into develop from https://github.com/juju/juju/pull/8384

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers