Bug #1746265 “juju-upgrade from 2.2.9 to 2.3.2 fails with state ...” : Bugs : Canonical Juju

Sandor Zeestraten (szeestraten) wrote on 2018-01-30:

#1

Machine logs from reproduction (1.4 MiB, application/x-tar)

Anastasia (anastasia-macmood) wrote on 2018-01-31:

#2

This could potentially go hand-in-hand with bug #1742604.

The symptoms are different, but the difficulties with upgrading are tear-inducing.

tags:

added: upgrade-juju

Anastasia (anastasia-macmood) wrote on 2018-01-31:

#3

I am adding this to 2.3 as Critical and 2.4 as High.

Changed in juju:
status:	New → Triaged
importance:	Undecided → High

Anastasia (anastasia-macmood) wrote on 2018-01-31:

#4

Also added 2.2 as this is the version where the reporter was coming from. We are unlikely to release another 2.2 but we might need to provide a way forward via a script rather than an upgrade step...

Sandor Zeestraten (szeestraten) wrote on 2018-01-31:

#5

Please let me know if you need more info from the reproduced environment to help find the root cause.

Tim Penhey (thumper) wrote on 2018-02-01:

#6

@Sandor, do you have a controller with this reproduced right now?

If so, some interactive debugging would be incredibly useful.

Tim Penhey (thumper) wrote on 2018-02-01:

#7

This sounds like the situation we have seen where one of the controllers doesn't realise that there is an upgrade to process. The other HA controllers have restarted and are waiting for that one to check in before progressing.

To unstick it, the agent that hasn't realised needs to be restarted. Things generally progress pretty quickly after that.

We haven't yet determined what is causing one of the API servers to get stuck, so looking at a controller that is in that situation would be very helpful.

Sandor Zeestraten (szeestraten) wrote on 2018-02-01:

#8

Tim, yes we do. Unfortunately I can't let you in without a rigmarole, but do let me know if there's something I can dump or some steps I can run through. If that doesn't help, then I can probably reproduce it again with more verbose logging.

I see you mention that there might be a condition where the controllers are waiting on each other. Do the machine logs give a hint at which agent should be restarted?

Sandor Zeestraten (szeestraten) wrote on 2018-02-01:

#9

# Debugging output

On machine-0: db.leases.find().pretty()
https://paste.ubuntu.com/26502478/

On machine-0: db.txns.find({"o.c": "leases", "s": 5}).sort({"$natural":-1}).limit(1).pretty()
https://paste.ubuntu.com/26502504/

Tim Penhey (thumper) wrote on 2018-02-02:

#10

Hmm... based on the information you were able to provide to me earlier today, it seems that somehow the database got messed up during the upgrade. Unfortunately the logs aren't helpful at showing exactly how this got out of sync.

The database has content that looks as if it has a transaction half applied. Now this shouldn't happen. Clearly.

So... how to get out of this situation...

The only way out without redeploying is some database surgery. I would recommend doing a backup first, but since juju is in a half-broken state, this isn't easy. You would need to shut down each of the controller API servers: SSH into each of the controller machines and run the following:
sudo service jujud-machine-x stop (where x is the machine id)

Then you'll want to do a mongo dump of the juju database (found in /var/lib/juju/db)

Then get into the database using something similar to https://pastebin.ubuntu.com/26502452/

conf=/var/lib/juju/agents/machine-*/agent.conf
user=`sudo grep tag $conf | cut -d' ' -f2`
password=`sudo grep statepassword $conf | cut -d' ' -f2`
/usr/lib/juju/mongo*/bin/mongo 127.0.0.1:37017/juju --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"

Once inside there, you need to execute the following commands:

db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#nova-cloud-controller#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#cinder-hacluster#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#neutron-gateway#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#ceph-osd#"})
db.leases.remove({"_id": "ef61dcef-2fb3-4b58-8ec6-2a9a0b2410c3:application-leadership#ntp#"})

Before the agents are restarted, we want to ensure that they rerun the upgrade steps. The agents determine this by looking in their agent.conf file to see what version they were last running. This file is found in the /var/lib/juju/agents/machine-x directory (where x is the machine id). It is a YAML file, and the key you are looking for is "upgradedToVersion". It probably says "2.3.2", but to make the agent rerun the upgrade steps (which are idempotent), make it say "2.2.9".
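
The agent.conf edit described above could be scripted roughly as follows. This is only a sketch: it assumes the key sits on a single top-level line of the form "upgradedToVersion: X.Y.Z", and the function name reset_upgraded_version and the ".bak" backup suffix are made up for illustration, not part of Juju.

```shell
#!/bin/sh
# Hypothetical helper (not a Juju command): set the upgradedToVersion key
# in an agent.conf file back to an older version.
# Assumption: the key is on one top-level line, "upgradedToVersion: X.Y.Z".
reset_upgraded_version() {
    conf_file=$1
    old_version=$2
    # Keep a copy of the original file next to it before editing.
    cp -p "$conf_file" "$conf_file.bak" || return 1
    # Rewrite only the value of the upgradedToVersion line; writing back
    # into the existing file keeps its ownership and permissions.
    sed "s/^upgradedToVersion:.*/upgradedToVersion: $old_version/" \
        "$conf_file.bak" > "$conf_file"
}
```

It would be run as root on each controller while the agent is stopped, for example: reset_upgraded_version /var/lib/juju/agents/machine-0/agent.conf 2.2.9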

Then restart the agents. They should start up, wait for each other, then run the upgrade steps, then progress normally.

Sandor Zeestraten (szeestraten) wrote on 2018-02-02:

#11

Thanks for the guide, Tim. I followed the steps as listed below, but things didn't progress normally as we'd hoped.

# Steps
1. Stop controller agents
systemctl stop jujud-machine-0.service
systemctl stop jujud-machine-1.service
systemctl stop jujud-machine-2.service

2. Dump mongodb on machine-0 (primary)
/usr/lib/juju/mongo*/bin/mongodump --host 127.0.0.1 --port 37017 --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username machine-0 --password <password>

3. Remove leases in mongodb
OK

4. Check upgradedToVersion in /var/lib/juju/agents/machine-x
N.B. It was set to "upgradedToVersion: 2.2.9" on all machines.

5. Restart controller agents
systemctl restart jujud-machine-0.service
systemctl restart jujud-machine-1.service
systemctl restart jujud-machine-2.service

6. Observe
Relevant logs (see attached) start around 2018-02-02 12:23

machine-0 seems to get stuck with a lot of the following messages:
2018-02-02 12:24:09 ERROR juju.worker runner.go:392 exited "peergrouper": worker "peergrouper" exited: cannot compute desired peer group: voting non-machine member replicaset.Member{Id:3, Address:"xx.xx.65.252:37017", Arbiter:(*bool)(0xc422786c21), BuildIndexes:(*bool)(0xc422786c31), Hidden:(*bool)(0xc422786c39), Priority:(*float64)(0xc422786c68), Tags:map[string]string{"juju-machine-id":"1"}, SlaveDelay:(*time.Duration)(0xc422786c78), Votes:(*int)(0xc422786cb0)} found in peer group
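
To summarise messages like the one above, a small filter can pull out the replica-set member address and its juju-machine-id tag. This is a sketch that assumes the exact line format shown here; the function name peergrouper_members is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<address> machine-<id>" once for every
# replica-set member named in a peergrouper "voting non-machine member" error.
# Assumption: log lines look like the example above, with Address:"..." and
# a "juju-machine-id":"..." tag on the same line.
peergrouper_members() {
    grep 'voting non-machine member' "$1" |
        sed -n 's/.*Address:"\([^"]*\)".*"juju-machine-id":"\([^"]*\)".*/\1 machine-\2/p' |
        sort -u
}
```

For example, assuming the usual machine log location: peergrouper_members /var/log/juju/machine-0.log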

juju commands started responding again, though the controller model is not happy:
14:08 $ juju status -m controller --format yaml
model:
  name: controller
  controller: devmaas
  cloud: devmaas
  version: 2.3.2
machines:
  "0":
    juju-status:
      current: started
      since: 02 Feb 2018 13:29:05+01:00
      version: 2.3.2
    dns-name: xx.xx.65.251
    ip-addresses:
    - xx.xx.65.251
    instance-id: s6cahd
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:53:26+01:00
    series: xenial
    constraints: mem=3584M tags=juju-controller
    hardware: arch=amd64 cores=2 mem=4096M tags=virtual,juju-controller availability-zone=hcc-dev-rack1
    controller-member-status: has-vote
  "1":
    juju-status:
      current: error
      message: 'upgrade to 2.3.2 failed (giving up): aborted wait for other controllers:
        timed out after 4h0m0s'
      since: 30 Jan 2018 17:55:42+01:00
      version: 2.3.2
    dns-name: xx.xx.65.252
    ip-addresses:
    - xx.xx.65.252
    instance-id: ddpemq
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:58:26+01:00
    series: xenial
    constraints: tags=juju-controller
    hardware: arch=amd64 cores=2 mem=2048M tags=juju-controller,virtual availability-zone=hcc-dev-rack2
    controller-member-status: has-vote
  "2":
    juju-status:
      current: started
      since: 30 Jan 2018 09:02:33+01:00
      version: 2.3.2
    dns-name: xx.xx.65.253
    ip-addresses:
    - xx.xx.65.253
    instance-id: fx7dhn
    machine-status:
      current: running
      message: Deployed
      since: 29 Jan 2018 19:58:22+01:00
    series: xenial
    constraints: tags=juju-controller
    hardware: arch=amd64 cores=2 mem=2048M tags=backend,juju-controller,virtual availability-zone=hcc-dev-rack3
    controller-member-status: has-vote
applications: {}

After waiting a bit, only the agent.conf of machine-0 got an upgradedToVersion 2.3.2.
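
A quick way to repeat that check on each controller is sketched below. It assumes the key appears on one line as "upgradedToVersion: X.Y.Z"; the function name show_upgraded_versions is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<file> <version>" for every agent.conf given.
# Assumption: the key is on one line, "upgradedToVersion: X.Y.Z".
show_upgraded_versions() {
    for conf_file in "$@"; do
        # Strip the key and any following whitespace, leaving only the value.
        version=$(sed -n 's/^upgradedToVersion:[[:space:]]*//p' "$conf_file")
        # Fall back to "unknown" when the key is missing from the file.
        printf '%s %s\n' "$conf_file" "${version:-unknown}"
    done
}
```

Run as root, for example: show_upgraded_versions /var/lib/juju/agents/machine-*/agent.conf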

Regarding the state of the model with deployed charms, its machine agent logs report:
2018-02-02 12:30:36 ERROR juju.worker.dependency engine.go:551 "log-sender" manifold worker returned unexpected error: cannot send log message: write tcp xx.xx.64.45:33738->10.42.65.251:17070: write: connection reset by peer

Most of the charm units are also in an error state with hook failed: "config-changed", and the unit logs report variants of the following, though this is probably unrelated:
2018-02-02 13:16:16 DEBUG leader-settings-changed ERROR no network config found for binding "admin"
2018-02-02 13:16:16 DEBUG leader-settings-changed subprocess.CalledProcessError: Command '['network-get', '--primary-address', 'admin']' returned non-zero exit status 1

What's next? Any tips to help recover the environment are welcome but finding and fixing the root cause should probably be the focus. Would another reproduction with more verbose logging help? If so, please let me know which flags you want and where.

Sandor Zeestraten (szeestraten) wrote on 2018-02-02:

#12

machine-logs-after-surgery.tar.gz (2.0 MiB, application/x-tar)

Sandor Zeestraten (szeestraten) wrote on 2018-02-06:

#13

I'll need some further instructions if you want to keep this environment alive. Otherwise I'll have to reproduce this at another time.

Tim Penhey (thumper) wrote on 2018-02-06:

#14

I'm not sure we can get much more that is useful out of this environment.

How frequently was this occurring? One thing that did stand out is that it seemed that the mongo replicaset wasn't entirely happy.

Sandor Zeestraten (szeestraten) wrote on 2018-02-06:

#15

From #juju-dev:

zeestrat> thumper: Regarding your last question in #1746265, it is a blocker as we haven't been able to get our controllers upgraded at all due to hitting this bug in our staging environment after multiple tries in clean environments. I can reproduce it again with some more logging if you'd like, but then I need to know which flags you want.

John A Meinel (jameinel) wrote on 2018-02-07: Re: [Bug 1746265] Re: juju-upgrade from 2.2.9 to 2.3.2 fails with state changing too quickly

#16

Was xx.xx.65.252 one of the IP addresses for machine 1? It seems odd that
the peergrouper would complain about a machine it doesn't know about.
Do these machines have multiple IP addresses? I wonder if it is a case of
somehow changing what IP addresses we are trying to assign. (In the 2.2.9
code, we're somehow selecting the first address, but 2.3.2 decided it
prefers the second address.)

How long does it take to reproduce this? Is it hours? Is it minutes?

Sandor Zeestraten (szeestraten) wrote on 2018-02-07:

#17

Hi John, the controllers don't have multiple IP addresses, just the ones listed in #11.

Reproducing it takes about 3 hours, mostly due to the OpenStack deployment.

John A Meinel (jameinel) wrote on 2018-02-14:

#18

https://github.com/juju/juju/pull/8374

John A Meinel (jameinel) on 2018-02-15

Changed in juju:
milestone:	none → 2.4-beta1
assignee:	nobody → John A Meinel (jameinel)

John A Meinel (jameinel) wrote on 2018-02-15:

#19

Merged from 2.3 into develop in https://github.com/juju/juju/pull/8384

Changed in juju:
status:	Triaged → Fix Committed

Anastasia (anastasia-macmood) on 2018-07-10

Changed in juju:
status:	Fix Committed → Fix Released

juju-upgrade from 2.2.9 to 2.3.2 fails with state changing too quickly

Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        John A Meinel  Canonical Juju 2.4-beta1
2.2             Won't Fix     Undecided   Unassigned
2.3             Fix Released  Critical    John A Meinel  Canonical Juju 2.3.3