Backup restore fails: upgrade in progress

Bug #1544796 reported by Curtis Hovey
This bug affects 2 people
Affects          Status        Importance  Assigned to
Canonical Juju   Fix Released  Critical    Cheryl Jennings
juju-core        Fix Released  Critical    Unassigned
juju-core 1.25   Fix Released  Critical    Anastasia

Bug Description

As seen in
     http://reports.vapour.ws/releases/issue/5568b3c0749a560a1b7291bd

Master is claiming that an upgrade is in progress, which is impossible because CI tests the newest versions; there is nothing to upgrade to.

This issue was first seen intermittently in maas tests, but now it is killing both restore tests. The history of the restore jobs shows that master and maas-spaces have not had any issues restoring in many weeks; now, however, the feature is spontaneously broken testing a branch that in theory was tested as maas-spaces just before the merge.

The HA version of the test is leaving machines behind in AWS. People are culling the leftover machines to keep CI operational.

Curtis Hovey (sinzui)
tags: added: blocker
Revision history for this message
Curtis Hovey (sinzui) wrote :

Reviewing the changes between the last blessed master and the failure, I see:
    https://github.com/juju/juju/commit/b4e2785874df8c612c46c55b70903554b16bd156
    https://github.com/juju/juju/commit/c2150283643eef3cd37e50ae48df8f5049428dcb
But both of these were merged into the maas-spaces branch, tested, and blessed.

So, given maas-spaces has blessed commits, the only commit that is in the failing master revision but not in maas-spaces is
     https://github.com/juju/juju/commit/f1ca8c74e57fe08f82adf883b66ddde226299857

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

The problem goes much deeper; I think both the restore process is incorrect and the assessment job itself is not thorough enough in its checks.

Compare the machine-0 logs from these runs:
http://reports.vapour.ws/releases/3595/job/functional-backup-restore/attempt/3660 (from last blessed master run);
http://reports.vapour.ws/releases/3600/job/functional-backup-restore/attempt/3666 (one of the failures after merging maas-spaces).

Briefly, the assess_recovery job with --backup does roughly these steps (a sketch follows the list):
- bootstraps and waits for it to complete
- deploys some units, waits for status to settle
- runs create-backup (which is almost not logged at all, apart from a single line: "INFO juju.state.backups create.go:256 dumping juju state-related files")
- tries to restore immediately to verify that it fails (but DOES NOT log what error it got; it might have been "upgrade in progress", or anything else)
- terminates the bootstrap instance, waiting for it to stop
- finally runs restore-backup -b <backup-file> (i.e. bootstraps anew).
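
A minimal sketch of that sequence, in Go for illustration (the juju helper below is hypothetical, and the real job does more checking between steps); the point is that every command's output gets logged, including the expected-failure restore:

    package main

    import (
        "log"
        "os/exec"
        "strings"
    )

    // juju runs a juju subcommand and always logs its combined output,
    // so even an expected failure records *which* error occurred.
    func juju(args ...string) (string, error) {
        out, err := exec.Command("juju", args...).CombinedOutput()
        log.Printf("juju %s: err=%v\n%s", strings.Join(args, " "), err, out)
        return strings.TrimSpace(string(out)), err
    }

    func main() {
        juju("bootstrap")
        juju("deploy", "ubuntu") // then wait for status to settle
        backup, _ := juju("create-backup") // assumption: backup path on stdout
        // Restoring against a live controller must fail; logging the output
        // shows whether it was "upgrade in progress" or something else.
        if _, err := juju("restore-backup", backup); err == nil {
            log.Fatal("restore against a live controller unexpectedly succeeded")
        }
        // Terminate the bootstrap instance out-of-band (cloud API), then:
        juju("restore-backup", "-b", backup)
    }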

There are a few things that come up in a bootstrap followed by a restore:
1) restore messes up cert addresses and api host ports (easy to see by comparing the old controller's 172. address to the new one). That's pre-existing and seems to go a long way back unnoticed.

2) INSTANCE data of the bootstrap node (including the instance ID and everything else) is stale from the old node, apart from the addresses and instance state, which are checked and updated by the instancepoller. If the CI job verified the instance ID and address it sees in status post-restore, this could have been caught a lot earlier (see the first sketch after this list). Not only that: the machines collection is also messed up (addresses and pretty much everything else from the old controller stay the same until workers start jiggling things around; both are a controller-model machine 0 in state).

3) peergrouper replicaset members: that's interesting, and it was the reason for this fix: https://github.com/juju/juju/pull/4373. Mongo *requires* all replicaset members to be known either by their hostname/address (NOT "localhost"), or ALL of them need to use "localhost" (see http://stackoverflow.com/questions/7954535/mongodb-replicaset-host-name-change-error and https://docs.mongodb.org/manual/reference/replica-configuration/#rsconf.members[n].host; this is the reason for the errors from worker/peergrouper in earlier runs without that fix). See the second sketch after this list.

4) as a consequence of the fix in PR 4373 and the messed-up api host ports, we now see a different error: "cannot set replicaset: exception: can't find self in new replset config", because the agent insists on using the old controller's address first (it happens to sort earlier than the new controller's address; on this hinges the occasional successful run of this job, I think).

5) messing up the addresses (api host ports) was not enough to fail the test consistently: despite the certupdater, instancepoller, agent, and apiserver trying hard to put back the old controller address, the new controller fortunately can *also* connect to "localhost", so given time (between toolsversionsupdater restarting like crazy and the discoverspaces worker which now introduces a slight dela...
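
On point (2), a hedged sketch of the kind of post-restore check the job could add; the "machines"/"instance-id" keys follow juju's JSON status layout, and the expected ID is a placeholder that would really come from the cloud provider API:

    package main

    import (
        "encoding/json"
        "fmt"
        "os/exec"
    )

    type jujuStatus struct {
        Machines map[string]struct {
            InstanceID string `json:"instance-id"`
            DNSName    string `json:"dns-name"`
        } `json:"machines"`
    }

    func main() {
        out, err := exec.Command("juju", "status", "--format", "json").Output()
        if err != nil {
            panic(err)
        }
        var s jujuStatus
        if err := json.Unmarshal(out, &s); err != nil {
            panic(err)
        }
        // Placeholder: the real check would fetch the instance ID of the
        // freshly bootstrapped controller from the provider (e.g. EC2).
        want := "i-0123456789abcdef0"
        if got := s.Machines["0"].InstanceID; got != want {
            fmt.Printf("stale instance ID after restore: %q != %q\n", got, want)
        }
    }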

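On point (3), a small sketch of the mongo constraint using the mgo driver (the replica set name "juju", the port, and the hostnames are assumptions): a reconfig that mixes "localhost" with a real hostname is rejected, which is why the stale api host ports matter so much here.

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            panic(err)
        }
        defer session.Close()

        // Invalid: one member uses "localhost" while the other uses a real
        // hostname; mongo requires all members to use one style or the other.
        cfg := bson.D{
            {"_id", "juju"},
            {"version", 2},
            {"members", []bson.D{
                {{"_id", 0}, {"host", "localhost:37017"}},
                {{"_id", 1}, {"host", "controller-1.example.com:37017"}},
            }},
        }
        var result bson.M
        err = session.Run(bson.D{{"replSetReconfig", cfg}}, &result)
        fmt.Println(err) // expect an error about mixed localhost/hostname members
    }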

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Commit https://github.com/juju/juju/commit/f1ca8c74e57fe08f82adf883b66ddde226299857 missed a place where err needed to be converted with errors.Cause. Preparing a patch.
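
For illustration, the Go pattern at fault (the sentinel name here is hypothetical): once an error has been annotated with errors.Trace, a direct equality check against the sentinel fails unless errors.Cause unwraps it first.

    package main

    import (
        "fmt"

        "github.com/juju/errors"
    )

    var errUpgradeInProgress = fmt.Errorf("upgrade in progress")

    func doRestore() error {
        // Somewhere down the call chain the sentinel gets traced/annotated.
        return errors.Trace(errUpgradeInProgress)
    }

    func main() {
        err := doRestore()
        fmt.Println(err == errUpgradeInProgress)               // false: wrapped
        fmt.Println(errors.Cause(err) == errUpgradeInProgress) // true
    }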

Revision history for this message
Cheryl Jennings (cherylj) wrote :

This is taking a while because restore is really quite broken and I'm trying to fix latent issues as well.

Revision history for this message
Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core master

Juju-CI verified that this issue is Fix Released in juju-core master:
    http://reports.vapour.ws/releases/3647

Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: 2.0-count
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta1 → none
milestone: none → 2.0-beta1
Revision history for this message
Aaron Bentley (abentley) wrote :

This was fixed in core, but is a top issue for 1.25.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Aaron Bentley (abentley)
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju-core:
status: Invalid → Triaged
Changed in juju-core:
importance: High → Critical
tags: removed: blocker
Revision history for this message
Anastasia (anastasia-macmood) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
Changed in juju-core:
status: Fix Committed → Fix Released