Backup restore fails: upgrade in progress

Bug #1544796 reported by Curtis Hovey
This bug affects 2 people
Affects          Status        Importance  Assigned to
Canonical Juju   Fix Released  Critical    Cheryl Jennings
juju-core        Fix Released  Critical    Unassigned
juju-core 1.25   Fix Released  Critical    Anastasia

Bug Description

As seen in
     http://reports.vapour.ws/releases/issue/5568b3c0749a560a1b7291bd

Master is claiming that an upgrade is in progress, which is impossible because CI tests the newest versions; there is nothing to upgrade to.

This issue was first seen intermittently in maas tests, but now it is killing both restore tests. The history of the restore jobs shows that master and maas-spaces have not had any issues restoring in many weeks; now, however, the feature is spontaneously broken testing a branch that in theory was tested as maas-spaces just before the merge.

The HA version of the test is leaving machines behind in AWS. People are culling the leftover machines to keep CI operational.

Curtis Hovey (sinzui)
tags: added: blocker
Revision history for this message
Curtis Hovey (sinzui) wrote :

Reviewing the changes between the last blessed master and the failure, I see:
    https://github.com/juju/juju/commit/b4e2785874df8c612c46c55b70903554b16bd156
    https://github.com/juju/juju/commit/c2150283643eef3cd37e50ae48df8f5049428dcb
But both of these were merged into the maas-spaces branch, tested, and blessed.

So, given maas-spaces has blessed commits, the only commit that is in the failing master revision but not in maas-spaces is
     https://github.com/juju/juju/commit/f1ca8c74e57fe08f82adf883b66ddde226299857

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

The problem goes much deeper; I think both the restore process is incorrect and the assessment job itself is not thorough enough in its checks.

Compare the machine-0 logs from these runs:
http://reports.vapour.ws/releases/3595/job/functional-backup-restore/attempt/3660 (from last blessed master run);
http://reports.vapour.ws/releases/3600/job/functional-backup-restore/attempt/3666 (one of the failures after merging maas-spaces).

Briefly, the assess_recovery job with --backup does roughly these steps (a sketch follows the list):
- bootstraps and waits for it to complete
- deploys some units, waits for status to settle
- runs create-backup (which is almost not logged at all, apart from a single line: "INFO juju.state.backups create.go:256 dumping juju state-related files")
- tries to restore immediately to verify that it fails (but DOES NOT log what error it got; it might have been "upgrade in progress", or anything else)
- terminates the bootstrap instance, waiting for it to stop
- finally runs restore-backup -b <backup-file> (i.e. bootstraps anew).
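
A minimal sketch of that sequence, in Go for illustration (the juju helper below is hypothetical, and the real job does more checking between steps); the point is that every command's output gets logged, including the expected-failure restore:

    package main

    import (
        "log"
        "os/exec"
        "strings"
    )

    // juju runs a juju subcommand and always logs its combined output,
    // so even an expected failure records *which* error occurred.
    func juju(args ...string) (string, error) {
        out, err := exec.Command("juju", args...).CombinedOutput()
        log.Printf("juju %s: err=%v\n%s", strings.Join(args, " "), err, out)
        return strings.TrimSpace(string(out)), err
    }

    func main() {
        juju("bootstrap")
        juju("deploy", "ubuntu") // then wait for status to settle
        backup, _ := juju("create-backup") // assumption: backup path on stdout
        // Restoring against a live controller must fail; logging the output
        // shows whether it was "upgrade in progress" or something else.
        if _, err := juju("restore-backup", backup); err == nil {
            log.Fatal("restore against a live controller unexpectedly succeeded")
        }
        // Terminate the bootstrap instance out-of-band (cloud API), then:
        juju("restore-backup", "-b", backup)
    }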

There are a few things that come up in a bootstrap followed by a restore:
1) restore messes up cert addresses and api host ports (easy to see by comparing the old controller's 172. address to the new one). That's pre-existing and seems to go a long way back unnoticed.

2) INSTANCE data of the bootstrap node (including the instance ID and everything else) is stale from the old node, apart from the addresses and instance state, which are checked and updated by the instancepoller. If the CI job verified the instance ID and address it sees in status post-restore, this could have been caught a lot earlier (see the first sketch after this list). Not only that: the machines collection is also messed up (addresses and pretty much everything else from the old controller stay the same until workers start jiggling things around; both are a controller-model machine 0 in state).

3) peergrouper replicaset members: that's interesting, and it was the reason for this fix: https://github.com/juju/juju/pull/4373. Mongo *requires* all replicaset members to be known either by their hostname/address (NOT "localhost"), or ALL of them need to use "localhost" (see http://stackoverflow.com/questions/7954535/mongodb-replicaset-host-name-change-error and https://docs.mongodb.org/manual/reference/replica-configuration/#rsconf.members[n].host; this is the reason for the errors from worker/peergrouper in earlier runs without that fix). See the second sketch after this list.

4) as a consequence of the fix in PR 4373 and the messed-up api host ports, we now see a different error: "cannot set replicaset: exception: can't find self in new replset config", because the agent insists on using the old controller's address first (it happens to sort earlier than the new controller's address; on this hinges the occasional successful run of this job, I think).

5) messing up the addresses (api host ports) was not enough to fail the test consistently: despite the certupdater, instancepoller, agent, and apiserver trying hard to put back the old controller address, the new controller fortunately can *also* connect to "localhost", so given time (between toolsversionsupdater restarting like crazy and the discoverspaces worker which now introduces a slight dela...
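
On point (2), a hedged sketch of the kind of post-restore check the job could add; the "machines"/"instance-id" keys follow juju's JSON status layout, and the expected ID is a placeholder that would really come from the cloud provider API:

    package main

    import (
        "encoding/json"
        "fmt"
        "os/exec"
    )

    type jujuStatus struct {
        Machines map[string]struct {
            InstanceID string `json:"instance-id"`
            DNSName    string `json:"dns-name"`
        } `json:"machines"`
    }

    func main() {
        out, err := exec.Command("juju", "status", "--format", "json").Output()
        if err != nil {
            panic(err)
        }
        var s jujuStatus
        if err := json.Unmarshal(out, &s); err != nil {
            panic(err)
        }
        // Placeholder: the real check would fetch the instance ID of the
        // freshly bootstrapped controller from the provider (e.g. EC2).
        want := "i-0123456789abcdef0"
        if got := s.Machines["0"].InstanceID; got != want {
            fmt.Printf("stale instance ID after restore: %q != %q\n", got, want)
        }
    }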

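On point (3), a small sketch of the mongo constraint using the mgo driver (the replica set name "juju", the port, and the hostnames are assumptions): a reconfig that mixes "localhost" with a real hostname is rejected, which is why the stale api host ports matter so much here.

    package main

    import (
        "fmt"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017")
        if err != nil {
            panic(err)
        }
        defer session.Close()

        // Invalid: one member uses "localhost" while the other uses a real
        // hostname; mongo requires all members to use one style or the other.
        cfg := bson.D{
            {"_id", "juju"},
            {"version", 2},
            {"members", []bson.D{
                {{"_id", 0}, {"host", "localhost:37017"}},
                {{"_id", 1}, {"host", "controller-1.example.com:37017"}},
            }},
        }
        var result bson.M
        err = session.Run(bson.D{{"replSetReconfig", cfg}}, &result)
        fmt.Println(err) // expect an error about mixed localhost/hostname members
    }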

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Commit https://github.com/juju/juju/commit/f1ca8c74e57fe08f82adf883b66ddde226299857 missed a place where err needed to be converted with errors.Cause. Preparing a patch.
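
For illustration, the Go pattern at fault (the sentinel name here is hypothetical): once an error has been annotated with errors.Trace, a direct equality check against the sentinel fails unless errors.Cause unwraps it first.

    package main

    import (
        "fmt"

        "github.com/juju/errors"
    )

    var errUpgradeInProgress = fmt.Errorf("upgrade in progress")

    func doRestore() error {
        // Somewhere down the call chain the sentinel gets traced/annotated.
        return errors.Trace(errUpgradeInProgress)
    }

    func main() {
        err := doRestore()
        fmt.Println(err == errUpgradeInProgress)               // false: wrapped
        fmt.Println(errors.Cause(err) == errUpgradeInProgress) // true
    }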

Revision history for this message
Cheryl Jennings (cherylj) wrote :

This is taking a while because restore is really quite broken and I'm trying to fix latent issues as well.

Revision history for this message
Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote : Fix Released in juju-core master

Juju-CI verified that this issue is Fix Released in juju-core master:
    http://reports.vapour.ws/releases/3647

Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: 2.0-count
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta1 → none
milestone: none → 2.0-beta1
Revision history for this message
Aaron Bentley (abentley) wrote :

This was fixed in core, but is a top issue for 1.25.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Aaron Bentley (abentley)
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju-core:
status: Invalid → Triaged
Changed in juju-core:
importance: High → Critical
tags: removed: blocker
Revision history for this message
Anastasia (anastasia-macmood) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
Changed in juju-core:
status: Fix Committed → Fix Released