Model Migration Fails on 3rd attempt

Bug #1646310 reported by Alexis Bruemmer
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Christian Muirhead
2.2
Fix Released
High
Christian Muirhead

Bug Description

Recreation steps:

Using lxd provider bootstrap controller a and b
add-model foo on controller a
juju deploy elasticsearch-cluster (wait for deploy to finish)
juju migrate foo b & watch --color -n1 juju status -m b:foo --color (watch the migration complete sucessfully)
juju switch b:controller
juju migrate foo a & watch --color -n1 juju status -m a:foo --color (watch the migration complete successfully)
juju switch a:controller
juju migrate foo b & watch --color -n1 juju status -m b:foo --color

fails with:
ERROR source prechecks failed: machine 0 not running (started)

juju status -m a:foo
Model Controller Cloud/Region Version
foo a localhost/localhost 2.1-beta2.1

App Version Status Scale Charm Store Rev OS Notes
elasticsearch waiting 1 elasticsearch jujucharms 15 ubuntu
kibana waiting 1 kibana jujucharms 14 ubuntu

Unit Workload Agent Machine Public address Ports Message
elasticsearch/0* unknown idle 0 192.168.88.162 9200/tcp
kibana/0* active idle 1 192.168.88.140 80/tcp,9200/tcp ready

Machine State DNS Inst id Series AZ
0 started 192.168.88.162 juju-ebe370-0 trusty
1 started 192.168.88.140 juju-ebe370-1 trusty

Relation Provides Consumes Type
peer elasticsearch elasticsearch peer
rest elasticsearch kibana regular

Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :
Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I can reproduce this. I got really confused by the code and the message here first, because the machine status when everything is fine has always been "started" rather than "running". But looking at the code it's checking InstanceStatus(), rather than Status(). Using show-machine to see the instance status gives this on a fresh unmigrated model:

xtian@marathe:~$ juju show-machine 0
model: m2
machines:
  "0":
    juju-status:
      current: started
      since: 01 Dec 2016 16:44:28+13:00
      version: 2.1-beta2.1
    dns-name: 10.218.39.41
    ip-addresses:
    - 10.218.39.41
    instance-id: juju-1cb8c3-0
    machine-status:
      current: running
      message: Running
      since: 01 Dec 2016 16:42:53+13:00
    series: xenial
    hardware: arch=amd64 cores=0 mem=0M

Note machine-status.current == running

Migrating once gives and showing the same machine gives:

xtian@marathe:~$ juju migrate -c A m2 B
Migration started with ID "6dd2f792-d84e-4c14-81b3-06befa1cb8c3:0"
xtian@marathe:~$ juju switch B:m2
A:admin/m2 -> B:admin/m2
xtian@marathe:~$ juju show-machine 0
model: m2
machines:
  "0":
    juju-status:
      current: started
      since: 01 Dec 2016 16:50:51+13:00
      version: 2.1-beta2.1
    dns-name: 10.218.39.41
    ip-addresses:
    - 10.218.39.41
    instance-id: juju-1cb8c3-0
    machine-status:
      current: running
      message: Running
      since: 01 Dec 2016 16:50:41+13:00
    series: xenial
    hardware: arch=amd64

Machine status is still running. Migrating back gives:

xtian@marathe:~$ juju migrate -c B m2 A
Migration started with ID "6dd2f792-d84e-4c14-81b3-06befa1cb8c3:1"
xtian@marathe:~$ juju switch A:m2
B:admin/m2 -> A:admin/m2
xtian@marathe:~$ juju show-machine 0
model: m2
machines:
  "0":
    juju-status:
      current: started
      since: 01 Dec 2016 16:59:12+13:00
      version: 2.1-beta2.1
    dns-name: 10.218.39.41
    ip-addresses:
    - 10.218.39.41
    instance-id: juju-1cb8c3-0
    machine-status:
      current: started
      since: 01 Jan 1970 12:00:00+12:00
    series: xenial
    hardware: arch=amd64

Now machine-status is started, not running. Which is the source of the error message.

Not sure why that's happening - maybe status and instance status are being stored with the same key in a map and it's coming out the right way the first time by chance?

Still chasing, anyway.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

I found an interesting looking line here:
https://github.com/juju/juju/blob/develop/state/migration_import.go#L339

It looks like this incorrect setting is always done as part of the import, but normally is quickly overwritten by a correct running value. In the migration back, this doesn't happen - I haven't worked out why yet.

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Incidentally, changing that value to Running instead of Started just means that something else breaks when the migration starts - the status of the minimal application I have deployed changes to:

    units:
      min/0:
        workload-status:
          current: unknown
          since: 01 Dec 2016 18:05:28+13:00
        juju-status:
          current: failed
          message: resolver loop error
          since: 01 Dec 2016 18:20:50+13:00
          version: 2.1-beta2.1
        machine: "0"
        public-address: 10.218.39.138

Revision history for this message
Christian Muirhead (2-xtian) wrote :

(And the migration hangs at "starting".)

Revision history for this message
Christian Muirhead (2-xtian) wrote :

This turned out to be because:
* The model manager worker was starting the model workers while the model was still importing back. * The migration master would start and see the *old* migration (because the migration mode was importing rather than none).
* Since the phase was DONE, it exited returning ErrMigrated.
* This would get translated into ErrUninstall, so the dependency engine would remove the migration master worker.
* This then meant that the migration master wouldn't unlock the migration fortress, so workers that depend on ifNotMigrating (like the instance poller) couldn't run.

The fix is to make the model worker manager not start the workers until the model is finished importing. I'm preparing a PR for this - it's complicated by the watcher the model worker manager uses only signalling on lifecycle changes, so the mwm doesn't get told when the import is finished.

Revision history for this message
Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: Triaged → In Progress
assignee: Menno Smits (menno.smits) → Christian Muirhead (2-xtian)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
milestone: 2.1-rc1 → 2.1-beta3
Revision history for this message
Christian Muirhead (2-xtian) wrote :
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.