node marked "Ready" before poweroff complete

Bug #1325610 reported by Blake Rouse
This bug affects 6 people
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Julian Edwards
Milestone: 1.7.0

Bug Description

After bootstrapping and running one deployment, we start the next with the following sequence:

juju destroy-environment -y -e maas
juju bootstrap -e maas

In this case, we're getting handed back one of the machines that the previous maas environment used, but the system hasn't been powered off (i.e. it didn't re-install from scratch). This results in a bootstrap failure. I noticed some oddities in the bootstrap log:

WARNING picked arbitrary tools &{"1.18.3-precise-amd64" "https://streams.canonical.com/juju/tools/releases/juju-1.18.3-precise-amd64.tgz" "c7dee5df130e242c436c43c21278a3f24997d23ca48ee02b93b8126d2f415cd7" %!q(int64=5359855)}
 - /MAAS/api/1.0/nodes/node-9d9cdafa-c4cd-11e3-824b-00163efc5068/
Waiting for address
Attempting to connect to node-1.maas:22
Attempting to connect to 10.245.0.158:22
Warning: Permanently added 'node-1.maas' (ECDSA) to the list of known hosts.
Logging to /var/log/cloud-init-output.log on remote host
Installing add-apt-repository
Adding apt repository: deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/cloud-tools main
Running apt-get update
Running apt-get upgrade
Installing package: git
Installing package: curl
Installing package: cpu-checker
Installing package: bridge-utils
Installing package: rsyslog-gnutls
Installing package: --target-release 'precise-updates/cloud-tools' 'mongodb-server'
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' -o $bin/tools.tar.gz 'https://streams.canonical.com/juju/tools/releases/juju-1.18.3-precise-amd64.tgz'
Starting MongoDB server (juju-db)
2014-06-02 13:43:10,360 - __init__.py[WARNING]: User ubuntu already exists, skipping.

user already exists?

Cloud-init v. 0.7 running 'modules:config' at Mon, 02 Jun 2014 13:43:11 +0000. Up 43.60 seconds.

Going from 2014-06-02 13:41:08 (bootstrap start) to the 13:43:11 cloud-init run above is roughly 2 minutes. That's very fast for a power down, power on, PXE boot, and kernel load.

It appears the host has already been dist-upgraded, and then finally:

start: Job is already running: juju-db

Attempting to start juju-db fails because it is already running, and the bootstrap fails.


Revision history for this message
Blake Rouse (blake-rouse) wrote :

I think this issue is happening because when the node is "deallocated" the celery task for poweroff is started, but MAAS doesn't check that the poweroff has completed before the node is marked Ready. That same node then gets used again, and juju connects to it before it has powered down.

Note: the poweroff command itself is working; after a moment the node does finally turn off.
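
A minimal sketch of the race described above, with hypothetical names rather than the actual MAAS code: the power-off is handed to an asynchronous worker, but the status flips to Ready immediately, so the allocator can hand the still-running machine straight back to juju.

    import threading
    import time

    def power_off(system_id):
        """Stand-in for the celery power-off task: it takes a while to finish."""
        time.sleep(30)  # BMC round-trip, OS shutdown, etc.
        print("%s: power is now off" % system_id)

    def release_node(node):
        # Fire-and-forget: nothing waits for the power-off to complete.
        threading.Thread(target=power_off, args=(node["system_id"],)).start()
        # The status flips to Ready immediately -- this is the race.
        node["status"] = "Ready"

    node = {"system_id": "node-9d9cdafa", "status": "Allocated"}
    release_node(node)
    print(node["status"])  # prints "Ready" while the machine is still powered on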

tags: added: oil
Changed in maas:
importance: Undecided → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

> I think this issue is happening because when the node is "deallocated" the celery task for poweroff is being started,
> but since maas doesn't check that poweroff has completed before the node is marked ready, that same node gets used
> again, and juju connects to the node before it has powered down.

Yeah, I think you're right. The status of the node is instantly set to 'Ready' but it takes a while for the power off task to complete. Fixing this is part of the "robustness" work: MAAS should wait until the node is off before it sets the node back to 'Ready'.
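
A minimal sketch of that behaviour (hypothetical helper names, not the actual fix that landed): dispatch the power-off, then poll the power state and only set the node back to 'Ready' once it reports off.

    import time

    def request_power_off(node):
        """Stand-in for dispatching the asynchronous power-off task."""

    def query_power_state(node):
        """Stand-in for an IPMI/BMC power-state query; returns 'on' or 'off'."""
        return "off"

    def release_node(node, timeout=300, interval=10):
        request_power_off(node)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if query_power_state(node) == "off":
                node["status"] = "Ready"  # safe to reallocate now
                return
            time.sleep(interval)
        # Never came down in time: don't hand the node out again.
        raise RuntimeError("Timed out waiting for %s to power off" % node["system_id"])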

Changed in maas:
status: New → Triaged
tags: added: node-lifecycle
Raphaël Badin (rvb)
tags: added: robustness
Revision history for this message
Julian Edwards (julian-edwards) wrote :

"robustness" is not an official tag. Can we agree on what we're using please! node-lifecycle is what we were previously using.

Raphaël Badin (rvb)
tags: removed: node-lifecycle
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Assigning to rvb since he's been working on power-related stuff lately and I'm pretty sure it's going to fix this eventually ;)

Changed in maas:
importance: Critical → High
assignee: nobody → Raphaël Badin (rvb)
status: Triaged → In Progress
Changed in maas:
milestone: none → 1.7.0
Raphaël Badin (rvb)
Changed in maas:
assignee: Raphaël Badin (rvb) → nobody
status: In Progress → Triaged
tags: added: cloud-installer landscape
Revision history for this message
David Britton (dpb) wrote :

@blake: Adding to this, I have just had a poweroff fail in 1.7, and the node was not marked 'broken'; you probably want to make sure that is part of this fix as well. I'm pasting the output of the celery log.
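
A minimal sketch of the behaviour being asked for here, with hypothetical names (power_off_and_wait, PowerActionError) rather than the real MAAS power-driver API: if the power-off fails, the node should be marked 'broken' instead of going back into the pool.

    class PowerActionError(Exception):
        """Stand-in for a power-driver failure."""

    def power_off_and_wait(node):
        """Stand-in for the release path that waits for the node to turn off."""

    def release_node(node):
        try:
            power_off_and_wait(node)
        except PowerActionError as err:
            # Don't put a machine we can't control back in the pool.
            node["status"] = "Broken"
            node["error_description"] = "Failed to power off: %s" % err
        else:
            node["status"] = "Ready"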

Changed in maas:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Wenhan Shi (w.shi)
description: updated
Revision history for this message
Wenhan Shi (w.shi) wrote :

Hi

Re-opening this bug as my customer is hitting the same issue as the one below, which was reported as a duplicate of this bug #1325610.

  https://bugs.launchpad.net/maas/+bug/1365035

The output in debug mode is as below:

$ juju bootstrap maas maas-controller --bootstrap-constraints tags=bootstrap --debug
06:10:51 INFO cmd bootstrap.go:357 Starting new instance for initial controller
Launching controller instance(s) on maas...
06:10:53 DEBUG juju.cloudconfig.instancecfg instancecfg.go:832 Setting numa ctl preference to false
06:10:53 DEBUG juju.service discovery.go:63 discovered init system "systemd" from series "xenial"
06:10:53 DEBUG juju.provider.maas environ.go:1018 maas user data; 4200 bytes
06:10:56 DEBUG juju.provider.maas interfaces.go:265 interface "ens4" has no links
06:10:56 DEBUG juju.provider.maas interfaces.go:265 interface "ens3" has no links
06:10:56 DEBUG juju.provider.maas environ.go:1050 started instance "7ytpek"
- 7ytpek (arch=amd64 mem=16G cores=8)
06:10:56 INFO juju.environs.bootstrap bootstrap.go:606 newest version: 2.2.4
06:10:56 INFO juju.environs.bootstrap bootstrap.go:621 picked bootstrap agent binary version: 2.2.4
06:10:56 INFO juju.environs.bootstrap bootstrap.go:393 Installing Juju agent on bootstrap instance
06:11:01 INFO cmd bootstrap.go:485 Fetching Juju GUI 2.12.3
06:15:21 DEBUG juju.cloudconfig.instancecfg instancecfg.go:832 Setting numa ctl preference to false
Waiting for address
Attempting to connect to 210.252.76.51:22
06:15:22 DEBUG juju.provider.common bootstrap.go:497 connection attempt for 210.252.76.51 failed: /var/lib/juju/nonce.txt does not exist
06:15:27 DEBUG juju.provider.common bootstrap.go:497 connection attempt for 210.252.76.51 failed: /var/lib/juju/nonce.txt does not exist
06:15:32 DEBUG juju.provider.common bootstrap.go:497 connection attempt for 210.252.76.51 failed: /var/lib/juju/nonce.txt does not exist
06:15:38 INFO juju.cloudconfig userdatacfg_unix.go:410 Fetching agent: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' --retry 10 -o $bin/tools.tar.gz <[https://streams.canonical.com/juju/tools/agent/2.2.4/juju-2.2.4-ubuntu-amd64.tgz]>
Timeout, server 210.252.76.51 not responding.
06:17:38 ERROR juju.cmd.juju.commands bootstrap.go:496 failed to bootstrap model: subprocess encountered error code 255
06:17:38 DEBUG juju.cmd.juju.commands bootstrap.go:497 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:588: failed to bootstrap model} {subprocess encountered error code 255}])
06:17:38 DEBUG juju.cmd.juju.commands bootstrap.go:1095 cleaning up after failed bootstrap
06:17:38 INFO juju.provider.common destroy.go:20 destroying model "controller"
06:17:38 INFO juju.provider.common destroy.go:31 destroying instances
06:17:39 INFO juju.provider.common destroy.go:51 destroying storage
06:17:39 INFO cmd supercommand.go:465 command finished

We checked that the network is working fine. The versions of juju and MAAS are as below:

    juju: 2.2.4-0ubuntu0.16.04.1
    maas: 2.3.0-6434-gd354690-0ubuntu1~16.04.1

This issue can be reproduced in the customer's environmen...
