node marked "Ready" before poweroff complete
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | High | Julian Edwards | 1.7.0
Bug Description
After bootstrapping and running one deployment, we start the next with the following sequence:
juju destroy-environment -y -e maas
juju bootstrap -e maas
In this case, we're handed back one of the machines that the previous maas environment used, but the system hasn't been powered off (i.e. it didn't re-install from scratch). This results in a bootstrap failure. I noticed oddities in the bootstrap log:
WARNING picked arbitrary tools &{"1.18.
- /MAAS/api/
Waiting for address
Attempting to connect to node-1.maas:22
Attempting to connect to 10.245.0.158:22
Warning: Permanently added 'node-1.maas' (ECDSA) to the list of known hosts.
Logging to /var/log/
Installing add-apt-repository
Adding apt repository: deb http://
Running apt-get update
Running apt-get upgrade
Installing package: git
Installing package: curl
Installing package: cpu-checker
Installing package: bridge-utils
Installing package: rsyslog-gnutls
Installing package: --target-release 'precise-
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' -o $bin/tools.tar.gz 'https:/
Starting MongoDB server (juju-db)
2014-06-02 13:43:10,360 - __init_
user already exists?
Cloud-init v. 0.7 running 'modules:config' at Mon, 02 Jun 2014 13:43:11 +0000. Up 43.60 seconds.
Going from 13:41:08 (bootstrap start) to now is roughly 2 minutes. That's very fast for a full cycle of power down, power off, power on, PXE boot, and kernel load.
It appears the host had already been dist-upgraded, and then finally:
start: Job is already running: juju-db
Attempting to start juju-db fails, and so does the bootstrap.
Related branches
- Jeroen T. Vermeulen (community): Approve
Diff: 392 lines (+107/-82), 6 files modified:
src/maasserver/api/tests/test_node.py (+8/-59)
src/maasserver/enum.py (+3/-0)
src/maasserver/models/node.py (+29/-2)
src/maasserver/models/tests/test_node.py (+57/-13)
src/maasserver/node_action.py (+0/-1)
src/maasserver/tests/test_node_action.py (+10/-7)
Changed in maas:
importance: Undecided → Critical
Changed in maas:
status: New → Triaged
tags: added: node-lifecycle
tags: added: robustness
tags: removed: node-lifecycle
Changed in maas:
milestone: none → 1.7.0
Changed in maas:
assignee: Raphaël Badin (rvb) → nobody
status: In Progress → Triaged
tags: added: cloud-installer landscape
Changed in maas:
assignee: nobody → Julian Edwards (julian-edwards)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
description: updated
I think this issue happens because when the node is deallocated, the celery task for poweroff is started, but MAAS doesn't check that the poweroff has actually completed before the node is marked Ready. The same node therefore gets allocated again, and juju connects to it before it has powered down.
Note: the poweroff command does work; after a moment the node finally turns off.
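The race described above can be sketched in a few lines. This is a hypothetical illustration, not actual MAAS code: `Node`, `power_off`, and the `release_*` methods are made-up stand-ins for the real state machine and celery task. The buggy path flips the node to READY as soon as the asynchronous poweroff is dispatched; the fixed path only transitions once power-off is confirmed.

```python
import threading
import time

class Node:
    """Toy stand-in for a MAAS node (hypothetical, for illustration only)."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.status = "ALLOCATED"
        self.powered_on = True

    def power_off(self, delay=0.2):
        """Asynchronous power-off, standing in for the celery task."""
        def worker():
            time.sleep(delay)           # the BMC takes a while to act
            self.powered_on = False
        threading.Thread(target=worker).start()

    def release_buggy(self):
        self.power_off()
        self.status = "READY"           # bug: READY while still powered on

    def release_fixed(self, timeout=5.0):
        self.power_off()
        deadline = time.time() + timeout
        while self.powered_on and time.time() < deadline:
            time.sleep(0.05)            # poll until power-off completes
        if not self.powered_on:
            self.status = "READY"       # only READY once actually off

node = Node("node-1.maas")
node.release_buggy()
print(node.status, node.powered_on)     # READY True  -> the race window

node2 = Node("node-1.maas")
node2.release_fixed()
print(node2.status, node2.powered_on)   # READY False -> safe to reallocate
```

In the buggy path there is a window where the node is READY but still powered on, which is exactly when a fresh `juju bootstrap` can grab it and SSH into the old deployment. The fix that landed (see the diff touching src/maasserver/models/node.py and enum.py) presumably defers the READY transition in a similar spirit, though the real implementation is event-driven rather than a polling loop.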