Failed to destroy-environment when node is in commissioning or new state

Bug #1381619 reported by Andres Rodriguez
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Unassigned
juju-core
Fix Released
Critical
Michael Foord

Bug Description

ubuntu@maas-trusty-back-may22:~⟫ juju destroy-environment -e maas --force
2014-10-15 16:19:18 WARNING juju.cmd.juju destroyenvironment.go:52 -e/--environment flag is deprecated in 1.18, please supply environment as a positional parameter
WARNING! this command will destroy the "maas" environment (type: maas)
This includes all machines, services, data and other resources.

Continue [y/N]? y
ERROR gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-2bb9ea2e-1cdc-11e4-a1a8-00163eca07b6 ('Commissioning'), node-a66d0b4a-24b4-11e4-8a6a-00163eca07b6 ('New').)

Changed in maas:
assignee: nobody → Raphaël Badin (rvb)
summary: - Failed to destroy-environment
+ Failed to destroy-environment when node is in commissioning state
summary: - Failed to destroy-environment when node is in commissioning state
+ Failed to destroy-environment when node is in commissioning or new state
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Umm, why is juju trying to release machines that are not in the allocated state?

Changed in maas:
status: New → Incomplete
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This doesn't look like a MAAS bug. It is most likely one of:
 1. Juju not keeping track of allocated nodes
 2. Someone having manually changed the state of a node in MAAS

Revision history for this message
Christian Reis (kiko) wrote :

Not for 1.7 as it's not a regression; moving on.

Changed in maas:
assignee: Raphaël Badin (rvb) → nobody
Changed in maas:
importance: Undecided → Critical
tags: added: oil
Revision history for this message
Greg Lutostanski (lutostag) wrote :

Just hit the other side of this coin: it looks like juju tried to grab a node in the releasing state...

Attempting to connect to 10.245.0.164:22
ERROR bootstrap failed: refreshing addresses: no instances found
Stopping instance...
ERROR cannot stop failed bootstrap instance "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/": gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068 ('Releasing').)

I am fairly certain this is the event log for the node in question at the time:
INFO 50 minutes ago Node powered off
INFO 50 minutes ago Node changed status — From 'Releasing' to 'Ready'
INFO 50 minutes ago Node changed status — From 'Deploying' to 'Releasing'
INFO 50 minutes ago Powering node off
INFO 52 minutes ago Node powered on
INFO 52 minutes ago Powering node on
INFO 52 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 52 minutes ago Node changed status — From 'Ready' to 'Allocated' (to oil-slave-7)
INFO 53 minutes ago Node powered off
INFO 53 minutes ago Node changed status — From 'Releasing' to 'Ready'
INFO 54 minutes ago Node changed status — From 'Deployed' to 'Releasing'
INFO 54 minutes ago Powering node off

This seems to indicate that MAAS only went in one direction, and the node was ready/off before being deployed again -- but it all happened within a minute, so could this be a race?

Christian Reis (kiko)
Changed in maas:
milestone: none → next
Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 1381619] Re: Failed to destroy-environment when node is in commissioning or new state

On Thursday 23 Oct 2014 21:53:59 you wrote:
> Attempting to connect to 10.245.0.164:22
> ERROR bootstrap failed: refreshing addresses: no instances found
> Stopping instance...
> ERROR cannot stop failed bootstrap instance
> "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/":
> gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be
> released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068
> ('Releasing').)

I suspect here that the status of releasing was folded back to "allocated" so
Juju thinks it needs to retry the release op.

Changed in maas:
status: Incomplete → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.21-alpha3
tags: added: cloud-installer
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Thinking about this some more, after I spoke to a juju dev:

The situation that Andres reports cannot arise unless someone has removed machines from underneath Juju's control without telling it. I don't think that's a MAAS bug at all, unless you want the UI to totally block access to API-allocated nodes. I don't think that's wise; an admin needs to be able to recover problematic nodes. Andres, do you know if that is what happened? (marking incomplete)

Greg, your situation looks like a separate bug. Can you please file a separate report and include recreation instructions, MAAS/juju versions, and full logs?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Raphaël Badin (rvb) wrote :

> Attempting to connect to 10.245.0.164:22
> ERROR bootstrap failed: refreshing addresses: no instances found
> Stopping instance...
> ERROR cannot stop failed bootstrap instance "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/": gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068 ('Releasing').)

As Julian said, this looks like a separate and genuine bug. Recreation instructions would be great here.

Making 'release' a no-op when a node is already 'releasing' is one way to avoid breakages like this one. It would also avoid the errors that occur when two concurrent requests release the same node.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

@Greg: I tried to recreate your problem but I can't. Can you tell me which version of Juju you're using? (The only way I managed to make 'destroy env' break is by using the disk erasing feature, see bug 1386327)

Curtis Hovey (sinzui)
no longer affects: juju-core/1.20
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

Please see comment #8; the team needs reproduction steps to move forward.

Changed in juju-core:
status: Triaged → Incomplete
Christian Reis (kiko)
Changed in maas:
milestone: next → 1.7.1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21-alpha3 → 1.21-beta1
Revision history for this message
Ian Booth (wallyworld) wrote :

Juju tries to destroy the running nodes gracefully and then, as a final cleanup, finds any still-running nodes and attempts to destroy them (to account for the case where the first attempt failed to clean up some machines). This second sweep across the machines may attempt to destroy, a second time, a machine that is still in the process of being decommissioned. Juju should ignore such errors.
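The two-pass teardown Ian describes can be sketched roughly as follows; the `node` type, the `destroyAll` helper, and the `errConflict` value are hypothetical stand-ins for illustration, not juju's actual provisioner API.

```go
package main

import "errors"

// Hypothetical stand-in for a MAAS 409 CONFLICT during release.
var errConflict = errors.New("409 CONFLICT: node cannot be released in its current state")

// node is an illustrative stand-in for a provider instance.
type node struct {
	id      string
	running bool
}

// destroyAll makes a graceful first pass over all nodes, then a final
// sweep over anything still running (covering the case where the first
// pass failed to clean up some machines). The sweep can race with MAAS
// still decommissioning a node from the first pass, so a conflict on
// the second attempt is treated as "already being released" and
// ignored rather than failing the whole teardown.
func destroyAll(nodes []*node, release func(*node) error) error {
	for _, n := range nodes { // graceful first pass
		if err := release(n); err == nil {
			n.running = false
		}
	}
	for _, n := range nodes { // final cleanup sweep
		if !n.running {
			continue
		}
		switch err := release(n); {
		case err == nil:
			n.running = false
		case errors.Is(err, errConflict):
			// Node is mid-decommission; ignore, as suggested above.
		default:
			return err
		}
	}
	return nil
}
```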

Changed in juju-core:
status: Incomplete → Triaged
John A Meinel (jameinel)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Revision history for this message
Michael Foord (mfoord) wrote :

States allocated, deploying, and deployed need cleanup. Other states we ignore and treat as "already done". Also, if destroying a node fails because the node is already in a "destroying" state, we ignore the error.
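A minimal sketch of that state classification; `needsCleanup` is a hypothetical helper, and the state names mirror MAAS's display names rather than juju's actual constants.

```go
package main

// needsCleanup reports whether a MAAS node state is one that
// destroy-environment must actively release. Any other state is
// treated as "already done" and skipped.
func needsCleanup(state string) bool {
	switch state {
	case "Allocated", "Deploying", "Deployed":
		return true
	default:
		return false
	}
}
```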

Revision history for this message
Michael Foord (mfoord) wrote :

If we see unrecognised states then destroy-environment should fail (and not delete the jenv file, etc.); that way destroy-environment can be retried if we have left stuff running.

Revision history for this message
Michael Foord (mfoord) wrote :

As a further note: releasing a node and then issuing destroy-environment seems to hang. Not sure yet for how long...

Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.21-beta1 → 1.21-beta2
Revision history for this message
Michael Foord (mfoord) wrote :

I can reproduce this by switching on disk erase on release. The call to "release" then returns an error:

gomaasapi.ServerError{error:(*errors.errorString)(0xc21008c1c0), StatusCode:409}
panic: gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-51abffb8-6699-11e4-923e-525400512a8c ('Disk erasing').)

Revision history for this message
Michael Foord (mfoord) wrote :

The release method of the MAAS API server can either succeed, fail to find the node (error 404), or find the node in a "non-releasable state", in which case error 409 is returned.

The releasable statuses are:

# List of statuses for which it makes sense to release a node.
RELEASABLE_STATUSES = [
    NODE_STATUS.ALLOCATED,
    NODE_STATUS.RESERVED,
    NODE_STATUS.BROKEN,
    NODE_STATUS.DEPLOYING,
    NODE_STATUS.DEPLOYED,
    NODE_STATUS.FAILED_DEPLOYMENT,
    NODE_STATUS.FAILED_DISK_ERASING,
    NODE_STATUS.FAILED_RELEASING,
    ]

A simple fix is to ignore error 409 when calling StopInstances. "release_or_erase" can itself return an error, of course, but from scanning the code there it doesn't look like it can return a 409. So it should be safe to assume that a 409 can always be ignored.
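That fix can be sketched like this; `ServerError` mirrors the shape of the gomaasapi error shown in the panic above (an error plus an HTTP status code), but `releaseIgnoring409` and the exact wiring into StopInstances are hypothetical.

```go
package main

import "errors"

// ServerError mirrors the shape of gomaasapi.ServerError from the
// panic above: an error message plus the HTTP status code from MAAS.
type ServerError struct {
	Message    string
	StatusCode int
}

func (e ServerError) Error() string { return e.Message }

// releaseIgnoring409 calls release and swallows 409 CONFLICT, which
// MAAS returns when a node is in a non-releasable state (e.g. already
// 'Releasing' or 'Disk erasing'); any other error is propagated.
func releaseIgnoring409(release func() error) error {
	err := release()
	var se ServerError
	if errors.As(err, &se) && se.StatusCode == 409 {
		return nil // node cannot be released right now; treat as done
	}
	return err
}
```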

Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
status: Incomplete → Invalid
Changed in maas:
milestone: 1.7.2 → 1.7.3
milestone: 1.7.3 → none