Failed to destroy-environment when node is in commissioning or new state

Bug #1381619 reported by Andres Rodriguez
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Unassigned
juju-core
Fix Released
Critical
Michael Foord

Bug Description

ubuntu@maas-trusty-back-may22:~⟫ juju destroy-environment -e maas --force
2014-10-15 16:19:18 WARNING juju.cmd.juju destroyenvironment.go:52 -e/--environment flag is deprecated in 1.18, please supply environment as a positional parameter
WARNING! this command will destroy the "maas" environment (type: maas)
This includes all machines, services, data and other resources.

Continue [y/N]? y
ERROR gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-2bb9ea2e-1cdc-11e4-a1a8-00163eca07b6 ('Commissioning'), node-a66d0b4a-24b4-11e4-8a6a-00163eca07b6 ('New').)

Changed in maas:
assignee: nobody → Raphaël Badin (rvb)
summary: - Failed to destroy-environment
+ Failed to destroy-environment when node is in commissioning state
summary: - Failed to destroy-environment when node is in commissioning state
+ Failed to destroy-environment when node is in commissioning or new state
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Umm, why is juju trying to release machines that are not in the allocated state?

Changed in maas:
status: New → Incomplete
Revision history for this message
Julian Edwards (julian-edwards) wrote :

This doesn't look like a MAAS bug. It is most likely one of:
 1. Juju not keeping track of allocated nodes
 2. Someone having manually changed the state of a node in MAAS

Revision history for this message
Christian Reis (kiko) wrote :

Not for 1.7 as it's not a regression; moving on.

Changed in maas:
assignee: Raphaël Badin (rvb) → nobody
Changed in maas:
importance: Undecided → Critical
tags: added: oil
Revision history for this message
Greg Lutostanski (lutostag) wrote :

Just hit the other side of this coin: it looks like juju tried to grab a node in the releasing state...

Attempting to connect to 10.245.0.164:22
ERROR bootstrap failed: refreshing addresses: no instances found
Stopping instance...
ERROR cannot stop failed bootstrap instance "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/": gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068 ('Releasing').)

I am fairly certain this is the event log for the node in question at the time:
INFO 50 minutes ago Node powered off
INFO 50 minutes ago Node changed status — From 'Releasing' to 'Ready'
INFO 50 minutes ago Node changed status — From 'Deploying' to 'Releasing'
INFO 50 minutes ago Powering node off
INFO 52 minutes ago Node powered on
INFO 52 minutes ago Powering node on
INFO 52 minutes ago Node changed status — From 'Allocated' to 'Deploying'
INFO 52 minutes ago Node changed status — From 'Ready' to 'Allocated' (to oil-slave-7)
INFO 53 minutes ago Node powered off
INFO 53 minutes ago Node changed status — From 'Releasing' to 'Ready'
INFO 54 minutes ago Node changed status — From 'Deployed' to 'Releasing'
INFO 54 minutes ago Powering node off

This seems to indicate that MAAS only went in one direction, and the node was ready/off before being deployed again -- but it all happened within a minute, so could this be a race?

Christian Reis (kiko)
Changed in maas:
milestone: none → next
Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 1381619] Re: Failed to destroy-environment when node is in commissioning or new state

On Thursday 23 Oct 2014 21:53:59 you wrote:
> Attempting to connect to 10.245.0.164:22
> ERROR bootstrap failed: refreshing addresses: no instances found
> Stopping instance...
> ERROR cannot stop failed bootstrap instance
> "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/":
> gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be
> released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068
> ('Releasing').)

I suspect here that the status of releasing was folded back to "allocated" so
Juju thinks it needs to retry the release op.

Changed in maas:
status: Incomplete → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.21-alpha3
tags: added: cloud-installer
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Thinking about this some more, after I spoke to a juju dev:

The situation that Andres reports cannot arise unless someone has removed machines from underneath Juju's control without telling it. I don't think that's a MAAS bug at all, unless you want the UI to totally block access to API-allocated nodes. I don't think that's wise; an admin needs to be able to recover problematic nodes. Andres, do you know if that is what happened? (marking incomplete)

Greg, your situation looks like a separate bug. Can you please file a separate report and include recreation instructions, MAAS/juju versions, and full logs?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Raphaël Badin (rvb) wrote :

> Attempting to connect to 10.245.0.164:22
> ERROR bootstrap failed: refreshing addresses: no instances found
> Stopping instance...
> ERROR cannot stop failed bootstrap instance "/MAAS/api/1.0/nodes/node-bad3bc40-cfc9-11e3-a833-00163efc5068/": gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-bad3bc40-cfc9-11e3-a833-00163efc5068 ('Releasing').)

As Julian said, this looks like a separate and genuine bug. Recreation instructions would be great here.

Making 'release' a no-op when a node is already 'releasing' is one way to avoid breakages like this one. It would also avoid the errors that occur when two concurrent requests release the same node.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

@Greg: I tried to recreate your problem but I can't. Can you tell me which version of Juju you're using? (The only way I managed to make 'destroy env' break is by using the disk erasing feature, see bug 1386327)

Curtis Hovey (sinzui)
no longer affects: juju-core/1.20
Revision history for this message
Alexis Bruemmer (alexis-bruemmer) wrote :

Please see comment #8; the team needs reproduction steps to move forward.

Changed in juju-core:
status: Triaged → Incomplete
Christian Reis (kiko)
Changed in maas:
milestone: next → 1.7.1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21-alpha3 → 1.21-beta1
Revision history for this message
Ian Booth (wallyworld) wrote :

Juju tries to destroy the running nodes gracefully and then, as a final cleanup, finds any still-running nodes and attempts to destroy them (to account for the case where the first attempt failed to clean up some machines). This second sweep across the machines may attempt to destroy, a second time, a machine that is still in the process of being decommissioned. Juju should ignore such errors.
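The two-pass teardown Ian describes can be sketched roughly as follows; the `node` type, the `destroyAll` helper, and the `errConflict` value are hypothetical stand-ins for illustration, not juju's actual provisioner API.

```go
package main

import "errors"

// Hypothetical stand-in for a MAAS 409 CONFLICT during release.
var errConflict = errors.New("409 CONFLICT: node cannot be released in its current state")

// node is an illustrative stand-in for a provider instance.
type node struct {
	id      string
	running bool
}

// destroyAll makes a graceful first pass over all nodes, then a final
// sweep over anything still running (covering the case where the first
// pass failed to clean up some machines). The sweep can race with MAAS
// still decommissioning a node from the first pass, so a conflict on
// the second attempt is treated as "already being released" and
// ignored rather than failing the whole teardown.
func destroyAll(nodes []*node, release func(*node) error) error {
	for _, n := range nodes { // graceful first pass
		if err := release(n); err == nil {
			n.running = false
		}
	}
	for _, n := range nodes { // final cleanup sweep
		if !n.running {
			continue
		}
		switch err := release(n); {
		case err == nil:
			n.running = false
		case errors.Is(err, errConflict):
			// Node is mid-decommission; ignore, as suggested above.
		default:
			return err
		}
	}
	return nil
}
```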

Changed in juju-core:
status: Incomplete → Triaged
John A Meinel (jameinel)
Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Revision history for this message
Michael Foord (mfoord) wrote :

States allocated, deploying, and deployed need cleanup. Other states we ignore and treat as "already done". Also, if destroying a node fails because the node is already in a "destroying" state, we ignore the error.
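A minimal sketch of that state classification; `needsCleanup` is a hypothetical helper, and the state names mirror MAAS's display names rather than juju's actual constants.

```go
package main

// needsCleanup reports whether a MAAS node state is one that
// destroy-environment must actively release. Any other state is
// treated as "already done" and skipped.
func needsCleanup(state string) bool {
	switch state {
	case "Allocated", "Deploying", "Deployed":
		return true
	default:
		return false
	}
}
```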

Revision history for this message
Michael Foord (mfoord) wrote :

If we see unrecognised states then destroy-environment should fail (and not delete the jenv file, etc.); that way destroy-environment can be retried if we have left stuff running.

Revision history for this message
Michael Foord (mfoord) wrote :

As a further note: releasing a node and then issuing destroy-environment seems to hang. Not sure yet for how long...

Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.21-beta1 → 1.21-beta2
Revision history for this message
Michael Foord (mfoord) wrote :

I can reproduce this by switching on disk erase on release. The call to "release" then returns an error:

gomaasapi.ServerError{error:(*errors.errorString)(0xc21008c1c0), StatusCode:409}
panic: gomaasapi: got error back from server: 409 CONFLICT (Node(s) cannot be released in their current state: node-51abffb8-6699-11e4-923e-525400512a8c ('Disk erasing').)

Revision history for this message
Michael Foord (mfoord) wrote :

The release method of the MAAS API server can either succeed, fail to find the node (error 404), or find the node in a "non-releasable state", in which case error 409 is returned.

The releasable statuses are:

# List of statuses for which it makes sense to release a node.
RELEASABLE_STATUSES = [
    NODE_STATUS.ALLOCATED,
    NODE_STATUS.RESERVED,
    NODE_STATUS.BROKEN,
    NODE_STATUS.DEPLOYING,
    NODE_STATUS.DEPLOYED,
    NODE_STATUS.FAILED_DEPLOYMENT,
    NODE_STATUS.FAILED_DISK_ERASING,
    NODE_STATUS.FAILED_RELEASING,
    ]

A simple fix is to ignore error 409 when calling StopInstances. "release_or_erase" can itself return an error, of course, but from scanning the code there it doesn't look like it can return a 409. So it should be safe to assume that a 409 can always be ignored.
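That fix can be sketched like this; `ServerError` mirrors the shape of the gomaasapi error shown in the panic above (an error plus an HTTP status code), but `releaseIgnoring409` and the exact wiring into StopInstances are hypothetical.

```go
package main

import "errors"

// ServerError mirrors the shape of gomaasapi.ServerError from the
// panic above: an error message plus the HTTP status code from MAAS.
type ServerError struct {
	Message    string
	StatusCode int
}

func (e ServerError) Error() string { return e.Message }

// releaseIgnoring409 calls release and swallows 409 CONFLICT, which
// MAAS returns when a node is in a non-releasable state (e.g. already
// 'Releasing' or 'Disk erasing'); any other error is propagated.
func releaseIgnoring409(release func() error) error {
	err := release()
	var se ServerError
	if errors.As(err, &se) && se.StatusCode == 409 {
		return nil // node cannot be released right now; treat as done
	}
	return err
}
```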

Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Changed in maas:
milestone: 1.7.1 → 1.7.2
Changed in maas:
status: Incomplete → Invalid
Changed in maas:
milestone: 1.7.2 → 1.7.3
milestone: 1.7.3 → none