Node stuck on DEPLOYING (potentially all *ING) state(s)

Bug #1461937 reported by Lucas Alvares Gomes
This bug affects 4 people
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Lucas Alvares Gomes
Milestone: 4.0.0

Bug Description

If a conductor deploying a node dies because of the OOM killer (or for any other reason) while the provision_state of the node is DEPLOYING (the state entered after the ramdisk has POSTed back), the node will get stuck in that state.

Currently we don't have a timeout for deployments after the DEPLOYWAIT state, and we also don't offer any API that the user can use to get the node out of the DEPLOYING state.

How to reproduce:

The easiest way to reproduce it is to simulate the conductor abruptly dying by killing it with "kill -9". So:

1) Start a deployment
2) Keep looking at the provision state of the node
3) Once it goes from "wait for call-back" to "deploying", kill the ironic-conductor with: kill -9 <ir-cond PID>
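
Step 2 above can be automated with a small polling helper. This is only a sketch: `wait_for_state` and the `get_state` callable are hypothetical names, and how the provision state is actually fetched (CLI, REST API, database) is left to the caller. The demo at the bottom feeds it canned values instead of talking to Ironic.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "deploying",
    timeout: float = 600.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it returns target; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Canned states standing in for a real provision_state lookup (assumption).
_fake_states = iter(["wait for call-back", "wait for call-back", "deploying"])
reached = wait_for_state(
    lambda: next(_fake_states), interval=0, sleep=lambda _: None
)
print(reached)
```

Once the helper returns True, the conductor can be killed as described in step 3.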

Workaround:

The only way I found was to modify the database, changing the provision_state of the node from "deploying" to "deploy failed", and then you can do:

$ ironic node-set-provision-state <node uuid> deleted

To clean the node up.
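
The database edit in the workaround could look roughly like the sketch below. It is not a supported procedure. The table name `nodes` and the columns `uuid` and `provision_state` are assumptions about the schema, the helper `mark_deploy_failed` is hypothetical, and an in-memory SQLite database stands in for the real Ironic database so the example can run on its own.

```python
import sqlite3


def mark_deploy_failed(conn: sqlite3.Connection, node_uuid: str) -> bool:
    """Flip one node from 'deploying' to 'deploy failed'.

    Returns True only if exactly one row was changed, so a node that is
    in any other state is left untouched.
    """
    cur = conn.execute(
        "UPDATE nodes SET provision_state = 'deploy failed' "
        "WHERE uuid = ? AND provision_state = 'deploying'",
        (node_uuid,),
    )
    conn.commit()
    return cur.rowcount == 1


# Stand-in database with an assumed, heavily simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY, provision_state TEXT)")
conn.execute("INSERT INTO nodes VALUES ('node-1', 'deploying')")
changed = mark_deploy_failed(conn, "node-1")
print(changed)
```

Restricting the UPDATE to rows that are still in "deploying" avoids clobbering a node whose state changed in the meantime.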

Tags: conductor
description: updated
Revision history for this message
Dmitry Tantsur (divius) wrote :

I believe we should have timeouts for all transient states.

Changed in ironic:
importance: Undecided → High
status: New → Triaged
tags: added: conductor
description: updated
Tan Lin (tan-lin-good)
Changed in ironic:
assignee: nobody → Tan Lin (tan-lin-good)
Revision history for this message
Tan Lin (tan-lin-good) wrote :

It's reasonable to have timeouts for all transient states. I will add checks such as checkout_deploying_timeouts, checkout_delete_timeouts and checkout_clean_timeouts.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/188688

Revision history for this message
Tan Lin (tan-lin-good) wrote :

Things are getting interesting. Once the conductor restarts, it clears all locks held by that host during its init process by running "self.dbapi.clear_node_reservations_for_conductor(self.host)".

So we have two cases now:
1. Broken by a restart of the conductor (reservation=None)
2. Broken by something else, so the reservation is still fake-reserv

Case one is similar to what Ironic does in _check_deploy_timeouts.
In case two, we have to release the lock on a reserved node, maybe with something named "force_release_resources", like release_resource() in task_manager, but I am not sure whether this should be allowed.
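
The two cases can be told apart by looking at the node's reservation field. The sketch below only illustrates that distinction; `classify_stuck_node` and its return strings are hypothetical and are not part of Ironic.

```python
from typing import Optional


def classify_stuck_node(provision_state: str, reservation: Optional[str]) -> str:
    """Classify a node according to the two cases described above."""
    if provision_state != "deploying":
        return "not stuck"
    if reservation is None:
        # Case 1: the conductor restarted and cleared its own reservations.
        return "lock cleared"
    # Case 2: the conductor went away without releasing the reservation.
    return "lock still held"


print(classify_stuck_node("deploying", None))
print(classify_stuck_node("deploying", "fake-reserv"))
```

Only the second case needs some form of forced lock release before the deployment can be marked as failed.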

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/189587

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Hi Tan,

Any news on this?

Revision history for this message
Tan Lin (tan-lin-good) wrote :

I proposed two patches to review:
https://review.openstack.org/#/c/188688/ Broken by a restart of the conductor (reservation=None)
https://review.openstack.org/#/c/189587/ Broken by something else, so the reservation is still held

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Hi Tan,

Please also take a look at https://review.openstack.org/#/c/194132/

Changed in ironic:
assignee: Tan Lin (tan-lin-good) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Tan Lin (tan-lin-good)
Changed in ironic:
assignee: Tan Lin (tan-lin-good) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Tan Lin (tan-lin-good)
Changed in ironic:
assignee: Tan Lin (tan-lin-good) → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

So this bug is actually invalid for the agent driver -- the node is not locked for much of the deployment and can actually recover just fine from a conductor crash (heartbeats will just be routed to another conductor). Raising another bug around how https://review.openstack.org/#/c/194132/ breaks this.

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

Maybe I'll just keep the bug discussion here. tl;dr 194132 assumes that anything in DEPLOYING should also be locked, which is an invalid assumption with the agent driver.

Revision history for this message
Ruby Loo (rloo) wrote :

I'm a bit confused. Lucas submitted this patch to "Periodically checks the status of nodes in DEPLOYING state" : https://review.openstack.org/#/c/197141/

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

So I didn't really get around to this today, but poked around a bit. Fixing the agent driver to properly use DEPLOYWAIT is likely the right thing to do; however, the agent relies on the DEPLOYING state to know whether it should start or finish the deploy.

I'd love for commands to signal back to a specific endpoint or something, rather than just riding the heartbeat for everything we do. However this is a lot of work and will have backwards compat concerns etc.

Short term, maybe we can fix 194132 by moving the block to fail the deploys above where the conductor releases the lock? Would need to play with it a bit.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/200153

Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Jim Rollenhagen (jim-rollenhagen)
assignee: Jim Rollenhagen (jim-rollenhagen) → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/200153
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=f1929f0155e25c83bafe64c3d235880fc486f323
Submitter: Jenkins
Branch: master

commit f1929f0155e25c83bafe64c3d235880fc486f323
Author: Jim Rollenhagen <email address hidden>
Date: Thu Jul 9 08:39:04 2015 -0700

    Use DEPLOYWAIT while waiting for agent to write image

    There is an assumption that anything in DEPLOYING state is locked; the
    agent driver breaks that assumption. Fix it by using DEPLOYWAIT while
    waiting for the agent to write an image.

    Change-Id: I4957bd9608b1bc92177e69efef66ecb951181de1
    Related-Bug: #1461937

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/197141
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=e2a145e27d874d6c46a63340abf3510fc3169fbe
Submitter: Jenkins
Branch: master

commit e2a145e27d874d6c46a63340abf3510fc3169fbe
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Jun 30 13:05:01 2015 +0100

    Periodically checks the status of nodes in DEPLOYING state

    Periodically checks the nodes in DEPLOYING and the state of the conductor
    deploying them. If we find out that a conductor that was provisioning
    the node has died we then release the node and gracefully mark
    the deployment as failed.

    A new method called "get_offline_conductors" was added to the database
    api which returns a list of conductor hostnames that are offline.

    Closes-Bug: #1461937
    Change-Id: I293d3ff18e949b305e6404b8d3ceb75e29b88356
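
The logic described in this commit message can be sketched as follows. This is not the merged code: the `Node` class and `fail_orphaned_deployments` are hypothetical, the list of offline hostnames is passed in directly rather than fetched with "get_offline_conductors", and the sketch assumes the reservation field holds the hostname of the conductor that locked the node.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Node:
    uuid: str
    provision_state: str
    reservation: Optional[str]  # assumed: hostname of the locking conductor


def fail_orphaned_deployments(
    nodes: Iterable[Node], offline_conductors: Iterable[str]
) -> List[str]:
    """Release and fail DEPLOYING nodes locked by an offline conductor."""
    offline = set(offline_conductors)
    failed = []
    for node in nodes:
        if node.provision_state == "deploying" and node.reservation in offline:
            node.reservation = None                 # release the stale lock
            node.provision_state = "deploy failed"  # mark the deploy as failed
            failed.append(node.uuid)
    return failed


nodes = [
    Node("n1", "deploying", "cond-a"),  # its conductor is offline
    Node("n2", "deploying", "cond-b"),  # its conductor is still alive
    Node("n3", "active", None),
]
print(fail_orphaned_deployments(nodes, ["cond-a"]))
```

Nodes locked by a live conductor are left alone, so a deployment that is merely slow is not failed by this check.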

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
milestone: none → 4.0.0
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Tan Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/188688

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Tan Lin (<email address hidden>) on branch: master
Review: https://review.openstack.org/189587

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/349971

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/350439
