Ironic

Cleaning may leave nodes locked and require manual intervention to unlock

Bug #1442810 reported by Josh Gachnang on 2015-04-10

This bug affects 1 person

Affects   Status         Importance   Assigned to     Milestone
Ironic    Fix Released   High         Josh Gachnang   Ironic 2015.1.0 "kilo"

Bug Description

When attempting to clean multiple nodes at the same time, the conductors become sluggish and the logs fill with errors about failing to acquire locks at the beginning of conductor.manager.ConductorManager.continue_node_clean. The API becomes less responsive and requests start timing out. I believe the problem is that we are doing an RPC call instead of a cast.

With rpc.call(), I had trouble getting more than 2 or 3 nodes through cleaning (some would make it fine, others would deadlock). With rpc.cast(), I got 20 nodes to go through simultaneously without any issues I could see, with each node running 8 steps, each requiring the use of the continue_node_cleaning RPC. Both tests used 2 conductors.
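
To illustrate the difference described above, here is a minimal in-process sketch of "call" versus "cast" semantics. It is a toy model and not the oslo.messaging API or Ironic's code: ToyRPCServer, slow_clean_step and timed are hypothetical names invented for this example. The point is only that call() blocks the sender until the handler finishes, while cast() returns as soon as the request is queued.

```python
import queue
import threading
import time


class ToyRPCServer:
    """Hypothetical stand-in for a conductor's RPC endpoint (not Ironic code)."""

    def __init__(self):
        self._requests = queue.Queue()
        worker = threading.Thread(target=self._serve, daemon=True)
        worker.start()

    def _serve(self):
        # Process queued requests one at a time on the worker thread.
        while True:
            func, args, reply = self._requests.get()
            try:
                result = (True, func(*args))
            except Exception as exc:
                result = (False, exc)
            if reply is not None:
                # Only a call() supplies a reply queue.
                reply.put(result)
            self._requests.task_done()

    def call(self, func, *args):
        """Send a request and block until the handler's reply arrives."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((func, args, reply))
        ok, value = reply.get()
        if not ok:
            raise value  # handler errors travel back to the sender
        return value

    def cast(self, func, *args):
        """Queue a request and return immediately, with no reply."""
        self._requests.put((func, args, None))

    def wait_idle(self):
        """Block until every queued request has been handled."""
        self._requests.join()


def slow_clean_step(node, done):
    # Pretend a clean step takes a while on the conductor side.
    time.sleep(0.2)
    done.append(node)
    return node


def timed(send, node, done):
    # Measure how long the sender is blocked by send().
    start = time.monotonic()
    send(slow_clean_step, node, done)
    return time.monotonic() - start
```

In this sketch, timed(server.call, ...) takes roughly as long as the clean step itself, while timed(server.cast, ...) returns almost immediately and the step completes later on the worker thread.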

As a note, this only occurs in the agent driver currently, but would happen in any driver doing asynchronous cleaning steps.

Josh Gachnang (joshnang) on 2015-04-10

description:	updated
description:	updated

OpenStack Infra (hudson-openstack) wrote on 2015-04-10: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/172582

Changed in ironic:
assignee:	nobody → Josh Gachnang (joshnang)
status:	New → In Progress

John Stafford (john-stafford) on 2015-04-10

Changed in ironic:
importance:	Undecided → High
milestone:	none → kilo-rc1

aeva black (tenbrae) on 2015-04-13

summary:

- Cleaning results in deadlocks
+ Cleaning may leave nodes locked and require manual intervention to
+ unlock

OpenStack Infra (hudson-openstack) wrote on 2015-04-13: Fix merged to ironic (master)

Reviewed: https://review.openstack.org/172582
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=af3918fb69701fa794cbbe9de9cafc69ac3e936d
Submitter: Jenkins
Branch: master

commit af3918fb69701fa794cbbe9de9cafc69ac3e936d
Author: Josh Gachnang <email address hidden>
Date: Fri Apr 10 15:15:33 2015 -0700

Convert internal RPC continue_node_cleaning to a "cast"

    The agent driver is using RPCs to call back from the driver to the
    conductor asynchronously. When using the RPC.call() method, some nodes
    would end up with stuck locks when using the agent driver during cleaning.

    The agent driver would issue a call() to continue_node_cleaning() after
    either the first heartbeat (from prepare_cleaning) or a heartbeat after
    a clean step had completed. The conductor would attempt to get a lock,
    but would not be able to. The node would retain its locked state (so
    far as I could tell), even after the error. Other nodes would continue
    and complete cleaning just fine. The exception raised by
    continue_node_cleaning() was likely not caught by the agent driver, but
    caught by vendor_passthru() in the conductor as an expected exception.

    Switching to cast() avoids the issue because the errors are not sent
    back to the caller. I didn't experience any more stuck locks with
    this change.

Change-Id: I4dbb04ccb93199bba4e1a1614bc19b70a068a9ea
Closes-Bug: 1442810
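
The commit message's point about errors not being sent back to the caller can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not Ironic's implementation: NodeLocked, node_lock, server_log, continue_cleaning, call and cast are all hypothetical names, and cast() runs the handler inline here purely to keep the example short. It does not reproduce the stuck lock itself, only the difference in where a lock-acquisition failure ends up.

```python
import threading


class NodeLocked(Exception):
    """Hypothetical error raised when a node's lock is already held."""


node_lock = threading.Lock()
server_log = []


def continue_cleaning(node):
    # Toy handler: try to take the node lock without waiting, as a
    # stand-in for the conductor reserving the node.
    if not node_lock.acquire(blocking=False):
        raise NodeLocked(node)
    try:
        return "cleaned " + node
    finally:
        node_lock.release()


def call(handler, *args):
    # call(): the result, or the exception, is delivered to the sender.
    return handler(*args)


def cast(handler, *args):
    # cast(): fire-and-forget. A failure stays on the server side
    # (recorded in server_log here) and the sender gets nothing back.
    try:
        handler(*args)
    except Exception as exc:
        server_log.append(repr(exc))
    return None
```

With node_lock already held, call(continue_cleaning, "node-1") raises NodeLocked in the sender, whereas cast(continue_cleaning, "node-1") returns None and the failure is visible only in server_log.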

Changed in ironic:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2015-04-14

Changed in ironic:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-04-30

Changed in ironic:
milestone:	kilo-rc1 → 2015.1.0
