Cleaning may leave nodes locked and require manual intervention to unlock

Bug #1442810 reported by Josh Gachnang
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Josh Gachnang
Milestone: 2015.1.0

Bug Description

When attempting to clean multiple nodes at the same time, the conductors become sluggish and the logs fill with errors about failing to acquire locks at the beginning of conductor.manager.ConductorManager.continue_node_clean. The API becomes less responsive and requests start timing out. I believe the problem is that we are doing an RPC call instead of a cast.

With rpc.call(), I could not reliably get more than 2 or 3 nodes through cleaning at once (some would make it fine, others would deadlock). With rpc.cast(), I got 20 nodes to go through simultaneously without any issues I could see, with each node running 8 steps, each requiring the use of the continue_node_cleaning RPC. Both tests used 2 conductors.
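
To illustrate the difference between the two RPC styles compared above, here is a minimal sketch using only the Python standard library. It is not Ironic or oslo.messaging code: `ToyRPCServer`, `slow_clean_step`, and the node names are hypothetical, and a single worker thread stands in for a conductor. It only shows the general behavior that a call blocks the caller until the handler finishes, while a cast returns as soon as the request is queued.

```python
import queue
import threading
import time


class ToyRPCServer:
    """Hypothetical in-process stand-in for an RPC server (not oslo.messaging)."""

    def __init__(self):
        self._requests = queue.Queue()
        # One daemon worker thread processes requests in FIFO order.
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            func, args, reply = self._requests.get()
            try:
                result = (True, func(*args))
            except Exception as exc:
                result = (False, exc)
            # Only a call() supplies a reply queue; a cast() gets no answer.
            if reply is not None:
                reply.put(result)

    def call(self, func, *args):
        """Queue a request and block until the handler returns or raises."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((func, args, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def cast(self, func, *args):
        """Queue a request and return immediately, with no result."""
        self._requests.put((func, args, None))


def slow_clean_step(node, done):
    """Hypothetical clean step that takes a while to finish."""
    time.sleep(0.2)
    done.append(node)
    return node


server = ToyRPCServer()
done = []

start = time.monotonic()
server.cast(slow_clean_step, "node-1", done)  # returns right away
cast_elapsed = time.monotonic() - start

start = time.monotonic()
result = server.call(slow_clean_step, "node-2", done)  # waits for the handler
call_elapsed = time.monotonic() - start

print(f"cast returned after {cast_elapsed:.3f}s")
print(f"call returned after {call_elapsed:.3f}s with {result!r}")
```

In this toy setup the caller of `call()` is tied up for the whole duration of the handler (plus anything queued ahead of it), which is one plausible way callers could pile up when many nodes heartbeat at once. Whether that is the exact mechanism behind the stuck locks in this bug is not established here.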

As a note, this only occurs in the agent driver currently, but would happen in any driver doing asynchronous cleaning steps.

Josh Gachnang (joshnang)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/172582

Changed in ironic:
assignee: nobody → Josh Gachnang (joshnang)
status: New → In Progress
Changed in ironic:
importance: Undecided → High
milestone: none → kilo-rc1
aeva black (tenbrae)
summary: - Cleaning results in deadlocks
+ Cleaning may leave nodes locked and require manual intervention to
+ unlock
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/172582
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=af3918fb69701fa794cbbe9de9cafc69ac3e936d
Submitter: Jenkins
Branch: master

commit af3918fb69701fa794cbbe9de9cafc69ac3e936d
Author: Josh Gachnang <email address hidden>
Date: Fri Apr 10 15:15:33 2015 -0700

    Convert internal RPC continue_node_cleaning to a "cast"

    The agent driver is using RPCs to call back from the driver to the
    conductor asynchronously. When using the RPC.call() method, some nodes
    would end up with stuck locks when using the agent driver during cleaning.

    The agent driver would issue a call() to continue_node_cleaning() after
    either the first heartbeat (from prepare_cleaning) or a heartbeat after
    a clean step had completed. The conductor would attempt to get a lock,
    but would not be able to. The node would retain its locked state (so
    far as I could tell), even after the error. Other nodes would continue
    and complete cleaning just fine. The exception raised by
    continue_node_cleaning() was likely not caught by the agent driver, but
    caught by vendor_passthru() in the conductor as an expected exception.

    Switching to cast() avoids the issue because the errors are not sent
    back to the caller. I didn't experience any more stuck locks with
    this change.

    Change-Id: I4dbb04ccb93199bba4e1a1614bc19b70a068a9ea
    Closes-Bug: 1442810
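
The commit message above says that switching to cast() helps because errors are not sent back to the caller. The following is a rough standard-library sketch of that one point only, not the actual conductor code: `NodeLocked` here is a locally defined stand-in exception, and `continue_cleaning`, `call`, `cast`, `node_locks`, and `server_errors` are hypothetical names. A thread pool plays the role of the conductor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NodeLocked(Exception):
    """Stand-in for a lock-acquisition error raised on the server side."""


node_locks = {"node-1": threading.Lock()}
server_errors = []  # what the server side would log for failed casts
executor = ThreadPoolExecutor(max_workers=2)


def continue_cleaning(node):
    """Hypothetical handler: needs the node lock before doing any work."""
    lock = node_locks[node]
    if not lock.acquire(blocking=False):
        raise NodeLocked(f"{node} is locked by another task")
    try:
        return f"{node}: next clean step started"
    finally:
        lock.release()


def call(func, *args):
    """Run the handler remotely and re-raise its exception in the caller."""
    return executor.submit(func, *args).result()


def cast(func, *args):
    """Run the handler remotely; failures stay on the server side."""
    def run():
        try:
            func(*args)
        except Exception as exc:
            server_errors.append(exc)
    executor.submit(run)
    return None


node_locks["node-1"].acquire()  # simulate another task holding the lock
try:
    call(continue_cleaning, "node-1")
    call_error = None
except NodeLocked as exc:
    call_error = exc  # with call(), the caller has to deal with the error

cast_result = cast(continue_cleaning, "node-1")  # caller sees nothing
executor.shutdown(wait=True)  # wait for the cast to be processed
node_locks["node-1"].release()

print("call raised:", repr(call_error))
print("cast returned:", cast_result, "| recorded server-side:", server_errors)
```

With the cast, the lock failure is still recorded on the server side, but the caller never has to handle it, which matches the behavior the commit message describes.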

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-rc1 → 2015.1.0