A timed out cleaning cannot be retried successfully

Bug #1590146 reported by Julia Kreger
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Ironic | Fix Released | High | Julia Kreger |

Bug Description

A user of the manual cleaning process can become stuck in a cleaning loop of sorts if IPA silently fails or an external influence causes the cleaning process to time out.

Presently, when cleaning times out, the error handler for cleaning is not called, which leaves clean_step preserved. If a user attempts to retry cleaning, the presence of the clean_step entry leaves them stuck in a loop in which they are unable to refresh the steps they wish to have performed. This means that if a user-submitted manual cleaning JSON document is part of the cause, the operator must manually clean up the clean_step and driver_internal_info data to break out of the loop.

Note: stable/mitaka links are used below. As of the filing of this bug, the same behavior is present in the master branch, and it was reproducible in a test environment with manual cleaning on both master branch and stable/mitaka packages.

Sequence of events:

The _check_cleanwait_timeouts code does not invoke any error handling and simply fails the node state:
https://github.com/openstack/ironic/blob/stable/mitaka/ironic/conductor/manager.py#L1378

The result is that node.clean_step is not purged, and neither is the clean_steps key/value pair in node.driver_internal_info.

Upon re-invoking cleaning, the agent driver (via https://github.com/openstack/ironic/blob/stable/mitaka/ironic/conductor/manager.py#L922) indicates that the node should be set to states.CLEANWAIT, which results in the task going into a wait state via https://github.com/openstack/ironic/blob/stable/mitaka/ironic/conductor/manager.py#L945

The node powers up.

The agent driver then heartbeats. If node.clean_step is not empty, the heartbeat results in continue_cleaning being called at https://github.com/openstack/ironic/blob/stable/mitaka/ironic/drivers/modules/agent_base_vendor.py#L450
continue_cleaning checks whether any commands are present. After a timeout there are most likely none, so the call returns before taking any additional action at https://github.com/openstack/ironic/blob/stable/mitaka/ironic/drivers/modules/agent_base_vendor.py#L305

Essentially, no further action takes place except heartbeat operations. If node.clean_step was empty, self._refresh_clean_steps(task) would have been called, cleaning steps would have been set based upon the task, and the world would have been a happier place.
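
The sequence above can be illustrated with a small toy model. This is only a sketch of the described behavior, not Ironic's code: the Node class, the on_heartbeat function, and the returned labels are all hypothetical names made up for illustration, and the step name is just an example value.

```python
# Toy model of the heartbeat path described above; NOT Ironic's actual code.
# Node, on_heartbeat and the returned labels are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Node:
    provision_state: str = "clean wait"
    clean_step: dict = field(default_factory=dict)
    driver_internal_info: dict = field(default_factory=dict)


def on_heartbeat(node, agent_commands):
    """Return a label for what a heartbeat does in this simplified model."""
    if not node.clean_step:
        # No step recorded: steps are refreshed and cleaning starts over.
        node.driver_internal_info["clean_steps"] = ["refreshed"]
        return "refresh steps and start cleaning"
    if not agent_commands:
        # A stale step is recorded, but the freshly booted agent has no
        # command history, so the continue path returns without acting.
        return "no action"
    return "process result of last command"


# Node left behind by a timed-out cleaning: clean_step was never purged.
stale = Node(clean_step={"interface": "raid", "step": "create_configuration"})
# Node whose cleaning data was cleared before the retry.
fresh = Node()

print(on_heartbeat(stale, agent_commands=[]))
print(on_heartbeat(fresh, agent_commands=[]))
```

In this model the stale node never leaves the wait state on its own, which matches the "only heartbeats" symptom, while the node with an empty clean_step proceeds normally.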

Steps to reproduce:

1) Initiate manual cleaning, such as a RAID configuration process.
2) Once the agent has booted and started the process, manually power off the node or kill the IPA agent before the RAID step has completed.
3) Allow the timeout to fail the node.

Possible fix:

In the _check_cleanwait_timeouts method at https://github.com/openstack/ironic/blob/stable/mitaka/ironic/conductor/manager.py#L1378, add a callback parameter pointing to cleaning_error_handler on the _fail_if_in_state call. This technically changes the behavior, although the present behavior appears to be broken for manual cleaning. An alternative is to invoke the callback only when we detect that manual cleaning is being used.
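
A rough sketch of the callback idea, under stated assumptions: the function below is a simplified stand-in for a timeout check, not the real _fail_if_in_state, and its name, parameters, the Node class, and purge_cleaning_state are all hypothetical. It only shows how an optional callback lets the timeout path purge stale cleaning data while leaving the existing behavior unchanged when no callback is given.

```python
# Simplified stand-in for a timeout check with an optional cleanup callback.
# All names and signatures here are hypothetical, not Ironic's API.
from dataclasses import dataclass, field


@dataclass
class Node:
    provision_state: str
    provision_updated_at: float
    clean_step: dict = field(default_factory=dict)
    driver_internal_info: dict = field(default_factory=dict)


def purge_cleaning_state(node):
    """Drop the data that would otherwise short-circuit a cleaning retry."""
    node.clean_step = {}
    node.driver_internal_info.pop("clean_steps", None)


def fail_if_timed_out(node, wait_state, fail_state, timeout, now, callback=None):
    """Fail a node stuck in wait_state longer than timeout seconds.

    Returns True if the node was failed. The callback, when given, runs
    after the state change and receives only the node.
    """
    if node.provision_state != wait_state:
        return False
    if now - node.provision_updated_at < timeout:
        return False
    node.provision_state = fail_state
    if callback is not None:
        callback(node)
    return True


node = Node("clean wait", 0.0,
            clean_step={"step": "create_configuration"},
            driver_internal_info={"clean_steps": ["create_configuration"]})
fail_if_timed_out(node, "clean wait", "clean failed", timeout=60, now=100.0,
                  callback=purge_cleaning_state)
print(node.provision_state, node.clean_step, node.driver_internal_info)
```

Restricting the cleanup to manual cleaning, as the alternative suggests, would amount to passing the callback only in that case.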

Jay Faulkner (jason-oldos) wrote :

AFAICT, this affects all forms of cleaning, so changed the title.

summary: - A timed out manual cleaning cannot be retried successfuly
+ A timed out cleaning cannot be retried successfully
Changed in ironic:
status: New → Confirmed
importance: Undecided → High
Ruby Loo (rloo)
Changed in ironic:
assignee: nobody → Julia Kreger (juliaashleykreger)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327403

Changed in ironic:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/327403
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=42e2685395d393c0d4cd840f7c23900df7300caa
Submitter: Jenkins
Branch: master

commit 42e2685395d393c0d4cd840f7c23900df7300caa
Author: Julia Kreger <email address hidden>
Date: Wed Jun 8 18:39:09 2016 -0400

    Add cleanwait timeout cleanup process

    Previously, if a node in a cleaning state timed out, the timeout
    process would not purge certain items from the node's configuration
    which resulted in a short circuiting of the logic cleaning being
    retried. This was a result of the node clean_step configuration
    not being purged upon a timeout occuring.

    This change adds a wrapper method around the cleaning failure
    error handler to allow the _fail_if_in_state method to call
    the error handler, since error handler syntax is not uniform
    and the _fail_if_in_state cannot pass arguments.

    It also changes the cleaning error handler to permit the error
    handler to delete the node clean_step, and cleaning related
    driver_internal_info configuration from a node in the event
    the node in in CLEANFAIL state.

    Change-Id: I9ee5c0b385648c9b7e1d330d5d1af9b2c486a436
    Closes-Bug: #1590146
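
The wrapper approach described in the commit message might look roughly like the following. This is an illustrative sketch only, not the merged change: the handler signature, the set_fail_state flag, the wrapper name, and the dict-based node are assumptions chosen to show why a single-argument wrapper is useful when the caller can pass nothing but the node.

```python
# Illustrative only; hypothetical stand-ins, not the code from the commit.
CLEANFAIL = "clean failed"


def cleaning_error_handler(node, msg, set_fail_state=True):
    """Record the error and, once in CLEANFAIL, drop stale cleaning data."""
    node["last_error"] = msg
    if set_fail_state:
        node["provision_state"] = CLEANFAIL
    if node["provision_state"] == CLEANFAIL:
        # Purging here is what allows a later retry to refresh its steps.
        node["clean_step"] = {}
        node["driver_internal_info"].pop("clean_steps", None)


def cleanup_after_timeout(node):
    """Single-argument wrapper for callers that can pass only the node."""
    cleaning_error_handler(node, "Timeout reached while cleaning the node.",
                           set_fail_state=False)


node = {
    "provision_state": CLEANFAIL,  # already failed by the timeout check
    "clean_step": {"step": "create_configuration"},
    "driver_internal_info": {"clean_steps": ["create_configuration"]},
    "last_error": None,
}
cleanup_after_timeout(node)
print(node)
```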

Changed in ironic:
status: In Progress → Fix Released
Thierry Carrez (ttx) wrote : Fix included in openstack/ironic 6.0.0

This issue was fixed in the openstack/ironic 6.0.0 release.
