agent_* drivers could time out mid-deployment

Bug #1475672 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: Medium
Assigned to: Lucas Alvares Gomes
Milestone: 4.0.0

Bug Description

The patch [1] puts the node in provision_state DEPLOYWAIT while the agent running on the machine writes the image onto the local disk. That's the right thing to do, but in the ironic-conductor there's a periodic task [2] that keeps looking at nodes in the DEPLOYWAIT provision_state and checks whether the time elapsed since node.provision_updated_at exceeds CONF.conductor.deploy_callback_timeout. The problem with this approach is that provision_updated_at is only set when the node changes its provision_state. So, if the user image is quite big and takes longer than CONF.conductor.deploy_callback_timeout to be downloaded and written onto the local disk, the ironic-conductor will fail that deployment even if the agent is up and running and doing the deployment correctly.
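The check described above can be sketched roughly as follows. This is an illustration only, not the code from manager.py: the function name deploy_timed_out and the 1800-second default are assumptions, and the timeout argument merely stands in for CONF.conductor.deploy_callback_timeout.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; stands in for
# CONF.conductor.deploy_callback_timeout (in seconds).
DEPLOY_CALLBACK_TIMEOUT = 1800


def deploy_timed_out(provision_updated_at, now, timeout=DEPLOY_CALLBACK_TIMEOUT):
    """Return True if more than `timeout` seconds passed since the last update.

    Hypothetical helper: it only mirrors the comparison described in the
    bug report. A timeout of 0 disables the check.
    """
    if timeout <= 0:
        return False
    return now - provision_updated_at > timedelta(seconds=timeout)


# The node entered DEPLOYWAIT at t0 and the agent is still writing a big image.
t0 = datetime(2015, 7, 17, 12, 0, 0)
# Nothing touched provision_updated_at since t0, so after 31 minutes the
# check reports a timeout although the agent may be working correctly.
still_writing_at = t0 + timedelta(minutes=31)
timed_out = deploy_timed_out(t0, still_writing_at)
```

Because provision_updated_at stays at t0 for the whole image write, the result depends only on how long the write takes, not on whether the agent is alive.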

Ideally we should update the 'provision_updated_at' within the node's heartbeat() so that we can indicate to the Ironic conductor that the agent is still up and running.
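A minimal simulation of this idea, assuming one heartbeat per minute and a conductor check every minute (both intervals are assumptions, as is the function name survives). It is not Ironic code; it only contrasts the behaviour with and without touching provision_updated_at on heartbeat.

```python
from datetime import datetime, timedelta


def survives(deploy_minutes, timeout_s, touch_on_heartbeat, heartbeat_minutes=1):
    """Return True if a deployment of `deploy_minutes` is never timed out.

    Hypothetical model: the conductor check runs once per simulated minute
    and compares the age of `updated_at` against `timeout_s` seconds.
    """
    start = datetime(2015, 7, 17, 12, 0)
    updated_at = start  # set when the node entered DEPLOYWAIT
    for minute in range(1, deploy_minutes + 1):
        now = start + timedelta(minutes=minute)
        # Proposed behaviour: each heartbeat refreshes the timestamp.
        if touch_on_heartbeat and minute % heartbeat_minutes == 0:
            updated_at = now
        # Periodic check: fail the deployment if the timestamp is too old.
        if now - updated_at > timedelta(seconds=timeout_s):
            return False
    return True


# 45-minute image write with a 240-second callback timeout.
without_touch = survives(45, 240, touch_on_heartbeat=False)
with_touch = survives(45, 240, touch_on_heartbeat=True)
```

With these numbers the deployment is failed when the timestamp is never refreshed, and it runs to completion when every heartbeat refreshes it.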

WORKAROUND:

You can disable that periodic task by setting CONF.conductor.deploy_callback_timeout to 0 (zero).
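As a configuration fragment, the workaround would look roughly like this, assuming the option is set in the [conductor] section of ironic.conf (the section name follows from the CONF.conductor prefix used above):

```ini
[conductor]
# 0 disables the deploy callback timeout check
deploy_callback_timeout = 0
```

The ironic-conductor service would need to be restarted for the change to take effect.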

HOW TO REPRODUCE

Set CONF.conductor.deploy_callback_timeout to a low value such as 240 (4 minutes; the ramdisk needs enough time to boot and start heartbeating) and try to deploy a big image.

[1] https://review.openstack.org/#/c/200153/
[2] https://github.com/openstack/ironic/blob/db554f80661e7e8d0a201516c902fef6ffb2871c/ironic/conductor/manager.py#L1134

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
importance: Undecided → Medium
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/203157

Changed in ironic:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/203157
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=4c8bdc29ecb6ec4d77ecb7a64e4de52261a5620b
Submitter: Jenkins
Branch: master

commit 4c8bdc29ecb6ec4d77ecb7a64e4de52261a5620b
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Jul 17 15:30:12 2015 +0100

    Fix the DEPLOYWAIT check for agent_* drivers

    The patch [1] sets the node's provision_state to DEPLOYWAIT while the
    agent is booting and writing the image onto the local disk. But in
    the ironic-conductor we have a periodic task that checks for nodes in
    DEPLOYWAIT state and sees if the deployment timed out based on the node's
    provision_updated_at field and CONF.conductor.deploy_callback_timeout
    configuration option.

    The problem is that prior to this patch the node's provision_updated_at
    field was only updated when we actually changed the provision state of
    the node. So, if we are deploying an image which takes a long time to
    be downloaded and written to the local disk, ironic-conductor could time
    out the deployment even if the agent is still deploying that image.

    This patch fixes this problem by touching the node's provision_updated_at
    field when the node is heartbeating, so that we can indicate to the
    ironic-conductor that the deployment is still running and we shouldn't
    time it out.

    [1] f1929f0155e25c83bafe64c3d235880fc486f323

    Closes-Bug: #1475672
    Change-Id: I9373c42168cc1ffaae212f17b067a6e4b6d862fe

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
milestone: none → 4.0.0
status: Fix Committed → Fix Released