bare metal node partitioning does not handle errors well

Bug #1088655 reported by Robert Collins
Affects:      OpenStack Compute (nova)
Status:       Fix Released
Importance:   High
Assigned to:  Vish Ishaya
Milestone:    2013.1

Bug Description

When bare metal partitioning fails, operators currently need to look in the log to determine the cause. But failures can be caused by user requests (such as too-large swap sizes), so users should get a meaningful error status set on their instance.

Tags: baremetal
Revision history for this message
aeva black (tenbrae) wrote :

I suspect this happens because of the division between nova-compute and nova-baremetal-deploy-helper. n-cpu calls driver.spawn() and marks the instance as ACTIVE once the machine powers on, while a separate process (nova-baremetal-deploy-helper) does the partitioning and image deployment. Merging these processes would probably resolve this bug.
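
A minimal sketch of the split described above; the helper names are hypothetical stand-ins, not the actual nova baremetal driver code:

    # Pre-fix flow: spawn() returns at power-on, so failures in the
    # separate deploy process cannot reach it. Names are hypothetical.
    def power_on(node):
        """Stand-in for the IPMI power-on call."""
        print("powering on %s" % node)

    def spawn(node):
        """Return as soon as the machine powers on.

        Partitioning and image deployment happen later in the separate
        nova-baremetal-deploy-helper process, so a failure there cannot
        propagate back through this call, and the nova instance has
        already been marked ACTIVE.
        """
        power_on(node)
        # no wait here: deploy-helper runs independently from this point

    spawn("node-1")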

Revision history for this message
aeva black (tenbrae) wrote :

Another approach would be for nova-baremetal-deploy-helper to record its progress in the nova_bm.bm_deployments table.
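
A minimal sketch of what such progress recording might look like, assuming a hypothetical SQLAlchemy model; the actual nova_bm schema, column names, and state values may differ:

    # Hypothetical sketch of progress tracking in nova_bm.bm_deployments;
    # the columns and states are assumptions, not the real schema.
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class BMDeployment(Base):
        """One row per deployment attempt."""
        __tablename__ = 'bm_deployments'
        id = Column(Integer, primary_key=True)
        instance_uuid = Column(String(36))
        state = Column(String(16))   # e.g. 'deploying', 'done', 'failed'
        error = Column(String(255))  # failure reason to surface to the user

    def record_progress(session, deployment_id, state, error=None):
        """Called by the deploy helper; nova-compute polls the same row."""
        row = session.get(BMDeployment, deployment_id)
        row.state = state
        row.error = error
        session.commit()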

Changed in nova:
assignee: nobody → Devananda (devananda)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/21564

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Devananda van der Veen (devananda) → Vish Ishaya (vishvananda)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/21564
Committed: http://github.com/openstack/nova/commit/48439b98a1a7ac2dded34c8899918773f70667f2
Submitter: Jenkins
Branch: master

commit 48439b98a1a7ac2dded34c8899918773f70667f2
Author: Devananda van der Veen <email address hidden>
Date: Fri Feb 8 20:36:19 2013 -0800

    Wait for baremetal deploy inside driver.spawn

    Previously, baremetal driver.spawn returned as soon as the
    machine power turned on, but before the user-image was deployed to the
    hardware node, and long before the node was available on the network.
    This meant the nova instance was marked as ACTIVE before provisioning
    had actually finished. If the deploy failed and the baremetal node was
    set to an ERROR state, the nova instance could still be left as ACTIVE
    and the user was never informed of the error.

    This patch introduces a LoopingCall to monitor the deployment status in
    the baremetal database. As the deployment is performed by
    nova-baremetal-deploy-helper, the database record is updated. Once the
    deployment is complete, driver.spawn() sets the baremetal node status
    and the nova instance status is also set properly. If an error occurs
    during the deployment, an exception is raised within driver.spawn()
    allowing nova to follow the normal cleanup and notify paths.

    This also allows the baremetal PXE driver to delete cached image files
    when a baremetal deployment fails.

    Fixes bug 1088655.

    Change-Id: I4feefd462fd956c9780995ec8b05b13e78278c8b
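
A minimal sketch of the polling pattern the patch describes, written against the modern oslo.service LoopingCall API (nova at the time carried its own copy of the loopingcall module); _get_deploy_state, the exception class, and the state names are hypothetical stand-ins for the nova_bm.bm_deployments lookup:

    # Sketch only: the real driver reads nova_bm.bm_deployments and
    # uses nova's own exception and state classes.
    from oslo_service import loopingcall

    class InstanceDeployFailure(Exception):
        """Raised inside spawn() so nova runs its cleanup/notify paths."""

    def _get_deploy_state(deployment_id):
        """Stand-in for reading the row nova-baremetal-deploy-helper updates."""
        return 'done'

    def wait_for_deploy(deployment_id):
        def _poll():
            state = _get_deploy_state(deployment_id)
            if state == 'done':
                # Deployment finished; stop the loop so spawn() can
                # return and the instance becomes ACTIVE.
                raise loopingcall.LoopingCallDone()
            if state == 'failed':
                # Propagates out of wait() below, and from there out
                # of driver.spawn().
                raise InstanceDeployFailure('baremetal deploy failed')

        timer = loopingcall.FixedIntervalLoopingCall(_poll)
        timer.start(interval=1).wait()

    wait_for_deploy(42)

Raising out of the polling function is what lets a deploy failure surface through driver.spawn(), so nova's normal cleanup and notification paths run instead of the instance being left ACTIVE.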

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-3 → 2013.1