Failure to set root password leaves instance in ERROR

Bug #1061045 reported by Johannes Erdfelt
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Amir Sadoughi

Bug Description

If the agent isn't running on an instance, then setting the root password will time out.

The API server will return a 500 error because of an RPC timeout. This should return something other than 500.

Eventually the compute server will time out as well and leave the instance in ERROR. The instance is still running fine, so ERROR seems like an incorrect state to leave it in.

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

I think both problems are complicated by the retries that happen in the compute layer. The xenapi driver makes 10 retries with a 30-second timeout each, so this could take 300 seconds in total, which is longer than the RPC timeout.

The retry logic seems unnecessary and appears to be a result of legacy code.

If the overall timeout were something reasonable, then an error could be returned synchronously to the client, instead of moving the instance to ERROR just so that an asynchronous error can be made available.
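
The timing argument in this comment can be checked with a few lines of arithmetic. This is only a sketch: the retry count and per-attempt timeout come from the comment, while the 60-second RPC timeout is an assumption (the thread never states the configured value), and the function names are hypothetical.

```python
# Figures taken from the comment: 10 retries, 30-second timeout per attempt.
AGENT_RETRIES = 10
AGENT_TIMEOUT_SECONDS = 30

# Assumption: the thread does not give the RPC timeout; 60 seconds is a guess.
RPC_TIMEOUT_SECONDS = 60


def worst_case_seconds(retries, timeout_seconds):
    """Worst case: every attempt runs until its own timeout expires."""
    return retries * timeout_seconds


def caller_gives_up_first(retries, timeout_seconds, rpc_timeout_seconds):
    """True if the RPC caller times out before the retries are exhausted."""
    return worst_case_seconds(retries, timeout_seconds) > rpc_timeout_seconds


total = worst_case_seconds(AGENT_RETRIES, AGENT_TIMEOUT_SECONDS)
print(total)  # 300
print(caller_gives_up_first(AGENT_RETRIES, AGENT_TIMEOUT_SECONDS,
                            RPC_TIMEOUT_SECONDS))  # True
```

With any RPC timeout below 300 seconds the API side gives up first, which matches the 500 error described in the report.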

Changed in nova:
status: New → Confirmed
importance: Undecided → High
tags: added: xenserver
Revision history for this message
Chris Behrens (cbehrens) wrote :

This must be referring to setting the admin password later, after a successful build? set_admin_password in compute_api does a call(), but building a new instance is a cast and wouldn't return a 500 from the API for a failed root password change.
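
The call/cast distinction in this comment can be illustrated with a toy worker. This is a minimal sketch and not nova's RPC layer: `FakeComputeWorker`, `slow_set_password`, and the timings are all hypothetical. A `cast` returns immediately and never reports the outcome, while a `call` blocks for a reply and fails if none arrives in time.

```python
import queue
import threading
import time


class FakeComputeWorker:
    """Hypothetical stand-in for a compute service; not nova's real RPC code."""

    def __init__(self, handler):
        self._handler = handler
        self._inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Process requests one at a time, replying only when a reply box exists.
        while True:
            reply_box, args = self._inbox.get()
            try:
                result = ("ok", self._handler(*args))
            except Exception as exc:
                result = ("error", exc)
            if reply_box is not None:
                reply_box.put(result)

    def cast(self, *args):
        """Fire and forget: the caller never sees the outcome."""
        self._inbox.put((None, args))

    def call(self, *args, timeout):
        """Block for a reply; raise TimeoutError if none arrives in time."""
        reply_box = queue.Queue(maxsize=1)
        self._inbox.put((reply_box, args))
        try:
            status, value = reply_box.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no reply within %s seconds" % timeout)
        if status == "error":
            raise value
        return value


def slow_set_password(instance):
    time.sleep(0.5)  # simulates an agent that does not answer in time
    return instance


worker = FakeComputeWorker(slow_set_password)
worker.cast("test-instance")  # returns at once, outcome is never reported
try:
    worker.call("test-instance", timeout=0.05)
except TimeoutError:
    print("call timed out")
```

Only the `call` path can surface a timeout to the API client, which is why the 500 appears when changing the password on an existing instance and not during a build.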

Revision history for this message
Johannes Erdfelt (johannes.erdfelt) wrote :

Yes, the 500 error is only when setting the root password after an instance is already built.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/19854

Changed in nova:
assignee: nobody → Amir Sadoughi (amir-sadoughi)
status: Confirmed → In Progress
Revision history for this message
Amir Sadoughi (amir-sadoughi) wrote :

I wanted to document the test procedure I used to reproduce the bug and test the bugfix:

1. Start an instance with `nova boot test-instance` in a Xen/XCP environment from the compute node.
2. Run `nova root-password test-instance`.
3. Before hitting [Enter] on the second password prompt, kill the nova-agent on 'test-instance'.
4. Observe the timeout.
5. a. Without the bugfix: observe 'test-instance' in ERROR state and a 500 error.
    b. With the bugfix: observe 'test-instance' not in ERROR state and a 501 error.

tags: added: folsom-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/19854
Committed: http://github.com/openstack/nova/commit/4dc160bf91d21b42363e5187adb96e59f95da717
Submitter: Jenkins
Branch: master

commit 4dc160bf91d21b42363e5187adb96e59f95da717
Author: Amir Sadoughi <email address hidden>
Date: Wed Jan 16 13:15:14 2013 -0600

    Removes retry of set_admin_password

    * An RPC timeout may occur if an agent is missing and set_admin_password is
      invoked. This causes 500 errors in the OpenStack API.
    * Implemented a 501 error in API if the password set fails.
    * Modified xenapi agent to use NotImplementedError instead of Exception in
      set_admin_password.
    * Updated test code around set_admin_password to accept different exceptions.
    * Fixes bug 1061045

    Change-Id: If7fab56c20f12e0490f4774e00004ed1d94242b9
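
The behaviour described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions and not the merged patch: `FakeAgent` and `change_password` are hypothetical names, the error text is made up, and 202 is only a placeholder for the success status. What it takes from the commit message is a single attempt with no retry, a `NotImplementedError` from the agent layer, and a 501 returned to the client.

```python
class FakeAgent:
    """Hypothetical guest agent; `running` mimics whether nova-agent is alive."""

    def __init__(self, running):
        self.running = running

    def set_admin_password(self, new_password):
        # Single attempt, no retry loop, as the commit title describes.
        if not self.running:
            # The commit switches the xenapi agent from Exception to
            # NotImplementedError; the message text here is made up.
            raise NotImplementedError("agent did not respond")


def change_password(agent, new_password):
    """Return a (status_code, message) pair for a password change request."""
    try:
        agent.set_admin_password(new_password)
    except NotImplementedError as exc:
        # Reported synchronously to the client instead of a 500.
        return 501, str(exc)
    # Assumption: 202 stands in for whatever success code the API uses.
    return 202, ""


print(change_password(FakeAgent(running=False), "s3cret"))  # (501, 'agent did not respond')
print(change_password(FakeAgent(running=True), "s3cret"))   # (202, '')
```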

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-3 → 2013.1