Live migration failure in API doesn't revert task_state to None

Bug #1276214 reported by Loganathan Parthipan
This bug affects 6 people
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | Medium | Maciej Szankin |
Mitaka | Fix Released | Undecided | Lee Yarwood |

Bug Description

If the API times out on an RPC while processing a migrate_server request, it does not revert the task_state to NULL, either before or after sending the error response to the user. This can block further API operations on the VM and leave an otherwise healthy VM in a non-operable state, with the possible exception of a delete.

This is one possible reproducer. I'm not sure it always holds, and I'd appreciate it if someone else could confirm it.

1. Somehow make RPC requests hang
2. Issue a live migration request
3. The call should return an HTTP error (409 perhaps)
4. Check the VM. It should be in a good state, but with the task_state stuck in 'migrating'
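
The steps above can be modelled with a toy sketch. This is not nova code: `Instance`, `MessagingTimeout`, `hanging_rpc_call` and `live_migrate` are all hypothetical stand-ins, and the only assumption carried over from the report is that task_state is set before the RPC is issued and nothing resets it when the RPC times out.

```python
class MessagingTimeout(Exception):
    """Stand-in for an RPC timeout; hypothetical, not the real RPC library class."""


class Instance:
    """Toy instance record holding only the two fields discussed in this report."""

    def __init__(self):
        self.vm_state = "active"
        self.task_state = None


def hanging_rpc_call(instance):
    # Simulates step 1: the RPC never answers, so the caller times out.
    raise MessagingTimeout("no reply received")


def live_migrate(instance, rpc_call):
    # Assumption: the API marks the instance as migrating before the RPC.
    instance.task_state = "migrating"
    # If this call raises, nothing here resets task_state.
    rpc_call(instance)


def reproduce():
    instance = Instance()
    try:
        live_migrate(instance, hanging_rpc_call)  # step 2
    except MessagingTimeout:
        # Step 3: this is where the user would receive an error response.
        pass
    return instance  # step 4: inspect vm_state and task_state
```

Calling `reproduce()` gives back an instance whose vm_state is still "active" while task_state remains "migrating", which is the stuck state described in step 4.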

Revision history for this message
John Garbutt (johngarbutt) wrote :

We should either roll back and reset to ACTIVE or, if we can't roll back, put the VM into ERROR.

I have recently made sure we now record instance faults, so there is a tiny bit more fault tracking.

tags: added: compute live-migrate
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
John Garbutt (johngarbutt) wrote :

I have marked this as medium, because this kind of error stops a feature feeling "solid".

Changed in nova:
assignee: nobody → Loganathan Parthipan (parthipan)
Revision history for this message
John Garbutt (johngarbutt) wrote :

Certain failures do get handled correctly, see here:
https://github.com/openstack/nova/blob/c85a17447e54ebba192c0dcab1222760319cbe46/nova/scheduler/manager.py#L109

Need to do something similar for when there are RPC timeout issues in other places in the code.

Revision history for this message
Loganathan Parthipan (parthipan) wrote :

I can look at trapping MessagingTimeout. I'm not sure what to do when the timeout comes from the conductor itself; trying to set the task_state in that case may not succeed.
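
A minimal sketch of the concern raised here, under stated assumptions: the timeout is trapped and the task_state is reverted, but the revert goes through a `save` callable that may itself time out. All names (`live_migrate_with_revert`, `direct_save`, `make_flaky_save`, the status strings) are hypothetical and only illustrate the control flow, not nova's actual code paths.

```python
class MessagingTimeout(Exception):
    """Stand-in for an RPC timeout (hypothetical)."""


def direct_save(record, task_state):
    # Persists the new task_state without any failure.
    record["task_state"] = task_state


def make_flaky_save(succeed_times):
    """Build a save function that times out after `succeed_times` successful calls."""
    calls = {"count": 0}

    def save(record, task_state):
        calls["count"] += 1
        if calls["count"] > succeed_times:
            raise MessagingTimeout("save timed out")
        record["task_state"] = task_state

    return save


def live_migrate_with_revert(record, rpc_call, save):
    save(record, "migrating")
    try:
        rpc_call()
    except MessagingTimeout:
        try:
            # Try to put task_state back to None after the timeout.
            save(record, None)
        except MessagingTimeout:
            # The revert could not be persisted; the record stays 'migrating'.
            return "revert-failed"
        return "reverted"
    return "started"
```

With `direct_save` a timed-out call ends in "reverted"; with `make_flaky_save(1)` the first save succeeds, the revert times out, and the record is left in 'migrating'. The sketch does not answer what should happen in the "revert-failed" case.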

Revision history for this message
John Garbutt (johngarbutt) wrote :

Hmm, I think the bit that needs attention (generally) is in the conductor manager. Look at the second except block: it should be more like what we have in the scheduler manager (linked above).

Changed in nova:
assignee: Loganathan Parthipan (parthipan) → Davanum Srinivas (DIMS) (dims-v)
assignee: Davanum Srinivas (DIMS) (dims-v) → nobody
Changed in nova:
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → Bartosz Fic (bartosz-fic)
Revision history for this message
Bartosz Fic (bartosz-fic) wrote :

I've noticed that making the call hang in other places in the conductor leads to the same behaviour.

For example:
- /nova/conductor/rpcapi.py (server_migrate method)
- /nova/conductor/manager.py (_live_migrate method)
- /nova/conductor/tasks/live_migrate.py (execute method)

After the live migration attempt, we get an instance with these attributes:
a) vm_state: active
b) task_state: migrating

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/168916

Changed in nova:
status: Triaged → In Progress
tags: added: liberty-rc-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

This bug was reported over 18 months ago, so it is not a regression in liberty. I'm marking it as liberty-backport-potential so it doesn't block rc1.

tags: added: liberty-backport-potential
removed: liberty-rc-potential
Paul Murray (pmurray)
tags: added: live-migration
removed: live-migrate
Changed in nova:
assignee: Bartosz Fic (bartosz-fic) → John Garbutt (johngarbutt)
Matt Riedemann (mriedem)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Bartosz Fic (bartosz-fic)
Changed in nova:
assignee: Bartosz Fic (bartosz-fic) → Pawel Koniszewski (pawel-koniszewski)
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → nobody
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)
status: Confirmed → In Progress
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → Maciej Szankin (mszankin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/168916
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f2a1f00829e849e78f850a73489864e57cbd86b3
Submitter: Jenkins
Branch: master

commit f2a1f00829e849e78f850a73489864e57cbd86b3
Author: Maciej Szankin <email address hidden>
Date: Fri Feb 26 11:18:51 2016 +0100

    Live migration failure in API leaves VM in MIGRATING state

    When nova-api calls nova-conductor a RPC MessagingTimeout might
    occur. In such case we shouldn't leave VM in MIGRATING state. Possible
    scenarios are:

    * nova-conductor received message but failed to respond, no additional
    exceptions raised - live migration will start, VM will be moved to
    destination host
    * nova-conductor received message but failed to respond, additional
    exception raised (e.g., LibvirtError) - LM will not start
    * nova-api couldn't reach nova-conductor - LM will not start

    Because we can't predict in API layer what happened below, this patch
    writes instance fault to database when MessagingTimeout is caught.

    Co-Authored-By: Pawel Koniszewski <email address hidden>
                    Bartosz Fic <email address hidden>
    Closes-Bug: #1276214
    Change-Id: Id800e925fbb689d20e7907b698b67c92fd3da979
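
A rough sketch of the reasoning in the commit message above, not the patch itself. Two of the listed scenarios are modelled as hypothetical callables (`reply_lost`, `never_delivered`) that raise the same stand-in `MessagingTimeout`, so the caller cannot tell them apart and only records a fault before re-raising. The fault list, the dict-based instance and the function names are assumptions for illustration; the commit message does not say whether task_state is changed, so the sketch leaves it alone.

```python
class MessagingTimeout(Exception):
    """Stand-in for an RPC timeout (hypothetical, not the real library class)."""


def reply_lost(backend):
    # The conductor received the request and started the migration;
    # only the reply is lost.
    backend["migration_started"] = True
    raise MessagingTimeout("timed out waiting for a reply")


def never_delivered(backend):
    # The request never reached the conductor, so nothing starts.
    raise MessagingTimeout("timed out waiting for a reply")


def live_migrate(instance, backend, rpc_call, faults):
    instance["task_state"] = "migrating"
    try:
        rpc_call(backend)
    except MessagingTimeout as exc:
        # The API layer cannot tell which scenario occurred, so it only
        # records a fault for later inspection and re-raises the error.
        faults.append({"instance_uuid": instance["uuid"], "message": str(exc)})
        raise
```

In both scenarios the caller sees an identical exception and an identical fault entry, while the backend state differs. That is the case the commit message describes as not predictable from the API layer.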

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/304746

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/nova 14.0.0.0b1

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.

Matt Riedemann (mriedem)
tags: removed: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/304746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0c4fc7812b349ccdd423168f568fe694325a8e74
Submitter: Jenkins
Branch: stable/mitaka

commit 0c4fc7812b349ccdd423168f568fe694325a8e74
Author: Maciej Szankin <email address hidden>
Date: Fri Feb 26 11:18:51 2016 +0100

    Live migration failure in API leaves VM in MIGRATING state

    When nova-api calls nova-conductor a RPC MessagingTimeout might
    occur. In such case we shouldn't leave VM in MIGRATING state. Possible
    scenarios are:

    * nova-conductor received message but failed to respond, no additional
    exceptions raised - live migration will start, VM will be moved to
    destination host
    * nova-conductor received message but failed to respond, additional
    exception raised (e.g., LibvirtError) - LM will not start
    * nova-api couldn't reach nova-conductor - LM will not start

    Because we can't predict in API layer what happened below, this patch
    writes instance fault to database when MessagingTimeout is caught.

    Co-Authored-By: Pawel Koniszewski <email address hidden>
                    Bartosz Fic <email address hidden>
    Closes-Bug: #1276214
    Change-Id: Id800e925fbb689d20e7907b698b67c92fd3da979
    (cherry picked from commit f2a1f00829e849e78f850a73489864e57cbd86b3)

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 13.1.1

This issue was fixed in the openstack/nova 13.1.1 release.
