test_create_image_with_reboot fails with InstanceInvalidState in gate-nova-python*

Bug #1266611 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann

Bug Description

Looks like an intermittent failure:

http://logs.openstack.org/25/64725/4/check/gate-nova-python27/e603e9e/testr_results.html.gz

2014-01-06 21:49:45.870 | Traceback (most recent call last):
2014-01-06 21:49:45.870 | File "nova/tests/api/ec2/test_cloud.py", line 2343, in test_create_image_with_reboot
2014-01-06 21:49:45.870 | self._do_test_create_image(False)
2014-01-06 21:49:45.871 | File "nova/tests/api/ec2/test_cloud.py", line 2316, in _do_test_create_image
2014-01-06 21:49:45.871 | no_reboot=no_reboot)
2014-01-06 21:49:45.871 | File "nova/api/ec2/cloud.py", line 1709, in create_image
2014-01-06 21:49:45.872 | name)
2014-01-06 21:49:45.872 | File "nova/compute/api.py", line 161, in inner
2014-01-06 21:49:45.872 | method=f.__name__)
2014-01-06 21:49:45.873 | InstanceInvalidState: Instance b1d4d924-069c-409c-bbdb-4f0478526057 in task_state powering-off. Cannot snapshot_volume_backed while the instance is in this state.

Matt Riedemann (mriedem)
summary: - test_create_image_with_reboot fails in gate-nova-python27
+ test_create_image_with_reboot fails with InstanceInvalidState in gate-
+ nova-python27
Revision history for this message
Matt Riedemann (mriedem) wrote :
summary: test_create_image_with_reboot fails with InstanceInvalidState in gate-
- nova-python27
+ nova-python*
Revision history for this message
Matt Riedemann (mriedem) wrote :

elastic-recheck query patch is here: https://review.openstack.org/#/c/65344/

Joe Gordon (jogo)
Changed in nova:
status: New → Confirmed
status: Confirmed → Won't Fix
status: Won't Fix → Triaged
status: Triaged → Confirmed
status: Confirmed → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: ec2
Revision history for this message
Matt Riedemann (mriedem) wrote :

Opened bug 1291620 for the other failure.

Matt Riedemann (mriedem)
tags: added: db
tags: removed: db
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks related to bug 1339235.

Changed in nova:
importance: Undecided → Medium
tags: added: compute
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/108014

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/108014
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1792eb4c1d10e9a192142ce7e20d37871d916a
Submitter: Jenkins
Branch: master

commit aa1792eb4c1d10e9a192142ce7e20d37871d916a
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 2 12:11:55 2014 -0700

    Stop stack tracing when trying to auto-stop a stopped instance

    Commit cc5388bbe81aba635fb757e202d860aeed98f3e8 added locks to
    stop_instance and the _sync_power_states periodic task to try and fix a
    race between stopping the instance via the API where the task_state is
    set to powering-off, and the periodic task seeing the instance
    power_state as shutdown in _sync_instance_power_state and calling the
    stop API again, at which point the task_state is already None from the
    first stop API call and we get an UnexpectedTaskStateError.

    The handle_lifecycle_event method is getting callbacks from the libvirt
    driver on state changes on the VM and calling the
    _sync_instance_power_state method which may try to stop the instance
    asynchronously, and lead to UnexpectedTaskStateError if the instance is
    already stopped by the time it gets the lock and the task_state has
    changed.

    Attempting to lock in handle_lifecycle_event just moves the race around
    so this change adds logic to stop_instance such that if the instance
    says it's active but the virt driver says it's not running, then we add
    None to the expected_task_state so we don't stacktrace on
    instance.save().

    An alternative and/or additional change to this would be doing a call
    rather than a cast when _sync_instance_power_state calls the stop API
    but in some previous testing it doesn't appear to make a significant
    difference in the race found when we hit the stop_instance method.

    Adds a bunch of debug logging since this code is inherently racey and
    is needed when looking at failures around these operations.

    Closes-Bug: #1339235
    Closes-Bug: #1266611
    Related-Bug: #1320628

    Change-Id: Ib495a5ab15de88051c5fa7abfb58a5445691dcad

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.