Stuck in vm_state SHUTOFF and task_state rebuild_spawning - stuck in to_xml() ?

Bug #1277494 reported by Mark McLoughlin
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Unassigned

Bug Description

See http://logs.openstack.org/32/71532/1/gate/gate-tempest-dsvm-full/a90b312/

The first test to fail is test_rebuild_server_in_stop_state. Sequence of events is:

  1. rebuild with new image id
  2. wait to transition to SHUTOFF/None
  3. Test is done, run cleanup operations ...
  4. Rebuild to old image id
  5. Wait to hit SHUTOFF/None
  6. Start the instance again

It is step 5 we get stuck and time out. We make this transition:

  2014-02-06 22:29:51,724 State transition "SHUTOFF/rebuilding" ==> "SHUTOFF/rebuild_spawning" after 2 second wait

and never transition to ACTIVE/powering-off ... which suggests we get stuck in spawn() somewhere

The last log entry for req-60fa2fbb-de78-4379-8ffa-bd0c70f52039 in n-cpu is:

  [instance: a3ac8847-db15-4f5a-b087-5256b54a36f5] Start to_xml

We never get the corresponding 'End to_xml'

Now ... interestingly, there's a resume operation happening around the same time (req-aeedee18-be57-419e-8775-0af26dd796de) and it fails with:

  "An error occurred while trying to launch a defined domain with xml:"

Hmm ... also interestingly, there isn't another "Start to_xml" in the logs after this one - perhaps the stuck thread is holding a lock.

tags: added: libvirt
Revision history for this message
Mark McLoughlin (markmc) wrote :

Smells quite similar to bug #1273478 - i.e. deadlock caused by holding a lock while waiting on the async network info operation to complete, and then the async operation tries to take the same lock

Joe Gordon (jogo)
tags: added: testing
Revision history for this message
Mark McLoughlin (markmc) wrote :

Seen this again here: http://logs.openstack.org/02/68402/2/gate/gate-tempest-dsvm-full/b4850b1/logs/

e.g.

 $> grep '\(Start\|End\) to_xml' screen-n-cpu.txt.gz | sed 's/\(.*libvirt.driver\) .* \(\(Start\|End\) .*\)/\1 \2/' | cut -c 1-70 | tail -5
 2014-02-10 10:30:18.457 DEBUG nova.virt.libvirt.driver Start to_xml ne
 2014-02-10 10:30:18.695 DEBUG nova.virt.libvirt.driver End to_xml xml=
 2014-02-10 10:30:31.587 DEBUG nova.virt.libvirt.driver Start to_xml ne
 2014-02-10 10:30:31.797 DEBUG nova.virt.libvirt.driver End to_xml xml=
 2014-02-10 10:30:47.212 DEBUG nova.virt.libvirt.driver Start to_xml ne

Revision history for this message
Mark McLoughlin (markmc) wrote :

Suspect this could be the root cause of many "Timed out waiting for thing ... to become ACTIVE" failures - see bug #1254890

Solly Ross (sross-7)
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
haruka tanizawa (h-tanizawa) wrote :

Hi
I couldn't reproduce this problem again with my env.
vif_driver = nova.virt.libvirt.vif.LibvirtGenericVIFDriver
virt_type = qemu
http://paste.openstack.org/show/84454/
I either couldn't paste compute log because it's too big to paste http://paste.openstack.org/ .
But I could watch set of Start and End to_xml process log.

Does this still happen?

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

I believe this has been fixed with upgrades to trusty and newer libvirt. please reopen if needed.

Changed in nova:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.