Instance hung on first start, but works after being killed and restarted

Bug #1659648 reported by Jason Hobbs
This bug affects 2 people
Affects        Status   Importance  Assigned to  Milestone
nova (Ubuntu)  Invalid  Undecided   Unassigned
qemu (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

Description
===========
An instance did not come up properly the first time it was started. It had to be manually killed and then restarted through the API before it worked.

Steps To Reproduce
==================
* I started an instance through the API
* I noticed I couldn't connect to its floating IP.
* I checked the nova compute node and the qemu-kvm process was running at 100% cpu, and its console log was empty.
* I tried to shut it down through virsh on the compute node, but it didn't respond to that.
* I had to kill it with kill <pid>.
* After that I started the instance again through the API and it came up properly and everything is working.

Expected Result
===============
The instance starts correctly the first time.

Actual Result
=============
I had to kill and restart the instance.

This doesn't happen every time, but I have seen it more than once. It seems to be some kind of race.

I'm at a bit of a loss as to how to debug this, but I can probably reproduce it if there is other information that would help.

Environment
===========
This is with Mitaka running on Xenial (amd64), using the libvirt+KVM hypervisor, Open vSwitch networking, and no attached volumes.

Package versions:
ii nova-common 2:13.1.2-0ubuntu2 all OpenStack Compute - common files
ii nova-compute 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node libvirt support
ii python-nova 2:13.1.2-0ubuntu2 all OpenStack Compute Python libraries
ii python-novaclient 2:3.3.1-2ubuntu1 all client library for OpenStack Compute API - Python 2.7
ii libvirt-bin 1.3.1-1ubuntu10.6 amd64 programs for the libvirt library
ii libvirt0:amd64 1.3.1-1ubuntu10.6 amd64 library for interfacing with different virtualization systems
ii python-libvirt 1.3.1-1ubuntu1 amd64 libvirt Python bindings

Linux hayward-44 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Jason Hobbs (jason-hobbs) wrote :
description: updated
Ryan Beisner (1chb1n)
tags: added: arm64 uosci
tags: added: oil
Christian Ehrhardt  (paelzer) wrote :

Hi Jason,
thanks for the report.

FYI - we discussed this last week, including what to debug the next time we hit it.

We can't do much yet - for now this report is meant to be a focal point that others can find if they run into the same issue.

Christian Ehrhardt  (paelzer) wrote :

Documenting my brainstorm list of things to check next time here:

This might not be a perfect solution, but it is a start.
It is just what came to my mind and might be incomplete.

0. always report the /var/log/libvirt/qemu/guestname.log so one can consider the timing
0. also include the guest xml description (virsh dumpxml <guestname>)
0. the host might have extra info in dmesg if it was a host issue
1. is KVM on this system still doing something? Check with
    perf kvm stat live
2. I'd check if it is hanging in the kernel or the guest. Something like "ps axlf" will help you.
    If a wchan is assigned and not changing, it is likely hanging on some host kernel queue.
    In that case report that and then try to dump the host so that one can analyze the crash dump.
    If possible to provide a login to the system while in the state - even better.
3. the qemu monitor can sometimes help to check more of the status qemu thinks things are in.
    See https://en.wikibooks.org/wiki/QEMU/Monitor#info for a start
4. if the host thinks all is fine, but the guest is hanging (virsh shutdown requires the guest to
    cooperate, which might be why it did not stop in your case) you might want to debug the guest.
4a. Start with something as trivial as "virsh console". That can't show output from before you attached, so if the guest is stuck
    the console will be empty now - if you happened to set up the main serial console to log to a file you are lucky and can check that.
4b. The next level is guest debugging via gdb. See https://en.wikibooks.org/wiki/QEMU/Monitor#gdbserver

In general the Monitor can do a lot, but it depends so much on the case that I can't write it all up.
I hope the links I added help.
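
As a rough illustration of the checklist above, here is a minimal Python sketch that gathers the non-interactive items (guest XML, monitor status, dmesg, "ps axlf") and samples the wait channel of the qemu process. It is only a sketch under assumptions: the function names, the output file layout and the sampling interval are made up for this example, it assumes virsh and /proc are available on the compute node, and it assumes libvirt permits "virsh qemu-monitor-command" for the guest. "perf kvm stat live" is left out because it runs until interrupted.

```python
import subprocess
import time
from pathlib import Path


def diagnostic_commands(guest):
    """Return the checklist commands for one libvirt guest as argv lists."""
    return [
        ["virsh", "dumpxml", guest],                                       # item 0
        ["virsh", "qemu-monitor-command", guest, "--hmp", "info status"],  # item 3
        ["dmesg"],                                                         # item 0
        ["ps", "axlf"],                                                    # item 2
    ]


def qemu_log_path(guest, log_dir="/var/log/libvirt/qemu"):
    """Path of the per-guest qemu log that should be attached to the report."""
    return Path(log_dir) / f"{guest}.log"


def wchan_looks_stuck(samples):
    """True if every sample names the same kernel wait channel.

    "0" and "-" mean the task was not sleeping in the kernel when sampled,
    so those are not treated as a host kernel hang.
    """
    if not samples:
        return False
    first = samples[0]
    return first not in ("", "0", "-") and all(s == first for s in samples)


def sample_wchan(pid, count=5, interval=1.0):
    """Read /proc/<pid>/wchan a few times (main thread of the process only)."""
    samples = []
    for _ in range(count):
        samples.append(Path(f"/proc/{pid}/wchan").read_text().strip())
        time.sleep(interval)
    return samples


def collect(guest, out_dir):
    """Run each command and store stdout and stderr in out_dir (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, argv in enumerate(diagnostic_commands(guest)):
        result = subprocess.run(argv, capture_output=True, text=True)
        (out / f"{i}-{argv[0]}.txt").write_text(result.stdout + result.stderr)
```

Calling collect("<guestname>", "/tmp/hang-report") and wchan_looks_stuck(sample_wchan(<pid>)) while the guest is still hung would cover items 0, 2 and 3. A task that is burning CPU, as in this report, will usually show "0" as its wait channel, which points away from a host kernel queue. The vCPU threads have their own entries under /proc/<pid>/task/, which this sketch does not look at.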

Christian Ehrhardt  (paelzer) wrote :

For now setting to Incomplete to reflect that no one can really work on it until more data is provided.

Changed in qemu (Ubuntu):
status: New → Incomplete
Sean Dague (sdague) wrote :

This should probably be tagged to the distro Nova, not the upstream Nova, unless we can figure out that upstream has a specific issue here.

no longer affects: nova
James Page (james-page)
Changed in nova (Ubuntu):
status: New → Invalid
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired