virsh api is stuck when vm is down with NFS broken
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Mitaka | Fix Released | Undecided | Seyeong Kim | |
| libvirt | Fix Released | High | | |
| libvirt (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Xenial | Fix Released | Undecided | Seyeong Kim | |
Bug Description
[Impact]
virsh commands hang if there is a broken VM on broken NFS storage.
This affects Xenial and UCA-Mitaka.
[Test Case]
1. Deploy a VM with NFS storage (running).
2. Block NFS via iptables on the host machine:
   - iptables -A OUTPUT -d NFS_SERVER_IP -p tcp --dport 2049 -j DROP
3. virsh blkdeviotune generic hda => hangs
4. virsh domstats => hangs
5. virsh list => hangs (a scripted version of this check is sketched below)
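A minimal detection sketch in Python, assuming the libvirt Python bindings, a qemu:///system connection, and an arbitrary 30-second timeout (none of these specifics come from the report): run the libvirt call in a watchdog thread and treat a timeout as the hang described above.

import threading
import libvirt

def virsh_list_hangs(uri="qemu:///system", timeout=30.0):
    """Return True if enumerating domains (virsh list) does not finish in time."""
    names = []

    def worker():
        conn = libvirt.open(uri)
        try:
            # Equivalent of `virsh list --all`; on unpatched libvirt this
            # blocks when a domain's NFS-backed disk is unreachable.
            names.extend(dom.name() for dom in conn.listAllDomains())
        finally:
            conn.close()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    return t.is_alive()  # worker still blocked => the API call is stuck

if __name__ == "__main__":
    print("hung" if virsh_list_hangs() else "ok")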
[Regression]
After the patch, domstats and list return within a short timeout instead of hanging. Applying the fix requires restarting libvirt-bin, so hosts with many VMs will be affected for a short time while the service restarts.
[Others]
This bug is tracked in a Red Hat bug report [1], discussed on the mailing list [2], and fixed by git commits [3][4][5];
the fix was merged upstream in 1.3.5.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
description: updated
tags: added: sts
Changed in libvirt:
importance: Unknown → High
status: Unknown → Fix Released
Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Xenial):
status: New → Triaged
summary: virsh api is stuck when vm is down with NFS borken → virsh api is stuck when vm is down with NFS broken
Changed in libvirt (Ubuntu Xenial):
assignee: nobody → Seyeong Kim (xtrusia)
Changed in cloud-archive:
status: New → Fix Released
tags: added: sts-sru-done; removed: sts-sru-needed
Description of problem:
Short summary:
If a QEMU/KVM VM hangs on unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.
Packages:
qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt recompiled from git, qemu from RHEL
Context:
Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We use QEMU/KVM VMs, over shared storage.
Among the calls Vdsm periodically runs to monitor the VM state (sketched in Python after this list):
virConnectGetAllDomainStats
virDomainListGetStats
virDomainGetBlockIoTune
virDomainBlockJobInfo
virDomainGetBlockInfo
virDomainGetVcpus
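For reference, a rough sketch of those calls through the libvirt Python bindings (an illustration only; the "hda" disk name and the connection URI are placeholders, not taken from the report):

import libvirt

conn = libvirt.open("qemu:///system")
doms = conn.listAllDomains()

all_stats = conn.getAllDomainStats()        # virConnectGetAllDomainStats
list_stats = conn.domainListGetStats(doms)  # virDomainListGetStats

for dom in doms:
    io_tune = dom.blockIoTune("hda")        # virDomainGetBlockIoTune
    job_info = dom.blockJobInfo("hda")      # virDomainBlockJobInfo
    blk_info = dom.blockInfo("hda")         # virDomainGetBlockInfo
    vcpu_info = dom.vcpus()                 # virDomainGetVcpus

Each of these can end up waiting on the QEMU monitor, which is why they all hang once the monitor is stuck on unreachable storage.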
We know from experience that storage may become unresponsive/unreachable, so QEMU monitor calls can hang, leading in turn to libvirt calls hanging.
Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt, and to avoid leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before making calls that go down to QEMU, using code like
def isDomainReadyForCommands(self):
    try:
        state, details, stateTime = self._dom.controlInfo()
    except virdomain.NotConnectedError:
        # this method may be called asynchronously by periodic
        # operations. Thus, we must use a try/except block
        # to avoid racy checks.
        return False
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            return False
        else:
            raise
    else:
        return state == libvirt.VIR_DOMAIN_CONTROL_OK
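For illustration, this is how such a guard is meant to gate the potentially hanging calls (a sketch only; the vm and dom names are assumptions for the example, not Vdsm's actual structure):

if vm.isDomainReadyForCommands():
    # Only now issue a call that may block on the QEMU monitor.
    io_tune = dom.blockIoTune("hda")
else:
    # Skip this monitoring cycle rather than hang a worker thread.
    pass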
Vdsm actually issues the potentially hanging call if and only if the check above returns True (hence the virDomainGetControlInfo() state is VIR_DOMAIN_CONTROL_OK).
When the NFS server is unreachable, the protection layer in Vdsm triggers and Vdsm avoids sending libvirt calls. After a while, however, we see virDomainGetControlInfo() calls not responding anymore, like
(full log attached)
2016-0...