Description of problem:
Short summary:
If a QEMU/KVM VM hangs due to unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.
Context:
Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We use QEMU/KVM VMs over shared storage.
Among the calls Vdsm periodically runs to monitor the VM state:

    virConnectGetAllDomainStats
    virDomainListGetStats
    virDomainGetBlockIoTune
    virDomainBlockJobInfo
    virDomainGetBlockInfo
    virDomainGetVcpus
We know from experience that storage may become unresponsive/unreachable, so QEMU monitor calls can hang, causing the libvirt calls to hang in turn.
Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt, and to avoid leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before issuing any call that goes down to QEMU, using code like:
    def isDomainReadyForCommands(self):
        try:
            state, details, stateTime = self._dom.controlInfo()
        except virdomain.NotConnectedError:
            # this method may be called asynchronously by periodic
            # operations. Thus, we must use a try/except block
            # to avoid racy checks.
            return False
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return False
            else:
                raise
        else:
            return state == libvirt.VIR_DOMAIN_CONTROL_OK
Vdsm actually issues the potentially hanging call if and only if the method above returns True (hence the virDomainGetControlInfo() state is VIR_DOMAIN_CONTROL_OK).
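As an illustration only, here is a minimal sketch of how such a guard wraps one of the potentially hanging monitoring calls; the method name sampleVcpus is hypothetical, not the actual Vdsm code:

    # Hypothetical periodic operation on the same class as
    # isDomainReadyForCommands() above; only the guard pattern
    # reflects what Vdsm actually does.
    def sampleVcpus(self):
        if not self.isDomainReadyForCommands():
            # QEMU is busy or stuck (e.g. blocked on unreachable NFS):
            # skip this cycle instead of issuing a call that may hang.
            return
        # reached only when virDomainGetControlInfo() reported
        # VIR_DOMAIN_CONTROL_OK, so this call is expected not to block
        self._dom.vcpus()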
When the NFS server is unreachable, the protection layer in Vdsm triggers and Vdsm avoids sending further libvirt calls. After a while, however, we see virDomainGetControlInfo() calls that never return, like:
(full log attached)
2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobSet:96 : Thread 3069 (virNetServerHandleJob) is now running job remoteDispatchDomainGetVcpus
2016-05-18 06:01:45.920+0000: 3069: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a70004070 classname=virDomain
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainGetVcpus:7733 : dom=0x7f5a70004070, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a70002140, maxinfo=2, cpumaps=0x7f5a70002200, maplen=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930f6f00(name=stack) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930ebdf0(name=none) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainDispose:313 : release domain 0x7f5a70004070 a1 048f8624-03fc-4729-8f4d-12cb4387f018
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobClear:121 : Thread 3069 (virNetServerHandleJob) finished job remoteDispatchDomainGetVcpus with ret=0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930ff630
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a93110890
2016-05-18 06:01:45.922+0000: 3070: warning : qemuDomainObjBeginJobInternal:2180 : Cannot start job (query, none) for domain a1; current job is (query, none) owned by (3066 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (30s, 0s)
2016-05-18 06:01:45.922+0000: 3070: error : qemuDomainObjBeginJobInternal:2192 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.usage
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.stat
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: debug : virThreadJobSet:96 : Thread 3068 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo
2016-05-18 06:01:45.939+0000: 3068: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a6c002b50 classname=virDomain
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.939+0000: 3068: debug : virDomainGetControlInfo:2526 : dom=0x7f5a6c002b50, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a79a50ad0, flags=0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: debug : virThreadJobSet:96 : Thread 3067 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo
Please note that these calls begin and never end; previously, calls began and ended within a couple of milliseconds at most.
Version-Release number of selected component (if applicable):
libvirt 1.3.4, recompiled from git; qemu from RHEL:

qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64
How reproducible:
100% of the time, but the issue takes a random amount of time to surface. I've seen it happen after a few minutes or after a few hours.
Steps to Reproduce:
1. create a QEMU domain on NFS shared storage (see the attached dom.xml for an example)
2. periodically run monitoring calls, each one protected by virDomainGetControlInfo() as described above (see the sketch after these steps)
3. wait some time, at least a couple of hours, to have a good chance of seeing the failure
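For step 2, a minimal standalone polling loop could look like the sketch below (using the libvirt Python bindings; the domain name a1 matches the attached log, while the 'vda' disk target and the 2-second period are assumptions):

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('a1')  # domain name as in the attached log

    while True:
        # guard: virDomainGetControlInfo()
        state, details, stateTime = dom.controlInfo()
        if state == libvirt.VIR_DOMAIN_CONTROL_OK:
            # monitoring calls, issued only when the guard says OK
            # (error handling as in isDomainReadyForCommands() omitted)
            dom.vcpus()
            dom.blockInfo('vda')  # assumed disk target
        time.sleep(2)  # assumed polling period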
Actual results:
After a random amount of time, virDomainGetControlInfo() stops responding.
Expected results:
virDomainGetControlInfo() keeps responding.
Additional info:
1. I can't tell if this is an interplay of several libvirt calls, so I can't tell if calling only virDomainGetControlInfo() periodically is sufficient to make it hang
2. when virDomainGetControlInfo() hangs, virsh stops responding as well (even virsh -r list hangs)