Comment 3 for bug 1746630

In , Francesco (francesco-redhat-bugs-1) wrote :

Description of problem:
Short summary:
if a QEMU/KVM VM hangs because of unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.

Packages:
qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64

libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64

libvirt recompiled from git, qemu from RHEL

Context:
Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We use QEMU/KVM VMs over shared storage.
Among the calls Vdsm periodically runs to monitor VM state are:

virConnectGetAllDomainStats
virDomainListGetStats
virDomainGetBlockIoTune
virDomainBlockJobInfo
virDomainGetBlockInfo
virDomainGetVcpus

We know from experience that storage may become unresponsive or unreachable, so QEMU monitor calls can hang, which in turn causes libvirt calls to hang.
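
For reference, here is a minimal sketch (not actual Vdsm code) of what those calls look like through the libvirt-python bindings; the connection URI, the domain name 'a1' and the disk target 'vda' are illustrative:

    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('a1')

    all_stats = conn.getAllDomainStats()        # virConnectGetAllDomainStats
    dom_stats = conn.domainListGetStats([dom])  # virDomainListGetStats
    io_tune = dom.blockIoTune('vda', 0)         # virDomainGetBlockIoTune
    job_info = dom.blockJobInfo('vda', 0)       # virDomainBlockJobInfo
    blk_info = dom.blockInfo('vda', 0)          # virDomainGetBlockInfo
    vcpu_info = dom.vcpus()                     # virDomainGetVcpus

When the QEMU monitor is blocked on storage, these are the calls that can hang.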

Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt and leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before issuing any call that goes down to QEMU, using code like

    def isDomainReadyForCommands(self):
        try:
            state, details, stateTime = self._dom.controlInfo()
        except virdomain.NotConnectedError:
            # this method may be called asynchronously by periodic
            # operations. Thus, we must use a try/except block
            # to avoid racy checks.
            return False
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                return False
            else:
                raise
        else:
            return state == libvirt.VIR_DOMAIN_CONTROL_OK

Vdsm actually issues the potentially hanging call if and only if the method above returns True (hence the virDomainGetControlInfo() state is VIR_DOMAIN_CONTROL_OK).
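
For illustration, a minimal sketch (assumed names, not actual Vdsm code) of how that guard wraps a potentially hanging call; 'vm' stands for an object exposing the isDomainReadyForCommands() method quoted above plus the underlying libvirt domain, and 'vda' is an illustrative disk target:

    def sample_block_io_tune(vm):
        # Only touch the QEMU monitor when libvirt reports the domain as
        # ready (VIR_DOMAIN_CONTROL_OK); otherwise skip this sampling round.
        if not vm.isDomainReadyForCommands():
            return None
        return vm._dom.blockIoTune('vda', 0)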

When the NFS server is unreachable, the protection layer in Vdsm triggers, and Vdsm avoids sending libvirt calls. After a while, however, we see virDomainGetControlInfo() calls not responding anymore, like this
(full log attached):

2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobSet:96 : Thread 3069 (virNetServerHandleJob) is now running job remoteDispatchDomainGetVcpus
2016-05-18 06:01:45.920+0000: 3069: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a70004070 classname=virDomain
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainGetVcpus:7733 : dom=0x7f5a70004070, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a70002140, maxinfo=2, cpumaps=0x7f5a70002200, maplen=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930f6f00(name=stack) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: debug : virAccessManagerCheckDomain:234 : manager=0x7f5a930ebdf0(name=none) driver=QEMU domain=0x7f5a64012c40 perm=1
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930f6f00
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.920+0000: 3069: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a64009bf0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f5a70004070
2016-05-18 06:01:45.920+0000: 3069: debug : virDomainDispose:313 : release domain 0x7f5a70004070 a1 048f8624-03fc-4729-8f4d-12cb4387f018
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.920+0000: 3069: debug : virThreadJobClear:121 : Thread 3069 (virNetServerHandleJob) finished job remoteDispatchDomainGetVcpus with ret=0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a930ff630
2016-05-18 06:01:45.920+0000: 3069: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a93110890
2016-05-18 06:01:45.922+0000: 3070: warning : qemuDomainObjBeginJobInternal:2180 : Cannot start job (query, none) for domain a1; current job is (query, none) owned by (3066 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (30s, 0s)
2016-05-18 06:01:45.922+0000: 3070: error : qemuDomainObjBeginJobInternal:2192 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.usage
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : virCgroupGetValueStr:814 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d1\x2da1.scope/cpuacct.stat
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3505 user=1507 sys=209 cpu=1 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: debug : qemuGetProcessInfo:1486 : Got status for 3500/3506 user=1150 sys=55 cpu=0 rss=531128
2016-05-18 06:01:45.922+0000: 3070: debug : virFileClose:102 : Closed fd 26
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.922+0000: 3070: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x7f5a30178fe0
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:45.938+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:45.939+0000: 3068: debug : virThreadJobSet:96 : Thread 3068 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo
2016-05-18 06:01:45.939+0000: 3068: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f5a6c002b50 classname=virDomain
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c000ec0
2016-05-18 06:01:45.939+0000: 3068: debug : virDomainGetControlInfo:2526 : dom=0x7f5a6c002b50, (VM: name=a1, uuid=048f8624-03fc-4729-8f4d-12cb4387f018), info=0x7f5a79a50ad0, flags=0
2016-05-18 06:01:45.939+0000: 3068: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a64009bf0
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a93110890
2016-05-18 06:01:47.940+0000: 3065: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a930ff630
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: info : virObjectRef:296 : OBJECT_REF: obj=0x7f5a5c0009c0
2016-05-18 06:01:47.940+0000: 3067: debug : virThreadJobSet:96 : Thread 3067 (virNetServerHandleJob) is now running job remoteDispatchDomainGetControlInfo

Please note that these calls begin and never end; previously, calls began and ended in a couple of milliseconds at most.

Version-Release number of selected component (if applicable):
1.3.4

How reproducible:
100% of the time, but the issue takes a random amount of time to surface. I've seen it happen after a few minutes or after a few hours.

Steps to Reproduce:
1. create a QEMU domain on NFS shared storage (see attached dom.xml for example)
2. periodically run monitoring calls, each one protected by virDomainGetControlInfo() as shown above (see the sketch after this list)
3. wait some time, at least a couple of hours, to have a good chance of seeing the failure
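
A minimal reproduction sketch for step 2 (assumptions: libvirt-python is installed, the domain is named 'a1' as in the attached dom.xml, and 'vda' is the disk target):

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('a1')

    while True:
        # The guard call: normally it returns in a couple of milliseconds;
        # once the bug triggers it simply never returns.
        state, details, state_time = dom.controlInfo()
        if state == libvirt.VIR_DOMAIN_CONTROL_OK:
            # Potentially hanging, QEMU-bound monitoring calls.
            dom.vcpus()
            dom.blockInfo('vda', 0)
            conn.getAllDomainStats()
        time.sleep(2)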

Actual results:
After a random time, virDomainGetControlInfo() stops responding

Expected results:
virDomainGetControlInfo() keeps responding

Additional info:
1. I can't tell whether it is an interplay of libvirt calls, so I can't tell whether calling only virDomainGetControlInfo() periodically is sufficient to make it hang
2. when virDomainGetControlInfo() hangs, virsh stops responding as well (even virsh -r list hangs)