virsh api is stuck when vm is down with NFS broken
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned | |
| Mitaka | Fix Released | Undecided | Seyeong Kim | |
| libvirt | Fix Released | High | | |
| libvirt (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Xenial | Fix Released | Undecided | Seyeong Kim | |
Bug Description
[Impact]
virsh commands hang if there is a broken VM on broken NFS storage.
This affects Xenial and UCA-Mitaka.
[Test Case]
1. Deploy a VM with NFS storage (running).
2. Block NFS via iptables on the host machine:
   - iptables -A OUTPUT -d NFS_SERVER_IP -p tcp --dport 2049 -j DROP
3. virsh blkdeviotune generic hda => hangs
4. virsh domstats => hangs
5. virsh list => hangs (a scripted version of this check is sketched below)
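A minimal detection sketch in Python, assuming the libvirt Python bindings, a qemu:///system connection, and an arbitrary 30-second timeout (none of these specifics come from the report): run the libvirt call in a watchdog thread and treat a timeout as the hang described above.

import threading
import libvirt

def virsh_list_hangs(uri="qemu:///system", timeout=30.0):
    """Return True if enumerating domains (virsh list) does not finish in time."""
    names = []

    def worker():
        conn = libvirt.open(uri)
        try:
            # Equivalent of `virsh list --all`; on unpatched libvirt this
            # blocks when a domain's NFS-backed disk is unreachable.
            names.extend(dom.name() for dom in conn.listAllDomains())
        finally:
            conn.close()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    return t.is_alive()  # worker still blocked => the API call is stuck

if __name__ == "__main__":
    print("hung" if virsh_list_hangs() else "ok")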
[Regression]
After the patch, domstats and list return within a short timeout instead of hanging. Applying the fix requires restarting libvirt-bin, so hosts with many VMs will be affected for a short time while the service restarts.
[Others]
This bug is tracked in a Red Hat bug report [1], discussed on the mailing list [2], and fixed by git commits [3][4][5];
the fix was merged upstream in 1.3.5.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
description: updated
tags: added: sts
Changed in libvirt:
importance: Unknown → High
status: Unknown → Fix Released
Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Xenial):
status: New → Triaged
summary: virsh api is stuck when vm is down with NFS borken → virsh api is stuck when vm is down with NFS broken
Changed in libvirt (Ubuntu Xenial):
assignee: nobody → Seyeong Kim (xtrusia)
Changed in cloud-archive:
status: New → Fix Released
tags: added: sts-sru-done; removed: sts-sru-needed
Description of problem:
Short summary:
If a QEMU/KVM VM hangs on unresponsive storage (NFS server unreachable), after a random amount of time virDomainGetControlInfo() stops responding.
Packages:
qemu-kvm-tools-rhev-2.3.0-31.el7_2.14.x86_64
ipxe-roms-qemu-20130517-7.gitc4bce43.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.14.x86_64
qemu-img-rhev-2.3.0-31.el7_2.14.x86_64
qemu-kvm-common-rhev-2.3.0-31.el7_2.14.x86_64
libvirt-1.3.4-1.el7.x86_64
libvirt-client-1.3.4-1.el7.x86_64
libvirt-daemon-1.3.4-1.el7.x86_64
libvirt-daemon-config-network-1.3.4-1.el7.x86_64
libvirt-daemon-config-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-driver-interface-1.3.4-1.el7.x86_64
libvirt-daemon-driver-lxc-1.3.4-1.el7.x86_64
libvirt-daemon-driver-network-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nodedev-1.3.4-1.el7.x86_64
libvirt-daemon-driver-nwfilter-1.3.4-1.el7.x86_64
libvirt-daemon-driver-qemu-1.3.4-1.el7.x86_64
libvirt-daemon-driver-secret-1.3.4-1.el7.x86_64
libvirt-daemon-driver-storage-1.3.4-1.el7.x86_64
libvirt-daemon-kvm-1.3.4-1.el7.x86_64
libvirt-daemon-lxc-1.3.4-1.el7.x86_64
libvirt-debuginfo-1.3.4-1.el7.x86_64
libvirt-devel-1.3.4-1.el7.x86_64
libvirt-docs-1.3.4-1.el7.x86_64
libvirt-lock-sanlock-1.3.4-1.el7.x86_64
libvirt-login-shell-1.3.4-1.el7.x86_64
libvirt-nss-1.3.4-1.el7.x86_64
libvirt-python-1.2.17-2.el7.x86_64
libvirt recompiled from git, qemu from RHEL
Context:
Vdsm is the node management system of oVirt (http://www.ovirt.org) and uses libvirt to run and monitor VMs. We use QEMU/KVM VMs, over shared storage.
Among the calls Vdsm periodically runs to monitor the VM state (sketched in Python after this list):
virConnectGetAllDomainStats
virDomainListGetStats
virDomainGetBlockIoTune
virDomainBlockJobInfo
virDomainGetBlockInfo
virDomainGetVcpus
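For reference, a rough sketch of those calls through the libvirt Python bindings (an illustration only; the "hda" disk name and the connection URI are placeholders, not taken from the report):

import libvirt

conn = libvirt.open("qemu:///system")
doms = conn.listAllDomains()

all_stats = conn.getAllDomainStats()        # virConnectGetAllDomainStats
list_stats = conn.domainListGetStats(doms)  # virDomainListGetStats

for dom in doms:
    io_tune = dom.blockIoTune("hda")        # virDomainGetBlockIoTune
    job_info = dom.blockJobInfo("hda")      # virDomainBlockJobInfo
    blk_info = dom.blockInfo("hda")         # virDomainGetBlockInfo
    vcpu_info = dom.vcpus()                 # virDomainGetVcpus

Each of these can end up waiting on the QEMU monitor, which is why they all hang once the monitor is stuck on unreachable storage.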
We know from experience that storage may become unresponsive/unreachable, so QEMU monitor calls can hang, leading in turn to libvirt calls hanging.
Vdsm does the monitoring using a thread pool. Should one of the worker threads become unresponsive, it is replaced. To avoid stalling libvirt, and to avoid leaking threads indefinitely, Vdsm has one additional protection layer: it inspects the libvirt state before making calls that go down to QEMU, using code like
def isDomainReadyForCommands(self):
    try:
        state, details, stateTime = self._dom.controlInfo()
    except virdomain.NotConnectedError:
        # this method may be called asynchronously by periodic
        # operations. Thus, we must use a try/except block
        # to avoid racy checks.
        return False
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            return False
        else:
            raise
    else:
        return state == libvirt.VIR_DOMAIN_CONTROL_OK
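For illustration, this is how such a guard is meant to gate the potentially hanging calls (a sketch only; the vm and dom names are assumptions for the example, not Vdsm's actual structure):

if vm.isDomainReadyForCommands():
    # Only now issue a call that may block on the QEMU monitor.
    io_tune = dom.blockIoTune("hda")
else:
    # Skip this monitoring cycle rather than hang a worker thread.
    pass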
Vdsm actually issues the potentially hanging call if and only if the check above returns True (hence the virDomainGetControlInfo() state is VIR_DOMAIN_CONTROL_OK).
When the NFS server is unreachable, the protection layer in Vdsm triggers and Vdsm avoids sending libvirt calls. After a while, however, we see virDomainGetControlInfo() calls not responding anymore, like
(full log attached)
2016-0...