[18.10] Timed out message while taking dump using virsh dumpxml command & fails with 'held by remoteDispatchDomainCoreDump' error

Bug #1771827 reported by bugproxy
This bug affects 1 person
Affects                           Status        Importance  Assigned to                             Milestone
The Ubuntu-power-systems project  Fix Released  Medium      David Britton
libvirt (Ubuntu)                  Fix Released  Undecided   Ubuntu on IBM Power Systems Bug Triage
Bionic                            Won't Fix     Undecided   Unassigned

Bug Description

Problem Description:
=========================
Tried to take a guest memory dump using the virsh dump command; it fails with a timeout and the 'held by remoteDispatchDomainCoreDump' error.

Steps to re-create:
============================
1. Guest boslcp3g4 is installed with the 4.15.0-15-generic kernel.
2. LTP and memory map tests were running inside the guest.
3. After some time the guest was in a hung state.
4. Tried to take a dump using virsh dump:

root@boslcp3:~# virsh dump boslcp3g4 boslcp3g4_mmap_ltp --memory-only

error: Failed to core dump domain boslcp3g4 to boslcp3g4_mmap_ltp
error: Disconnected from qemu:///system due to keepalive timeout
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCoreDump)

root@boslcp3:~# virsh list --all
 Id   Name        State
 1    boslcp3g3   running
 2    boslcp3g4   paused
 4    boslcp3g1   running

5. It fails with 'Timed out during operation' and the 'held by remoteDispatchDomainCoreDump' error.
6. /var/log/syslog shows:

Apr 18 03:29:14 boslcp3 libvirtd[5538]: 2018-04-18 08:29:13.956+0000: 5576: warning : qemuDomainObjBeginJobInternal:4863 : Cannot start job (query, none) for domain boslcp3g4; current job is (async nested, dump) owned by (5574 remoteDispatchDomainCoreDump, 5574 remoteDispatchDomainCoreDump) for (701s, 701s)
Apr 18 03:29:14 boslcp3 libvirtd[5538]: 2018-04-18 08:29:13.958+0000: 5576: error : qemuDomainObjBeginJobInternal:4875 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCoreDump)
Apr 18 03:29:44 boslcp3 libvirtd[5538]: 2018-04-18 08:29:44.492+0000: 5573: warning : qemuDomainObjBeginJobInternal:4863 : Cannot start job (query, none) for domain boslcp3g4; current job is (async nested, dump) owned by (5574 remoteDispatchDomainCoreDump, 5574 remoteDispatchDomainCoreDump) for (731s, 732s)

7. Attached syslog & sosreport
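
The "Cannot start job" warnings in step 6 carry the blocked domain, the job currently holding the lock, its owner, and how long it has been held. A minimal sketch for pulling those fields out of a syslog line follows. The pattern and the field names (`job`, `async_job`, `owner`, and so on) are assumptions inferred only from the sample lines quoted above, not from the libvirt source.

```python
import re

# Field layout inferred from the warnings quoted in this report only;
# the group names are illustrative, not libvirt terminology.
JOB_WARNING = re.compile(
    r"Cannot start job \((?P<wanted>[^,]+), (?P<wanted_async>[^)]+)\) "
    r"for domain (?P<domain>[^;\s]+); "
    r"current job is \((?P<job>[^,]+), (?P<async_job>[^)]+)\) "
    r"owned by \((?P<owner_tid>\d+) (?P<owner>\w+), \d+ \w+\) "
    r"for \((?P<job_age>\d+)s, (?P<async_age>\d+)s\)"
)

# Message part of the first warning shown in step 6.
SAMPLE = (
    "Cannot start job (query, none) for domain boslcp3g4; "
    "current job is (async nested, dump) owned by "
    "(5574 remoteDispatchDomainCoreDump, 5574 remoteDispatchDomainCoreDump) "
    "for (701s, 701s)"
)


def parse_job_warning(line):
    """Return the warning's fields as a dict, or None if the line does not match."""
    match = JOB_WARNING.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("owner_tid", "job_age", "async_age"):
        info[key] = int(info[key])
    return info
```

Running `parse_job_warning` over every line of /var/log/syslog and keeping the largest `job_age` per domain gives a quick view of how long a dump job has been holding the state change lock.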

== Comment: #3 - Application Cdeadmin <email address hidden> - 2018-04-18 08:11:01 ==
When I ran the same command a second time it succeeded, but syslog keeps logging the warning below continuously:

warning : :4863 : Cannot start job (query, none) for domain boslcp3g4; current job is (async nested, dump) owned by (5574 remoteDispatchDomainCoreDump, 5574

root@boslcp3:~# virsh dump boslcp3g4 boslcp3g4_mmapltp --memory-only
Domain boslcp3g4 dumped to boslcp3g4_mmapltp

vmcore located at:
vmcore at kte111:/LOGS/boslcp3g4/boslcp3g4_mmapltp
Access kte111 using debug@9.3.111.155 (don2rry)

== Comment: #8 - Application Cdeadmin <email address hidden> - 2018-04-19 05:26:32 ==
Tried to start guest boslcp3g1, which uses a QLogic disk as both boot and I/O disk:
root@boslcp3:~# virsh list --all
 Id   Name        State
 1    boslcp3g4   running
 3    boslcp3g3   running
 -    boslcp3g1   shut off

root@boslcp3:~# echo 10240 > /proc/sys/vm/nr_hugepages
root@boslcp3:~# virsh start --console boslcp3g1

--> Then saw the guest go into paused state.
root@boslcp3:/home# virsh list --all
 Id   Name        State
 1    boslcp3g4   running
 3    boslcp3g3   running
 5    boslcp3g1   paused
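
When watching for this condition across several guests, the paused ones can be picked out of the `virsh list --all` output. The sketch below is only an illustration: `paused_domains` is a hypothetical helper, and it assumes domain names contain no spaces and that the columns are Id, Name, State as in the listing above.

```python
def paused_domains(listing):
    """Return the names of domains whose State column reads 'paused'."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows look like: <id or '-'> <name> <state words...>.
        # Skip the header, separator lines, and blank lines.
        if len(fields) < 3 or fields[0] == "Id":
            continue
        state = " ".join(fields[2:])  # handles two-word states like "shut off"
        if state == "paused":
            names.append(fields[1])
    return names


# Listing copied from the session above.
LISTING = """\
 Id Name State
 1 boslcp3g4 running
 3 boslcp3g3 running
 5 boslcp3g1 paused
"""
```

Here `paused_domains(LISTING)` yields only `boslcp3g1`, the guest that was just started.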

Then tried to destroy the guest, which fails with 'Timed out during operation: cannot acquire state change lock'. The resume command fails as well.

Corresponding syslog from /var/log:
Apr 19 05:17:09 boslcp3 libvirtd[5576]: 2018-04-19 10:17:09.056+0000: 5635: error : virProcessKillPainfully:401 : Failed to terminate process 142520 with SIGKILL: Device or resource busy

== Comment: #26 - Shivaprasad G. Bhat <email address hidden> - 2018-05-17 08:57:25 ==
I got to test the patches independently. The upstream commits below fix the false alarms and allow the dump to go through cleanly.

a5bc7130f3
e712579200
150930e309
9a1755b7fe
501e3c3c96
88c2360753
3455a7359c
fd1a9e5c56
2a4d847e77
9d73df98c2
93412bb827
a8ef7b69dc
5870f95a7a
3f99bb06d1

bugproxy (bugproxy) wrote : boslcp1 host logs

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-166912 severity-medium targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → libvirt (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → David Britton (davidpbritton)
importance: Undecided → Medium
tags: added: triage-g
Christian Ehrhardt  (paelzer) wrote : Re: Timed out message while taking dump using virsh dumpxml command & fails with 'held by remoteDispatchDomainCoreDump' error

Hi,
At first I was a bit worried that memory dumps might be broken in general.
But then I saw that this only happens when super-stressing the guest from the inside.

### workload dependency ###

You ran LTP and memory map tests in the guest; I tried to recreate that without success.

1. normal mem dump
$ virsh dump b1 b1.dump2 --verbose --memory-only

Works fine with and without --verbose, as well as in the async mode, where it can be checked with domjobinfo.

2. your case, which keeps the guest memory busy
I gave it 20 CPUs and spun up plenty of memory stressors

$ stress-ng --mmap 20 --mmapmany 20 --mmapaddr 20 --shm 20 --memcpy 20 --memthrash 20 --metrics-brief --timeout 300
This consumes all CPUs multiple times over and puts memory under stress.
But I have to admit, I thought a dump without --live pauses the guest while dumping anyway?

While the above ran I dumped the guest, still fine.
$ virsh dump b1 b1.dump3 --verbose --memory-only

3. busy and non paused
With above load still running:
$ virsh dump b1 b1.dump3 --live --verbose --memory-only

Also good.

So I summarize this part as: the guest really is dumpable in all but the most extreme cases.
In terms of memory stress, my workload is already way beyond any normal workload.
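
The three dump variants tried above differ only in their flags, so they can be scripted for repeated runs. This is a minimal sketch: `dump_command` is a hypothetical helper, and it uses only the `virsh dump` options that appear in the commands above (`--memory-only`, `--live`, `--verbose`).

```python
import subprocess


def dump_command(domain, path, live=False, verbose=False):
    """Build the argv for a memory-only 'virsh dump' of the given domain."""
    argv = ["virsh", "dump", domain, path, "--memory-only"]
    if live:
        argv.append("--live")  # keep the guest running while dumping
    if verbose:
        argv.append("--verbose")  # show progress
    return argv


def run_dump(domain, path, **options):
    """Run the dump and return virsh's exit code (needs a libvirt host)."""
    return subprocess.run(dump_command(domain, path, **options)).returncode
```

For example, `run_dump("b1", "b1.dump3", live=True, verbose=True)` corresponds to test 3 above; a non-zero return code would indicate a failure such as the timeout in the report.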

### changes review ###

The suggested series is also huge, at 14 patches.
It is essentially a whole new set of features and behavior changes for dumping guest memory.
It definitely doesn't seem to be SRUable, as the risk of affecting many working dump cases is way too high IMHO.

### TL;DR ###

I like the upstream changes; they are all in 4.3, and I'll strive to pick something >= 4.3 for the upcoming Ubuntu Cosmic release.
But for SRUs in the current form we would need
- a much smaller change (more a fix than a bunch of features)
- an at least semi-realistic case where the error occurs

So I'm marking this for work in Cosmic, but will for now set Bionic to Won't Fix.

Changed in libvirt (Ubuntu Bionic):
status: New → Won't Fix
Changed in libvirt (Ubuntu):
status: New → Triaged
tags: added: libvirt-18.10
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
summary: - Timed out message while taking dump using virsh dumpxml command & fails
- with 'held by remoteDispatchDomainCoreDump' error
+ [18.10] Timed out message while taking dump using virsh dumpxml command
+ & fails with 'held by remoteDispatchDomainCoreDump' error
Christian Ehrhardt  (paelzer) wrote :

All discussed changes were in 4.1, plus one in 4.3, so with the merge of libvirt 4.6 this is done for Ubuntu Cosmic 18.10.
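
To check whether a given host already carries the reworked dump code, its libvirt version can be compared against the 4.3 threshold mentioned in this thread. The sketch below is an assumption-laden illustration: `parse_version` and `has_dump_rework` are hypothetical helpers, and the check only covers upstream version numbers, not distro backports.

```python
import re


def parse_version(text):
    """Parse a leading 'major.minor[.micro]' into a tuple of three ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    major, minor, micro = match.groups()
    return (int(major), int(minor), int(micro or 0))


def has_dump_rework(version, minimum=(4, 3, 0)):
    """True if the version is at or above the release said to contain all fixes."""
    return parse_version(version) >= minimum
```

With these helpers, `has_dump_rework("4.6.0")` is true, matching the Cosmic merge, while any 4.0.x version falls below the threshold.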

Changed in libvirt (Ubuntu):
status: Triaged → Fix Released
Changed in ubuntu-power-systems:
status: Triaged → Fix Released
bugproxy (bugproxy)
tags: added: targetmilestone-inin1810
removed: targetmilestone-inin1804
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-06-17 15:13 EDT-------
If he hasn't tested it in three months, I think it's safe to conclude you should just close it.
