libvirt domain is not listed/managed after libvirt restart with messages "internal error: no monitor path" and "Failed to load config for domain"

Bug #2059272 reported by Mauricio Faria de Oliveira
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Fix Released
Undecided
Unassigned
Focal
Fix Released
Medium
Mauricio Faria de Oliveira
Jammy
Fix Released
Medium
Mauricio Faria de Oliveira

Bug Description

[ Impact ]

 * If a race condition occurs on libvirtd shutdown,
   a QEMU domain status XML (/run/libvirt/qemu/*.xml)
   might lose the QEMU-driver specific information,
   such as '<monitor path=.../>'.
   (The race condition details are in [Other Info].)

 * On the next libvirtd startup, the parsing of that
   QEMU domain's status XML fails as '<monitor path='
   is not found:

  $ journalctl -b -u libvirtd.service | tail
  ...
  ... libvirtd[2789]: internal error: no monitor path
  ... libvirtd[2789]: Failed to load config for domain 'test-vm'

 * As a result, the domain is not listed in `virsh list`,
   and `virsh` commands to it fail.

  $ virsh list
   Id Name State
  --------------------

 * The domain is still running, but libvirt considers
   it as shutdown, which might cause conflicts/issues
   with higher-level tools (e.g., openstack nova).

  $ virsh list --all
   Id Name State
  --------------------------
   - test-vm shut off

  $ pgrep -af qemu-system-x86_64 | cut -d, -f1
  2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

[ Test Plan ]

 * (Focal/Jammy) shutdown-on-runtime:
   Synthetic reproducer/verification with GDB in comments #1 and #2 (Jammy) and #12 and #14 (Focal).

 * (Focal-only) shutdown-on-init:
   Synthetic reproducer/verification with GDB in comments #13 and #15.

 * On failure, the XML is saved *without* '<monitor path='
   and libvirt fails to parse the domain on startup.
   The domain is *not* listed in `virsh list`.

 * On success, the XML is saved *with* '<monitor path='
   and libvirt correctly parses the domain on startup.
   The domain is listed in `virsh list`.

 * Normal 'restart' testing in comment #5.

 * Test packages built successfully in all architectures
   with -proposed enabled in Launchpad PPA mfo/lp2059272 [0]

[0] https://launchpad.net/~mfo/+archive/ubuntu/lp2059272

[ Regression Potential ]

 * One patch changes *where* in the libvirt qemu driver's
   shutdown path the worker thread pool is stopped/freed:
   from _after_ releasing other data to _before_ doing so.

 * The other patch (Focal-only) skips the update of the
   QEMU domain status XML file during initialization if
   libvirt is shutting down. (This is OK since the file
   is not going to be used anyway in the current run as
   it is shutting down, and it will be updated again in
   the next run anyway.)

 * Therefore, the potential for regression is limited to
   the libvirt qemu driver's shutdown path, and would be
   observed when stopping/restarting libvirtd.service.

 * The behavior during normal operation is not affected.

[Other Info]

 * In Focal, race windows exist if libvirtd shuts down
   _after_ initialization and _during_ initialization
   (which is unlikely in practice, but it's possible.)

   Say, 'shutdown'on-runtime' and 'shutdown-on-init'.

 * In Jammy, only 'shutdown-on-runtime' might happen,
   due to the introduction of the '.stateShutdownWait'
   driver callback (not available in Focal), which
   indirectly prevents the 'shutdown-on-init' race
   due to additional synchronization with locking.

 * For 'shutdown-on-runtime': use upstream commit [1].
   It's needed in Focal and Jammy (included in Mantic).

 * For 'shutdown-on-init' (Focal-only), we should use a
   downstream-only patch (with conservative behavior),
   since upstream addressed this issue indirectly with
   the '.stateShutdownWait' callbacks and other changes
   (which are not SRU material, ~10 patches, redesign [2])
   in 6.8.0.

[1] https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842

 $ git describe --contains 152770333449cd3b78b4f5a9f1148fc1f482d842
 v9.3.0-rc1~90

 $ rmadison -a source libvirt | sed -n '/focal/,$p'
  libvirt | 6.0.0-0ubuntu8 | focal | source
  libvirt | 6.0.0-0ubuntu8.16 | focal-security | source
  libvirt | 6.0.0-0ubuntu8.16 | focal-updates | source
  libvirt | 6.0.0-0ubuntu8.17 | focal-proposed | source
  libvirt | 8.0.0-1ubuntu7 | jammy | source
  libvirt | 8.0.0-1ubuntu7.5 | jammy-security | source
  libvirt | 8.0.0-1ubuntu7.8 | jammy-updates | source
  libvirt | 9.6.0-1ubuntu1 | mantic | source
  libvirt | 10.0.0-2ubuntu1 | noble | source
  libvirt | 10.0.0-2ubuntu5 | noble-proposed | source

[2] https://listman.redhat.com/archives/libvir-list/2020-July/205291.html
[PATCH 00/10] resolve hangs/crashes on libvirtd shutdown

commit 94e45d1042e21e03a15ce993f90fbef626f1ae41
Author: Nikolay Shirokovskiy <email address hidden>
Date: Thu Jul 23 09:53:04 2020 +0300

rpc: finish all threads before exiting main loop

$ git describe --contains 94e45d1042e21e03a15ce993f90fbef626f1ae41
v6.8.0-rc1~279

[Original Description]

There's a race condition on libvirtd shutdown
that might cause the domain status XML file(s)
to lose the '<monitor path=...'> tag/field.

This causes an error on libvirtd startup, and
the domain is not listed/managed, despite it
is still running.

 $ virsh list
  Id Name State
 -------------------------
  1 test-vm running

 $ sudo systemctl restart libvirtd.service

 $ journalctl -b -u libvirtd.service | tail
 ...
 ... libvirtd[2789]: internal error: no monitor path
 ... libvirtd[2789]: Failed to load config for domain 'test-vm'

 $ virsh list
  Id Name State
 --------------------

 $ virsh list --all
  Id Name State
 --------------------------
  - test-vm shut off

 $ pgrep -af qemu-system-x86_64 | cut -d, -f1
 2638 /usr/bin/qemu-system-x86_64 -name guest=test-vm,

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):
Download full text (6.9 KiB)

Steps to reproduce on Jammy
---

Stop libvirt systemd units

 sudo systemctl stop 'libvirtd*'

Start libvirt in GDB

 sudo gdb \
   -iex 'set confirm off' \
   -iex 'set pagination off' \
   -iex 'set debuginfod enabled on' \
   -iex 'set debuginfod urls https://debuginfod.ubuntu.com' \
   -ex 'set non-stop on' \
   -ex 'handle SIGTERM nostop noprint pass' \
   -ex 'add-symbol-file /usr/sbin/libvirtd' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt.so.0' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt-qemu.so.0' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt/connection-driver/libvirt_driver_qemu.so' \
   /usr/sbin/libvirtd

Add breakpoints for qemu driver cleanup and device deleted event

 b qemuStateCleanup
 b processDeviceDeletedEvent
 run

Start test VM with an USB mouse device

 cat <<-EOF >test-vm.xml
 <domain type='qemu'>
   <name>test-vm</name>
   <os>
     <type>hvm</type>
   </os>
   <memory unit='MiB'>32</memory>
   <vcpu>1</vcpu>
   <devices>
     <input type='mouse' bus='usb'/>
   </devices>
 </domain>
 EOF

 virsh define test-vm.xml
 virsh start test-vm

 $ virsh list
  Id Name State
 -------------------------
  1 test-vm running

Delete the USB mouse device

 DEVICE_ID=$(virsh qemu-monitor-command test-vm --hmp 'info qtree' | grep 'dev: usb-mouse' | cut -d'"' -f2)
 virsh qemu-monitor-command test-vm --hmp "device_del $DEVICE_ID"

Back to GDB

 Thread 25 "qemu-event" hit Breakpoint 2, 0x00007f6179ed20a7 in processDeviceDeletedEvent (devAlias=<optimized out>, vm=0x7f61842f1020, driver=0x7f6184035e20) at ../../src/qemu/qemu_driver.c:3536

Add breakpoint to domain status XML save, and continue the thread above

 b virDomainObjSave
 t 25
 c

 Thread 25 "qemu-event" hit Breakpoint 3, virDomainObjSave (obj=0x7f61842f1020, xmlopt=0x7f6184028010, statusDir=0x7f6184035460 "/run/libvirt/qemu") at ../../src/conf/domain_conf.c:28879

Check the backtrace of the domain status XML save function, coming from device deleted event

 (gdb) bt
 #0 virDomainObjSave (obj=0x7f61842f1020, xmlopt=0x7f6184028010, statusDir=0x7f6184035460 "/run/libvirt/qemu") at ../../src/conf/domain_conf.c:28879
 #1 0x00007f6179eb68c3 in qemuDomainObjSaveStatus (driver=0x7f6184035e20, obj=0x7f61842f1020) at ../../src/qemu/qemu_domain.c:5801
 #2 0x00007f6179ed2159 in processDeviceDeletedEvent (devAlias=0x7f617c0073e0 "input0", vm=0x7f61842f1020, driver=0x7f6184035e20) at ../../src/qemu/qemu_driver.c:3557
 #3 qemuProcessEventHandler (data=0x7f617c0072b0, opaque=0x7f6184035e20) at ../../src/qemu/qemu_driver.c:4184
 #4 0x00007f61974fc983 in virThreadPoolWorker (opaque=<optimized out>) at ../../src/util/virthreadpool.c:164
 #5 0x00007f61974fb4d9 in virThreadHelper (data=<optimized out>) at ../../src/util/virthread.c:241
 #6 0x00007f6196e64ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
 #7 0x00007f6196ef6850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Leave the thread at this point

Let's trigger the shutdown path

First, increase the shutdown timer (30 seconds is too fast for me; use 30 minutes)

 (gdb) b virEventAddTimeout

 $ sudo kill $(pidof libvirtd)

 Thread 1 "libvirtd"...

Read more...

Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Focal):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Changed in libvirt (Ubuntu Jammy):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Mauricio Faria de Oliveira (mfo)
description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (3.8 KiB)

Steps with test packages on Jammy
---

Test packages built successfully in all architectures
with -proposed enabled in Launchpad PPA mfo/lp2059272.

https://launchpad.net/~mfo/+archive/ubuntu/lp2059272

Upgrade the libvirt packages and install debug symbols

 $ sudo add-apt-repository -y -c 'main' -c 'main/debug' ppa:mfo/lp2059272
 $ sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym}

 $ dpkg -l | grep libvirt
 ii libvirt-clients 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-config-network 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-config-nwfilter 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-dbgsym 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-driver-qemu 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-driver-qemu-dbgsym 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-system 8.0.0-1ubuntu7.9 ...
 ii libvirt-daemon-system-systemd 8.0.0-1ubuntu7.9 ...
 ii libvirt0:amd64 8.0.0-1ubuntu7.9 ...
 ii libvirt0-dbgsym:amd64 8.0.0-1ubuntu7.9 ...

...

Repeat the 'Steps to reproduce' in comment #1, until this point,
and notice the differences from here.

...

Check there are 2 threads: cleanup and domain status XML save

 (gdb) i th
   Id Target Id Frame
   1 Thread 0x7f1e79642ac0 (LWP 4404) "libvirtd" qemuStateCleanup () at ../../src/qemu/qemu_driver.c:1070
   18 Thread 0x7f1e507f8640 (LWP 4424) "gmain" (running)
   19 Thread 0x7f1e4fff7640 (LWP 4425) "gdbus" (running)
   20 Thread 0x7f1e4f7f6640 (LWP 4426) "udev-event" (running)
   26 Thread 0x7f1e50ff9640 (LWP 4496) "vm-test-vm" (running)
   27 Thread 0x7f1e4e7f4640 (LWP 4506) "qemu-event" virDomainObjSave (obj=0x7f1e6c074040, xmlopt=0x7f1e6c028010, statusDir=0x7f1e6c03b3d0 "/run/libvirt/qemu") at ../../src/conf/domain_conf.c:28879

Confirm the qemu driver's domain xml formatter/options is set/referenced:

 t 27

 (gdb) p xmlopt.privateData.format
 $1 = (virDomainXMLPrivateDataFormatFunc) 0x7f1e7054ada0 <qemuDomainObjPrivateXMLFormat>

 (gdb) p xmlopt.parent.parent_instance
 $2 = {g_type_instance = {g_class = 0x7f1e6c052000}, ref_count = 1, qdata = 0x0}

Let the cleanup function and shutdown path finish

 t 1
 c &

Check the formatter/options again; it is *STILL* referenced:

 (gdb) p xmlopt.privateData.format
 $3 = (virDomainXMLPrivateDataFormatFunc) 0x7f1e7054ada0 <qemuDomainObjPrivateXMLFormat>

 (gdb) p xmlopt.parent.parent_instance
 $4 = {g_type_instance = {g_class = 0x7f1e6c052000}, ref_count = 1, qdata = 0x0}

So, we keep `xmlopt.privateData.format` as it is
(and NOT set it to `0` as in Steps to Reproduce).

Check the VM status XML *before* the save function finishes:

 $ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
 <domstatus state='running' reason='booted' pid='4493'>
   <monitor path='/var/lib/libvirt/qemu/domain-4-test-vm/monitor.sock' type='unix'/>
   <domain type='qemu' id='4'>

Let the save function continue, and libvirt finishes shutting down:

 (gdb) c
 Continuing.
 ...
 [Inferior 1 (process 4404) exited norma...

Read more...

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Mauricio,

Wow! Such an awesome reproducer. Very detailed and easy to follow. Thanks for providing it.

I was able to verify the problem here, and looked at the upstream patch[1] that fixes it. The rationale makes sense to me, although that cleanup function is pretty involved and it's hard to say if there can be any fallout from moving the worker thread pool freeing action earlier.

I looked at the upstream repository and could not find any amends/fixes to the commit in question. It's present in Mantic and Noble, which is a good sign.

Do you have a PPA build that I can try with the reproducer to check the fix, please?

Otherwise, this LGTM and I'm +1 on proceeding with the SRU. Thanks!

[1]: For reference: https://gitlab.com/libvirt/libvirt/-/commit/152770333449cd3b78b4f5a9f1148fc1f482d842

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Hah, I didn't see your comment before I posted mine. Mid air conflict!

Anyway, thanks for providing the PPA. I'll take it for a spin :-).

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test case for normal operations in this code path:

Restart libvirt 100 times with 10 QEMU domains.

All domains continued to be managed by libvirt.
No error messages observed.

$ for NAME in test-vm-{1..10}; do cat <<-EOF >test-vms.xml && virsh define test-vms.xml && virsh start $NAME; done
<domain type='qemu'>
  <name>${NAME}</name>
  <os>
    <type>hvm</type>
  </os>
  <memory unit='MiB'>32</memory>
  <vcpu>1</vcpu>
</domain>
EOF

Domain 'test-vm-1' defined from test-vms.xml
Domain 'test-vm-1' started
...
Domain 'test-vm-10' defined from test-vms.xml
Domain 'test-vm-10' started

$ virsh list
 Id Name State
----------------------------
 6 test-vm-1 running
 7 test-vm-2 running
 8 test-vm-3 running
 9 test-vm-4 running
 10 test-vm-5 running
 11 test-vm-6 running
 12 test-vm-7 running
 13 test-vm-8 running
 14 test-vm-9 running
 15 test-vm-10 running

$ for i in {1..100}; do echo restart $i; sudo systemctl restart libvirtd.service; sleep 10; done
restart 1
...
restart 100

$ virsh list
 Id Name State
----------------------------
 6 test-vm-1 running
 7 test-vm-2 running
 8 test-vm-3 running
 9 test-vm-4 running
 10 test-vm-5 running
 11 test-vm-6 running
 12 test-vm-7 running
 13 test-vm-8 running
 14 test-vm-9 running
 15 test-vm-10 running

$ journalctl -b -u libvirtd.service | grep -v -e 'systemd' -e 'hostname:' | cut -d' ' -f7- | sort | uniq -c | sort -rn
    108 check dnsmasq binary /usr/sbin/dnsmasq: No such file or directory
    106 version: 8.0.0, package: 1ubuntu7.9 (Mauricio Faria de Oliveira <email address hidden> Wed, 27 Mar 2024 12:47:46 -0300)
      1 version: 8.0.0, package: 1ubuntu7.8 (Lena Voytek <email address hidden> Wed, 29 Nov 2023 14:52:52 -0700)

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

OK, I tested the fixed package from your PPA and verified that it indeed solves the issue.

Just to reiterate, then: LGTM, +1!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Uploaded to Jammy.

description: updated
Changed in libvirt (Ubuntu Jammy):
status: Confirmed → In Progress
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Please include the "normal" execution from comment #5 in the test plan, besides the synthetic one.

Changed in libvirt (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Mauricio, or anyone else affected,

Accepted libvirt into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/8.0.0-1ubuntu7.9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

> Please include the "normal" execution from comment #5 in the test plan, besides the synthetic one.

Done; thanks!

description: updated
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):
Download full text (4.3 KiB)

Verification done on jammy-proposed.
---

Part 1: comment #2, libvirt starts without errors, and can list and manage the domain.
Part 2: comment #5, libvirt restarts without errors 100 times with 10 domains.

Environment:
---

LXD virtual machine

 lxc launch --vm ubuntu:jammy lp2059272-jammy
 lxc exec lp2059272-jammy -- su - ubuntu

Enable -proposed

 sudo add-apt-repository -yp proposed

 cat <<EOF | sudo tee /etc/apt/preferences.d/proposed
 Package: *
 Pin: release a=jammy-proposed
 Pin-Priority: 400
 EOF

Install

 sudo apt install -y -t jammy-proposed libvirt-daemon libvirt-daemon-system libvirt-daemon-driver-qemu
 sudo apt install -y gdb qemu-system-x86

 $ apt-cache policy libvirt-daemon
 libvirt-daemon:
   Installed: 8.0.0-1ubuntu7.9
   Candidate: 8.0.0-1ubuntu7.9
   Version table:
  *** 8.0.0-1ubuntu7.9 400
  400 http://security.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
  400 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
  100 /var/lib/dpkg/status
      8.0.0-1ubuntu7.8 500
  500 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
      8.0.0-1ubuntu7.5 500
  500 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
      8.0.0-1ubuntu7 500
  500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

Part 1:
---

GDB
...

Check there are 2 threads: cleanup and domain status XML save

 (gdb) i th
   Id Target Id Frame
   1 Thread 0x7fe039120ac0 (LWP 3405) "libvirtd" qemuStateCleanup () at ../../src/qemu/qemu_driver.c:1070
   18 Thread 0x7fe00ffff640 (LWP 3425) "gmain" (running)
   19 Thread 0x7fe00f7fe640 (LWP 3426) "gdbus" (running)
   20 Thread 0x7fe00effd640 (LWP 3473) "udev-event" (running)
   24 Thread 0x7fe014e07640 (LWP 3600) "vm-test-vm" (running)
   25 Thread 0x7fe00dffb640 (LWP 3620) "qemu-event" virDomainObjSave (obj=0x7fe028301a30, xmlopt=0x7fe028040010, statusDir=0x7fe0280373b0 "/run/libvirt/qemu") at ../../src/conf/domain_conf.c:28879

Confirm the qemu driver's domain xml formatter/options is set/referenced:

 (gdb) t 25

 (gdb) p xmlopt.privateData.format
 $1 = (virDomainXMLPrivateDataFormatFunc) 0x7fe0176b4da0 <qemuDomainObjPrivateXMLFormat>

 (gdb) p xmlopt.parent.parent_instance
 $2 = {g_type_instance = {g_class = 0x7fe02806cf20}, ref_count = 1, qdata = 0x0}

Let the cleanup function and shutdown path finish

 (gdb) t 1
 (gdb) c &

 (gdb) t 25

Check the formatter/options again; it is *STILL* referenced:

 (gdb) p xmlopt.privateData.format
 $3 = (virDomainXMLPrivateDataFormatFunc) 0x7fe0176b4da0 <qemuDomainObjPrivateXMLFormat>

 (gdb) p xmlopt.parent.parent_instance
 $4 = {g_type_instance = {g_class = 0x7fe02806cf20}, ref_count = 1, qdata = 0x0}

Check the VM status XML *before* the save function finishes:

 $ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
 <domstatus state='running' reason='booted' pid='3597'>
   <monitor path='/var/lib/libvirt/qemu/domain-1-test-vm/monitor.sock' type='unix'/>
   <domain type='qemu' id='1'>

Let the save function continue, and libvirt finishes shutting down:

 (gdb) c
 ...
 [Inferior 1 (process 3405) exited normal...

Read more...

tags: added: verification-done-jammy
removed: verification-needed-jammy
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):
Download full text (7.1 KiB)

Steps to reproduce on Focal (shutdown-on-runtime)
---

LXD virtual machine

 lxc launch --vm ubuntu:focal lp2059272-focal
 lxc exec lp2059272-focal -- su - ubuntu

Latest Packages and Debug Symbols:

 cat <<EOF | sudo tee /etc/apt/sources.list.d/proposed.list
 deb http://archive.ubuntu.com/ubuntu focal-proposed main universe
 deb http://ddebs.ubuntu.com focal-proposed main restricted
 EOF

 cat <<EOF | sudo tee /etc/apt/preferences.d/proposed
 Package: *
 Pin: release a=focal-proposed
 Pin-Priority: 400
 EOF

 sudo apt install --yes --no-install-recommends gdb qemu-system-x86 ubuntu-dbgsym-keyring
 sudo apt install --yes --no-install-recommends -t focal-proposed libvirt{0,-daemon{,-driver-qemu,-system}}{,-dbgsym} libvirt-clients

 $ dpkg -l | awk '$2 ~ /^libvirt/ { print $2, $3 }'
 libvirt-clients 6.0.0-0ubuntu8.17
 libvirt-daemon 6.0.0-0ubuntu8.17
 libvirt-daemon-dbgsym 6.0.0-0ubuntu8.17
 libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.17
 libvirt-daemon-driver-qemu-dbgsym 6.0.0-0ubuntu8.17
 libvirt-daemon-system 6.0.0-0ubuntu8.17
 libvirt-daemon-system-systemd 6.0.0-0ubuntu8.17
 libvirt0:amd64 6.0.0-0ubuntu8.17
 libvirt0-dbgsym:amd64 6.0.0-0ubuntu8.17

Stop libvirt systemd units

 sudo systemctl stop 'libvirtd*'

Start libvirt in GDB

  sudo gdb \
    -iex 'set confirm off' \
    -iex 'set pagination off' \
    -ex 'set non-stop on' \
    -ex 'handle SIGTERM nostop noprint pass' \
    -ex 'add-symbol-file /usr/sbin/libvirtd' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt-qemu.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt/connection-driver/libvirt_driver_qemu.so' \
    /usr/sbin/libvirtd

Add breakpoints for qemu driver cleanup and device deleted event

 b qemuStateCleanup
 b processDeviceDeletedEvent
 run

Start test VM with an USB mouse device

  cat <<-EOF >test-vm.xml
  <domain type='qemu'>
    <name>test-vm</name>
    <os>
      <type>hvm</type>
    </os>
    <memory unit='MiB'>32</memory>
    <vcpu>1</vcpu>
    <devices>
      <input type='mouse' bus='usb'/>
    </devices>
  </domain>
EOF

 virsh define test-vm.xml
 virsh start test-vm

 $ virsh list
 Id Name State
 -------------------------
 1 test-vm running

Delete the USB mouse device

 DEVICE_ID=$(virsh qemu-monitor-command test-vm --hmp 'info qtree' | grep 'dev: usb-mouse' | cut -d'"' -f2)
 virsh qemu-monitor-command test-vm --hmp "device_del $DEVICE_ID"

Back to GDB

 Thread 20 "libvirtd" hit Breakpoint 2, 0x00007f316df78ffe in processDeviceDeletedEvent (devAlias=<optimized out>, vm=0x7f3178016890, driver=0x7f317802edd0) at ../../../src/qemu/qemu_driver.c:4879

Add breakpoint to domain status XML save, and continue the thread above

 b virDomainObjSave
 t 20
 c

 Thread 20 "libvirtd" hit Breakpoint 3, virDomainObjSave (obj=0x7f3178016890, xmlopt=0x7f3178037780, statusDir=0x7f317800aa40 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157

Check the backtrace of the domain status XML save function, coming from device deleted event

 (gdb) bt
 #0 virDomainObjSave (obj=0x7f3178016890, xmlopt=0x7f3178037780, statusDir=0x7f317800aa40 "/run/libvirt/qemu") at ../../../...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):
Download full text (6.3 KiB)

Steps to reproduce on Focal (shutdown-on-init)
---

LXD virtual machine

 lxc launch --vm ubuntu:focal lp2059272-focal
 lxc exec lp2059272-focal -- su - ubuntu

Latest Packages and Debug Symbols:

 cat <<EOF | sudo tee /etc/apt/sources.list.d/proposed.list
 deb http://archive.ubuntu.com/ubuntu focal-proposed main universe
 deb http://ddebs.ubuntu.com focal-proposed main restricted
 EOF

 cat <<EOF | sudo tee /etc/apt/preferences.d/proposed
 Package: *
 Pin: release a=focal-proposed
 Pin-Priority: 400
 EOF

 sudo apt install --yes --no-install-recommends gdb qemu-system-x86 ubuntu-dbgsym-keyring
 sudo apt update
 sudo apt install --yes --no-install-recommends -t focal-proposed libvirt{0,-daemon{,-driver-qemu,-system}}{,-dbgsym} libvirt-clients

 $ dpkg -l | awk '$2 ~ /^libvirt/ { print $2, $3 }'
 libvirt-clients 6.0.0-0ubuntu8.17
 libvirt-daemon 6.0.0-0ubuntu8.17
 libvirt-daemon-dbgsym 6.0.0-0ubuntu8.17
 libvirt-daemon-driver-qemu 6.0.0-0ubuntu8.17
 libvirt-daemon-driver-qemu-dbgsym 6.0.0-0ubuntu8.17
 libvirt-daemon-system 6.0.0-0ubuntu8.17
 libvirt-daemon-system-systemd 6.0.0-0ubuntu8.17
 libvirt0:amd64 6.0.0-0ubuntu8.17
 libvirt0-dbgsym:amd64 6.0.0-0ubuntu8.17

Start test VM

 cat <<-EOF >test-vm.xml
 <domain type='qemu'>
   <name>test-vm</name>
   <os>
     <type>hvm</type>
   </os>
   <memory unit='MiB'>32</memory>
   <vcpu>1</vcpu>
 </domain>
 EOF

 virsh define test-vm.xml
 virsh start test-vm

 $ virsh list
  Id Name State
 -------------------------
  1 test-vm running

Stop libvirt systemd units

 sudo systemctl stop 'libvirtd*'

Start libvirt in GDB

 sudo gdb \
   -iex 'set confirm off' \
   -iex 'set pagination off' \
   -ex 'set non-stop on' \
   -ex 'handle SIGTERM nostop noprint pass' \
   -ex 'add-symbol-file /usr/sbin/libvirtd' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt.so.0' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt-qemu.so.0' \
   -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt/connection-driver/libvirt_driver_qemu.so' \
   /usr/sbin/libvirtd

Add breakpoints for qemu driver cleanup and domain status XML save

 b qemuStateCleanup
 b virDomainObjSave
 run

 Thread 20 "libvirtd" hit Breakpoint 2, virDomainObjSave (obj=0x555cf5d83480, xmlopt=0x555cf5d7e6a0, statusDir=0x555cf5d26e70 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157

Check the backtrace of the domain status XML save function, coming from QEMU process reconnect:

 t 20

 (gdb) bt
 #0 virDomainObjSave (obj=0x555cf5d83480, xmlopt=0x555cf5d7e6a0, statusDir=0x555cf5d26e70 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
 #1 0x00007f743b666268 in qemuProcessReconnect (opaque=<optimized out>) at ../../../src/qemu/qemu_process.c:8122
 #2 0x00007f7460b9054a in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
 #3 0x00007f7460851609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #4 0x00007f7460776353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Leave the thread at this point

Let's trigger the shutdown path

 $ sudo kill $(pidof libvirtd)

 Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (4.3 KiB)

Steps with test packages on Focal (shutdown-on-runtime)
---

Environment:
---

On top of LXD VM in comments #12/#13.

Enable PPA & debug symbols

 sudo add-apt-repository -yn ppa:mfo/lp2059272
 sudo sed '/^deb / s,$, main/debug,' -i /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
 sudo apt update

Install packages

 sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym} libvirt-clients gdb qemu-system-x86

 $ dpkg -s libvirt-daemon | grep ^Version:
 Version: 6.0.0-0ubuntu8.18~ppa1

Libvirtd debug logging

 cat <<EOF | sudo tee -a /etc/libvirt/libvirtd.conf
 log_filters="1:qemu 1:libvirt"
 log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
 EOF

Follow `Steps to reproduce on Focal (shutdown-on-runtime)` in comment #12
---

Up to ...

Check the backtrace of the domain status XML save function, coming from device deleted event

 Thread 20 "libvirtd" hit Breakpoint 3, virDomainObjSave (obj=0x7fb3202733b0, xmlopt=0x7fb3200400d0, statusDir=0x7fb32000a620 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
 29157 ../../../src/conf/domain_conf.c: No such file or directory.
 (gdb) bt
 #0 virDomainObjSave (obj=0x7fb3202733b0, xmlopt=0x7fb3200400d0, statusDir=0x7fb32000a620 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
 #1 0x00007fb32c20d627 in processDeviceDeletedEvent (devAlias=0x564ae9f032e0 "input0", vm=0x7fb3202733b0, driver=0x7fb320039ed0) at ../../../src/qemu/qemu_driver.c:4367
 #2 qemuProcessEventHandler (data=0x564ae9fc88f0, opaque=0x7fb320039ed0) at ../../../src/qemu/qemu_driver.c:4943
 #3 0x00007fb333ed51af in virThreadPoolWorker (opaque=opaque@entry=0x564ae9fd3080) at ../../../src/util/virthreadpool.c:163
 #4 0x00007fb333ed451c in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
 #5 0x00007fb333b95609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
 #6 0x00007fb333aba353 in clone () from /lib/x86_64-linux-gnu/libc.so.6

 $ sudo kill $(pidof libvirtd)

 Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1180

 t 20

 (gdb) p xmlopt.privateData.format
 $1 = (virDomainXMLPrivateDataFormatFunc) 0x7fb32c167890 <qemuDomainObjPrivateXMLFormat>

 t 1
 c &

Check the formatter/options again; it is *STILL* referenced, not 0x0 anymore:

 t 20

 (gdb) p xmlopt.privateData.format
 $2 = (virDomainXMLPrivateDataFormatFunc) 0x7fb32c167890 <qemuDomainObjPrivateXMLFormat>

Check the shutdown/cleanup thread is waiting for it,
in the path to free the worker thread pool:

 (gdb) i th 1
   Id Target Id Frame
   1 Thread 0x7fb33007bb40 (LWP 6585) "libvirtd" (running)

 t 1
 interrupt

 (gdb) bt
 #0 0x00007fb333b9c376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
 #1 0x00007fb333ed479b in virCondWait (c=<optimized out>, m=<optimized out>) at ../../../src/util/virthread.c:144
 #2 0x00007fb333ed5438 in virThreadPoolFree (pool=<optimized out>) at ../../../src/util/virthreadpool.c:286
 #3 0x00007fb32c1d89e3 in qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1186
 #4 0x00007fb33407246f in virStateCleanup () at ../../../src/libv...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote (last edit ):
Download full text (15.5 KiB)

Steps with test packages on Focal (shutdown-on-init)
---

Environment:
---

On top of LXD VM in comments #12/#13.

Enable PPA & debug symbols

 sudo add-apt-repository -yn ppa:mfo/lp2059272
 sudo sed '/^deb / s,$, main/debug,' -i /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
 sudo apt update

Install packages

 sudo apt install --yes libvirt{0,-daemon{,-driver-qemu}}{,-dbgsym} libvirt-clients gdb qemu-system-x86

 $ dpkg -s libvirt-daemon | grep ^Version:
 Version: 6.0.0-0ubuntu8.18~ppa1

Libvirtd debug logging

 cat <<EOF | sudo tee -a /etc/libvirt/libvirtd.conf
 log_filters="1:qemu 1:libvirt"
 log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
 EOF

Follow `Steps to reproduce on Focal (shutdown-on-init)` in comment #13
---

Up to ...

Check the backtrace of the domain status XML save function, coming from QEMU process reconnect:

 t 20

 (gdb) bt
 #0 virDomainObjSave (obj=0x7fe638012540, xmlopt=0x7fe63800d4e0, statusDir=0x7fe63800cf10 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
 #1 0x00007fe644190545 in qemuProcessReconnect (opaque=<optimized out>) at ../../../src/qemu/qemu_process.c:8123
 #2 0x00007fe64aebd54a in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
 #3 0x00007fe64ab7e609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
 #4 0x00007fe64aaa3353 in clone () from /lib/x86_64-linux-gnu/libc.so.6

 $ sudo kill $(pidof libvirtd)

 Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1180

 t 20

 (gdb) p xmlopt.privateData.format
 $1 = (virDomainXMLPrivateDataFormatFunc) 0x7fe644152890 <qemuDomainObjPrivateXMLFormat>

Let the cleanup function finish

 t 1
 finish

Notice it took a while (30 seconds).

 (gdb) t 20
 (gdb) p xmlopt.privateData.format
 $3 = (virDomainXMLPrivateDataFormatFunc) 0x0

Let the save function continue, and libvirt finish shutdown:

 (gdb) c &
 (gdb) t 1
 (gdb) c
 (gdb) q

Check the VM status XML *after*:

 ubuntu@lp2059272-focal:~$ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
 <domstatus state='running' reason='booted' pid='6817'>
   <domain type='qemu' id='1'>

And everything happened as in the reproducer.
i.e., the SAME behavior happened BY DEFAULT.
Just with a 30 seconds delay.

Checking the libvirtd debug logs to confirm the patch behavior:

 $ sudo tail -n50 /var/log/libvirt/libvirtd-debug.log | sed -n '/qemuStateCleanupWait/,$p'
 2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1144 : timeout 30, timeout_env '(null)'
 2024-03-30 22:49:24.737+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 0
 2024-03-30 22:49:24.737+0000: 6875: warning : qemuStateCleanupWait:1153 : Waiting for qemuProcessReconnect() threads (1) to end. Configure with LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT (-1 = wait; 0 = do not wait; N = wait up to N seconds; current = 30)
 2024-03-30 22:49:25.740+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 1
 2024-03-30 22:49:26.740+0000: 6875: debug : qemuStateCleanupWait:1150 : threads 1, seconds 2
 2024-03-30 22:49:27.740+0000: 6875: debug : qemuStateCleanupWait:1150 :...

description: updated
description: updated
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote (last edit ):

Mauricio asked me to review the debdiff for Focal as well.

This debdiff is different because he had to implement some code to synchronize the destruction of the qemuProcessReconnect threads with the free'ing of the thread pool.

I looked at the new code, analyzed it as best as I could, asked a few questions to Mauricio regarding the new environment variable being created, and finally got convinced that everything seems OK. The implementation is sound and the concept is simple: qemuStateCleanupWait acts similarly to a thread barrier and makes sure that all threads previously created by qemuProcessReconnect are able to finish before qemuStateCleanup can proceed.

In a nutshell: LGTM, +1.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Thanks for the reviews, Sergio!

That certainly helps building additional confidence in the patch.

With that and tests covering all options (comments #13 and #15)
showing good/expected results, I'm happy to upload it to Focal.

Changed in libvirt (Ubuntu Focal):
status: Confirmed → In Progress
Revision history for this message
Chris Halse Rogers (raof) wrote :

Hello Mauricio, or anyone else affected,

Accepted libvirt into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/6.0.0-0ubuntu8.18 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in libvirt (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

(Un)fortunately, during focal-proposed testing today, a corner case was identified (late),
which will require changes.

Some of the qemuProcessReconnect() threads finished within the timeout but others didn't.

The ones which didn't were waiting on a reply back from the QEMU monitor, but that never
fires because it is processed/delivered by the main event loop, which is already stopped
at libvirtd level by the time the driver-level qemuStateCleanup() runs (where we wait).

The libvirt shutdown does happen after the timeout, but this is slow and not as planned.

I've been considering design options to address it and will submit an incremental upload.

Apologies for the inconvenience.

tags: added: verification-failed-focal
removed: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 8.0.0-1ubuntu7.9

---------------
libvirt (8.0.0-1ubuntu7.9) jammy; urgency=medium

  * d/p/u/lp2059272-qemu-Fix-potential-crash-during-driver-cleanup.patch:
    On QEMU driver cleanup, release (stop) the worker thread pool _first_,
    before other data used by possibly running worker threads (LP: #2059272)

 -- Mauricio Faria de Oliveira <email address hidden> Wed, 27 Mar 2024 12:47:46 -0300

Changed in libvirt (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The new fix for shutdown-on-init on Focal is simpler:
Skip the XML update on init if libvirt is shutting down.

This is based on 2 points:

1) The XML update on initialization will not be used at all
   in this run of libvirtd, since libvirtd is shutting down.

2) The XML update in _this_ initialization will be overwritten
by the XML update in the _next_ initialization anyway!

Hence, it is OK to skip this XML update if shutting down.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Test packages with the new fix in ppa:mfo/lp2059272
built correctly in all supported architectures.

The next 3 comments show steps to reproduce/verify
the issue/fix with the test packages -- all passed.

Environment
---

LXD virtual machine

 lxc launch --vm ubuntu:focal lp2059272-focal
 lxc exec lp2059272-focal -- su - ubuntu

Enable PPA & debug symbols

 sudo add-apt-repository -yn ppa:mfo/lp2059272
 sudo sed '/^deb / s,$, main/debug,' -i /etc/apt/sources.list.d/mfo-ubuntu-lp2059272-focal.list
 sudo apt update

Install packages

 sudo apt install --yes libvirt{0,-daemon{,-driver-qemu,-system}}{,-dbgsym} libvirt-daemon-system-systemd libvirt-clients gdb qemu-system-x86
        newgrp libvirt # or logout/login

 $ apt-cache policy libvirt-daemon
 libvirt-daemon:
   Installed: 6.0.0-0ubuntu8.19
   Candidate: 6.0.0-0ubuntu8.19
   Version table:
  *** 6.0.0-0ubuntu8.19 500
  500 http://ppa.launchpad.net/mfo/lp2059272/ubuntu focal/main amd64 Packages
  100 /var/lib/dpkg/status
      6.0.0-0ubuntu8.16 500
  500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
  500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
      6.0.0-0ubuntu8 500
  500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Libvirtd debug logging

 cat <<-EOF | sudo tee -a /etc/libvirt/libvirtd.conf
 log_filters="1:qemu 1:libvirt"
 log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
 EOF

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (3.7 KiB)

Steps with test packages on Focal (normal restarts)
---

Restart libvirt 100 times with 10 QEMU domains.

All domains continued to be managed by libvirt.

Create 10 test VMs (test-vm-1, test-vm-2, ..., test-vm-10):

 for NAME in test-vm-{1..10}; do cat <<-EOF >test-vms.xml && virsh define test-vms.xml && virsh start $NAME; done
 <domain type='qemu'>
   <name>${NAME}</name>
   <os>
     <type>hvm</type>
   </os>
   <memory unit='MiB'>32</memory>
   <vcpu>1</vcpu>
 </domain>
 EOF

Disable the systemd unit rate limiting for (re)starts:

 sudo mkdir -p /etc/systemd/system/libvirtd.service.d/
 cat <<EOF | sudo tee /etc/systemd/system/libvirtd.service.d/override.conf
 [Unit]
 StartLimitIntervalSec=0
 EOF
 sudo systemctl daemon-reload

Now, we'll restart libvirtd 100 times with a configurable interval between restarts.

We'll restart it 100 times _for each_ configurable interval, so as to stress the
initialization code path running at that time to go together with shutdown path
in subtle variations of timing.

The restart intervals range from 0.1 seconds to 2.0 seconds, in steps of 0.1 second.
(i.e., we restart libvirtd (2.0 / 0.1) * (1 + 100) = 2020 times in this test.)

 for SLEEP in $(seq 0.1 0.1 2.0); do

   echo 'Reset libvirtd debug log'
   sudo systemctl stop 'libvirtd*'
   sudo rm -f /var/log/libvirt/libvirtd-debug.log
   sudo systemctl start libvirtd.service
   sleep 5

   for RESTART in $(seq 1 100); do
     echo "Sleep $SLEEP, Restart $RESTART"
     sudo systemctl restart libvirtd.service
     sleep $SLEEP
   done

   echo 'Check libvirtd debug log'
   sudo grep 'Leaving the update of' /var/log/libvirt/libvirtd-debug.log
   sudo cp /var/log/libvirt/libvirtd-debug.log /tmp/libvirtd-debug.log.SLEEP-${SLEEP}
   echo
 done 2>&1 | tee /tmp/libvirtd-restart.log

 Reset libvirtd debug log
 Sleep 0.1, Restart 1
 ...
 Sleep 0.1, Restart 100
 Check libvirtd debug log

 Reset libvirtd debug log
 Sleep 0.2, Restart 1
 ...
 Sleep 0.2, Restart 100
 Check libvirtd debug log

 ...

 Reset libvirtd debug log
 Sleep 2.0, Restart 1

 Sleep 2.0, Restart 100
 Check libvirtd debug log

Checking that libvirtd is started 1+100 times for each restart interval:

 $ sudo grep -c 'libvirt version' /tmp/libvirtd-debug.log.SLEEP-*
 /tmp/libvirtd-debug.log.SLEEP-0.1:101
 /tmp/libvirtd-debug.log.SLEEP-0.2:101
 /tmp/libvirtd-debug.log.SLEEP-0.3:101
 /tmp/libvirtd-debug.log.SLEEP-0.4:101
 /tmp/libvirtd-debug.log.SLEEP-0.5:101
 /tmp/libvirtd-debug.log.SLEEP-0.6:101
 /tmp/libvirtd-debug.log.SLEEP-0.7:101
 /tmp/libvirtd-debug.log.SLEEP-0.8:101
 /tmp/libvirtd-debug.log.SLEEP-0.9:101
 /tmp/libvirtd-debug.log.SLEEP-1.0:101
 /tmp/libvirtd-debug.log.SLEEP-1.1:101
 /tmp/libvirtd-debug.log.SLEEP-1.2:101
 /tmp/libvirtd-debug.log.SLEEP-1.3:101
 /tmp/libvirtd-debug.log.SLEEP-1.4:101
 /tmp/libvirtd-debug.log.SLEEP-1.5:101
 /tmp/libvirtd-debug.log.SLEEP-1.6:101
 /tmp/libvirtd-debug.log.SLEEP-1.7:101
 /tmp/libvirtd-debug.log.SLEEP-1.8:101
 /tmp/libvirtd-debug.log.SLEEP-1.9:101
 /tmp/libvirtd-debug.log.SLEEP-2.0:101

All VMs are still managed by libvirt:

 $ virsh list
  Id Name State
 ----------------------------
  2 test-vm-1 running
  3 test-vm-2 running
...

Read more...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (17.8 KiB)

Steps with test packages on Focal (shutdown-on-init)
---

Start test VM

  cat <<-EOF >test-vm.xml
  <domain type='qemu'>
    <name>test-vm</name>
    <os>
      <type>hvm</type>
    </os>
    <memory unit='MiB'>32</memory>
    <vcpu>1</vcpu>
  </domain>
 EOF

  virsh define test-vm.xml
  virsh start test-vm

  $ virsh list
   Id Name State
  -------------------------
   1 test-vm running

Stop libvirtd systemd units

  sudo systemctl stop 'libvirtd*'

Scenario 1) Shutdown wins race against XML update (ie, shutdown happens first)

Start libvirtd in GDB

 sudo gdb \
    -iex 'set confirm off' \
    -iex 'set pagination off' \
    -ex 'set non-stop on' \
    -ex 'handle SIGTERM nostop noprint pass' \
    -ex 'add-symbol-file /usr/sbin/libvirtd' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt-qemu.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt/connection-driver/libvirt_driver_qemu.so' \
 /usr/sbin/libvirtd

Stop on initialization

 (gdb) b qemuStateInitialize
 (gdb) run

 Thread 17 "libvirtd" hit Breakpoint 1, qemuStateInitialize (privileged=true, callback=0x5558939f10c0 <daemonInhibitCallback>, opaque=0x555893b905d0) at ../../../src/qemu/qemu_driver.c:644

Save the daemon 'opaque' pointer in $ptr (global variable qemu_driver_dmn is not accessible):

 (gdb) p qemu_driver_dmn
 Cannot access memory at address 0x1e39a8

 (gdb) p 'src/qemu/qemu_driver.c'::qemu_driver_dmn
 Cannot access memory at address 0x1e39a8

 (gdb) t 17
 (gdb) set $ptr = opaque

Run until qemuProcessReconnect

 (gdb) b qemuProcessReconnect
 (gdb) c

Thread 20 "libvirtd" hit Breakpoint 2, qemuProcessReconnect (opaque=0x7fd82c054900) at ../../../src/qemu/qemu_process.c:7922

Run this thread until the lock on qemu_driver_dmn:

 (gdb) b virObjectLock thread 20 if anyobj == $ptr
 (gdb) t 20
 (gdb) c

Thread 20 "libvirtd" hit Breakpoint 3, virObjectLock (anyobj=0x555893b905d0) at ../../../src/util/virobject.c:427

See the daemon is not yet shutting down

 (gdb) t 20
 (gdb) p ((virNetDaemonPtr)anyobj)->quit
 $1 = false

Stop the shutdown path in the main thread on the lock on qemu_driver_dmn

 (gdb) b virObjectLock thread 1 if anyobj == $ptr

 $ sudo kill $(pidof libvirtd)

Thread 1 "libvirtd" hit Breakpoint 4, virObjectLock (anyobj=0x555893b905d0) at ../../../src/util/virobject.c:427

 (gdb) t 1
 #0 virObjectLock (anyobj=0x555893b905d0) at ../../../src/util/virobject.c:427
 #1 0x00007fd83eabc2d5 in virNetDaemonSignalEvent (watch=watch@entry=2, fd=<optimized out>, events=events@entry=1, opaque=opaque@entry=0x555893b905d0) at ../../../src/rpc/virnetdaemon.c:630
 #2 0x00007fd83e97da0d in virEventPollDispatchHandles (fds=0x555893bc21c0, nfds=<optimized out>) at ../../../src/util/vireventpoll.c:503
 #3 virEventPollRunOnce () at ../../../src/util/vireventpoll.c:658
 #4 0x00007fd83e97c095 in virEventRunDefaultImpl () at ../../../src/util/virevent.c:353
 #5 0x00007fd83eabd495 in virNetDaemonRun (dmn=0x555893b905d0) at ../../../src/rpc/virnetdaemon.c:836
 #6 0x00005558939ef7d1 in main (argc=<optimized out>, argv=<optimized out>) at ../../../src/remote/remote_daemon.c:1430

Let it delive...

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (5.6 KiB)

Steps with test packages on Focal (shutdown-on-runtime)
---

Stop libvirtd systemd units

  sudo systemctl stop 'libvirtd*'

Start libvirt in GDB

  sudo gdb \
    -iex 'set confirm off' \
    -iex 'set pagination off' \
    -ex 'set non-stop on' \
    -ex 'handle SIGTERM nostop noprint pass' \
    -ex 'add-symbol-file /usr/sbin/libvirtd' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt-qemu.so.0' \
    -ex 'add-symbol-file /usr/lib/x86_64-linux-gnu/libvirt/connection-driver/libvirt_driver_qemu.so' \
    /usr/sbin/libvirtd

Add breakpoints for qemu driver cleanup and device deleted event

  b qemuStateCleanup
  b processDeviceDeletedEvent
  run

Start test VM with an USB mouse device

   cat <<-EOF >test-vm.xml
   <domain type='qemu'>
     <name>test-vm</name>
     <os>
       <type>hvm</type>
     </os>
     <memory unit='MiB'>32</memory>
     <vcpu>1</vcpu>
     <devices>
       <input type='mouse' bus='usb'/>
     </devices>
   </domain>
 EOF

  virsh define test-vm.xml
  virsh start test-vm

  $ virsh list
  Id Name State
  -------------------------
  1 test-vm running

Delete the USB mouse device

  DEVICE_ID=$(virsh qemu-monitor-command test-vm --hmp 'info qtree' | grep 'dev: usb-mouse' | cut -d'"' -f2)
  virsh qemu-monitor-command test-vm --hmp "device_del $DEVICE_ID"

Back to GDB

  Thread 20 "libvirtd" hit Breakpoint 2, 0x00007ffba902204e in processDeviceDeletedEvent (devAlias=<optimized out>, vm=0x7ffbac00de90, driver=0x7ffbac021380) at ../../../src/qemu/qemu_driver.c:4888

Add breakpoint to domain status XML save, and continue the thread above

  b virDomainObjSave
  t 20
  c

 Thread 20 "libvirtd" hit Breakpoint 3, virDomainObjSave (obj=0x7ffbac00de90, xmlopt=0x7ffbac044130, statusDir=0x7ffbac01f530 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157

Check the backtrace of the domain status XML save function, coming from device deleted event

  (gdb) bt
 #0 virDomainObjSave (obj=0x7ffbac00de90, xmlopt=0x7ffbac044130, statusDir=0x7ffbac01f530 "/run/libvirt/qemu") at ../../../src/conf/domain_conf.c:29157
 #1 0x00007ffba9022127 in processDeviceDeletedEvent (devAlias=0x556074b5e3f0 "input0", vm=0x7ffbac00de90, driver=0x7ffbac021380) at ../../../src/qemu/qemu_driver.c:4312
 #2 qemuProcessEventHandler (data=0x556074b63a10, opaque=0x7ffbac021380) at ../../../src/qemu/qemu_driver.c:4888
 #3 0x00007ffbbee8f1af in virThreadPoolWorker (opaque=opaque@entry=0x556074c047a0) at ../../../src/util/virthreadpool.c:163
 #4 0x00007ffbbee8e51c in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:196
 #5 0x00007ffbbeb4f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #6 0x00007ffbbea74353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Leave the thread at this point

Let's trigger the shutdown path

  $ sudo kill $(pidof libvirtd)

 Thread 1 "libvirtd" hit Breakpoint 1, qemuStateCleanup () at ../../../src/qemu/qemu_driver.c:1127

Check the function pointer is non-NULL _before_ cleanup

 (gdb) p xmlopt.privateData.format
 $1 = (virDomainXMLPrivateDataFormatFunc) 0x7ffba8f7c7c0 <qemuDomainObjPrivateXMLForma...

Read more...

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Hi Mauricio,

Thanks for the detailed explanations, as usual.

I took some time here to read your patch. It took me down the rabbit hole and I did some archaeology to find out what these "inhibit*" pointers are about. It was fun; I learned that they were introduced back in 2012 as a means to allow drivers to signal libvirt when to inhibit shutdown because there are still VMs around. But I digress.

I liked your approach here and indeed, it seems simpler than the other one (although the initial approach was also simple to grasp and elegant; too bad it didn't work for all cases).

My only comment/question here would be this: the changes to qemuProcessReconnect mention that they're being done in order to prevent the XML "corruption" when there's a shutdown detected during initialization, but (and I may be wrong here) it seems to me that this new code path could be reached even after the initialization, couldn't it? Either way, this is not really important and doesn't affect the patch (aside from the possible amends to the comments being added), and it's OK if you don't know the answer, too.

My other source of "concern" here was the reference handling for the new pointer, but I couldn't find any potential problems with what you did. Worst case scenario (i.e., if we fail to consider a code path), we will have a "memory leak" during shutdown, which is not the end of the world. The lock/unlock dance is also simple and trivially verifiable.

Otherwise, the patch LGTM and I'm satisfied with the testing you did. Feel free to go ahead and upload it.

Thanks again.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Sergio,

Thanks for taking the time to read and review the patch and test cases!

> this new code path could be reached even after the initialization, couldn't it?

No, this code path should only be reached during initialization, AFAICT.

That is because qemuProcessReconnect() can only be reached from function
calls in the initialization path when the daemon initializes the drivers.

For reference, this is the only call path reaching qemuProcessReconnect().
(I've started with it, and gone back into callers.)

@ src/remote/remote_daemon.c
- main()
- daemonStateInit()
- daemonRunStateInit()
@ src/libvirt.c
- virStateInitialize()
- virStateDriverTab[i]->stateInitialize()
@ src/qemu/qemu_driver.c
- virStateDriver qemuStateDriver.stateInitialize = qemuStateInitialize
- qemuStateInitialize()
@ src/qemu/qemu_process.c
- qemuProcessReconnectAll()
- qemuProcessReconnectHelper()
- qemuProcessReconnect()

> Worst case scenario (i.e., if we fail to consider a code path), we will
> have a "memory leak" during shutdown, which is not the end of the world.

Right, that's comforting. :)

And just to clarify on the code path consideration (so as to provide more
reassurance for the patch, regarding a code path possibly not considered):

There should be only one code path leading to qemuProcessReconnect() (above),
and fortunately the points to inc/dec references are straightforward there:

The thread is created (1) either successfully or fails (2), and has a single
return point (3).

So, if the reference count is incremented right before thread creation (1),
there is only 2 code sites to decrement it: on thread creation failure (2)
(since the function doesn't run, its return point "dec" doesn't run either),
and thread creation success: the function runs, so "dec" in return point (3).

> Otherwise, the patch LGTM and I'm satisfied with the testing you did.
> Feel free to go ahead and upload it.

Ok, cool. I think the clarifications above should address the two points
you brought up, so I'll continue and rebase and upload it.

Thanks again!

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

SRU team: this is on hold to double check the library symbols (thanks, Robie!), please do not accept yet.

Revision history for this message
Robie Basak (racb) wrote :

Mauricio identified that there are many symbols added in libvirt-daemon-driver-qemu.so, due to the inclusion of the RPC archives, associated with gcc/ld export-dynamic in the build (which includes all symbols, not just actually used). On consultation with both Mauricio and Sergio, we concluded that this is OK, and not worth the effort and risk of further changes to suppress them.

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Mauricio, or anyone else affected,

Accepted libvirt into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/6.0.0-0ubuntu8.20 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: verification-needed-focal
removed: verification-failed-focal
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Download full text (6.2 KiB)

Verification done on focal-proposed, following comments 23, 24, 25, 26.

Including in this comment a few key snippets from each test/comment.

---
Environment
---

LXD virtual machine

 lxc launch --vm ubuntu:focal lp2059272-focal
 lxc exec lp2059272-focal -- su - ubuntu

Enable proposed & debug symbols

 cat <<EOF | sudo tee /etc/apt/sources.list.d/proposed.list
 deb http://archive.ubuntu.com/ubuntu focal-proposed main universe
 deb http://ddebs.ubuntu.com focal-proposed main universe
 EOF

 cat <<EOF | sudo tee /etc/apt/preferences.d/proposed
 Package: *
 Pin: release a=focal-proposed
 Pin-Priority: 400
 EOF

 sudo apt install --yes --no-install-recommends gdb qemu-system-x86 ubuntu-dbgsym-keyring
 sudo apt update
 sudo apt install --yes --no-install-recommends -t focal-proposed libvirt{0,-daemon{,-driver-qemu,-system}}{,-dbgsym} libvirt-clients

 $ apt-cache policy libvirt-daemon-driver-qemu
 libvirt-daemon-driver-qemu:
   Installed: 6.0.0-0ubuntu8.20
   Candidate: 6.0.0-0ubuntu8.20
   Version table:
  *** 6.0.0-0ubuntu8.20 400
  400 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
  100 /var/lib/dpkg/status
      6.0.0-0ubuntu8.19 500
  500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
  500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
      6.0.0-0ubuntu8 500
  500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

 newgrp libvirt # or logout/login

Libvirtd debug logging

 cat <<-EOF | sudo tee -a /etc/libvirt/libvirtd.conf
 log_filters="1:qemu 1:libvirt"
 log_outputs="3:syslog:libvirtd 1:file:/var/log/libvirt/libvirtd-debug.log"
 EOF

---
Steps with test packages on Focal (normal restarts)
---

 <...>
 for SLEEP in $(seq 0.1 0.1 2.0); do
 <...>

All VMs are still managed by libvirt:

 $ virsh list
  Id Name State
 ----------------------------
  1 test-vm-1 running
  2 test-vm-2 running
  3 test-vm-3 running
  4 test-vm-4 running
  5 test-vm-5 running
  6 test-vm-6 running
  7 test-vm-7 running
  8 test-vm-8 running
  9 test-vm-9 running
  10 test-vm-10 running

---
Steps with test packages on Focal (shutdown-on-init)
---

Scenario 1) Shutdown wins race against XML update (ie, shutdown happens first)

<...>

Now, let the qemuProcessReconnect thread continue, it will not update the XML file,
because 'quit' is set (ie, shutdown in progress)

 (gdb) t 20
 (gdb) p ((virNetDaemonPtr)anyobj)->quit
 $2 = true

 $ ls -l /run/libvirt/qemu/test-vm.xml
 -rw------- 1 root root 10189 Apr 24 12:02 /run/libvirt/qemu/test-vm.xml

 (gdb) c &

 $ ls -l /run/libvirt/qemu/test-vm.xml
 -rw------- 1 root root 10189 Apr 24 12:02 /run/libvirt/qemu/test-vm.xml

 <...>

 $ sudo grep 'Leaving the update of .* domain status XML' /var/log/libvirt/libvirtd-debug.log
 2024-04-24 12:08:40.054+0000: 3770: info : qemuProcessReconnect:8157 : Leaving the update of 'test-vm' domain status XML for the next initialization (shutdown detected on this initialization).

 <...>

 $ sudo grep -e '<domstatus' -e '<domain' -e 'monitor path' /run/libvirt/qemu/test-vm.xml
 <domstatus state='running' reason='booted' pid='3726'>
   <monitor pat...

Read more...

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

The packages in focal-proposed have also been verified successfully in real-world/non-synthetic tests by one of our Ubuntu Pro support customers.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 6.0.0-0ubuntu8.20

---------------
libvirt (6.0.0-0ubuntu8.20) focal; urgency=medium

  * d/p/u/lp2059272-2-qemu-Wait-qemuProcessReconnect-threads-in-cleanup.patch:
    Remove patch. It is not possible to wait for qemuProcessReconnect()
    in cleanup: it talks to QEMU monitor, which blocks on replies from
    event loop, but it's already stopped at cleanup, delaying shutdown.

  * d/p/u/lp2059272-2-qemu-Do-not-save-XML-in-shutdown-on-init.patch:
    Instead of waiting at cleanup for threads which might be blocked
    thus would _not even reach_ the function that causes the problem,
    just skip that function if it is _actually reached_ while daemon
    shutdown is in progress. That is in the init path and would just
    run again anyway the next time libvirtd is started (LP: #2059272)

  * NOTE: This package contains the changes from 6.0.0-0ubuntu8.18 and
    6.0.0-0ubuntu8.17 in focal-proposed (with symbolic changelog entry)
    superseded by 6.0.0-0ubuntu8.19 in focal-security.

libvirt (6.0.0-0ubuntu8.20~ubuntu8.18) focal; urgency=medium

  * d/p/u/lp2059272-1-qemu-Fix-potential-crash-during-driver-cleanup.patch:
    On QEMU driver cleanup, release (stop) the worker thread pool _first_,
    before other data used by possibly running worker threads (LP: #2059272)

  * d/p/u/lp2059272-2-qemu-Wait-qemuProcessReconnect-threads-in-cleanup.patch:
    On QEMU driver cleanup, also wait for qemuProcessReconnect() threads,
    as they are independent of the worker thread pool. (LP: #2059272)
    Focal needs this as it has no .stateShutdownWait() callback yet.
    (The wait timeout is set in LIBVIRT_QEMU_STATE_CLEANUP_WAIT_TIMEOUT:
     -1 = wait indefinitely; 0 = do not wait; N = wait up to N seconds.)

libvirt (6.0.0-0ubuntu8.20~ubuntu8.17) focal; urgency=medium

  * d/p/u/lp-1989078-*.patch: allow arm64 to lock its OVMF/AAVMF resources
    (LP: #1989078)

 -- Mauricio Faria de Oliveira <email address hidden> Tue, 16 Apr 2024 14:20:13 -0300

Changed in libvirt (Ubuntu Focal):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.