memory of libvirtd process grows steadily over time

Bug #2024114 reported by Rafael Lopez
This bug affects 1 person
Affects            Status        Importance  Assigned to    Milestone
libvirt (Ubuntu)   Fix Released  Medium      Rafael Lopez
Jammy              Fix Released  Medium      Rafael Lopez
Kinetic            Won't Fix     Medium      Rafael Lopez

Bug Description

[ Impact ]
Memory leak causing a growing memory footprint in long-running libvirt processes. In a fairly busy OpenStack environment, this showed steady linear growth, reaching ~15GB after a couple of months.
This would impact many OpenStack deployments and anyone else using libvirt with particular PCI devices (VPD capable), forcing them to restart libvirt regularly to reset the memory consumption. This memory leak has so far only been observed in a hardware (metal) environment with Mellanox devices, but presumably occurs wherever a VPD capable device exists.

[ Test Plan ]
It is only possible to reproduce this on certain hardware, namely hosts with PCI cards that present VPD (Vital Product Data). For example, this was noticed on a host where libvirt was obtaining data from a Mellanox card that presented VPD data. You can tell whether a PCI device presents VPD data by looking at the sysfs entry /sys/bus/pci/devices/{address}/vpd (a quick check is sketched after the example below), or from `lshw` if you see 'vpd' in the list of capabilities, for example:
   *-network:0
        description: Ethernet interface
        product: MT2892 Family [ConnectX-6 Dx]
        vendor: Mellanox Technologies
        ...snip...
        capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
        ...snip...
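
Alternatively, VPD capable devices can be listed directly from the shell by checking for a vpd attribute in sysfs (a quick sketch, relying on the standard sysfs layout):

#!/bin/sh
# List PCI devices that expose a VPD attribute in sysfs
for dev in /sys/bus/pci/devices/*; do
    if [ -e "$dev/vpd" ]; then
        echo "VPD capable: $(basename "$dev")"
    fi
done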

1. On a server known to have a VPD capable device, monitoring the memory consumption of libvirtd over time shows whether the issue is present and whether it has been fixed. Before the fix there is clear linear growth, which should flatten out after the patch is applied (a monitoring sketch follows the test script below).

2. Another simple test that can be done:
Run "virsh nodedev-list" 1000 times and check the memory occupied by the libvirtd service.

#!/bin/sh
systemctl start libvirtd
systemctl status libvirtd
i=0
while [ $i -ne 1000 ]
do
    virsh nodedev-list
    i=$(($i+1))
    echo "$i"
done
systemctl status libvirtd

and watch the "Memory:" field grow (or not, if the fix is there).
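
For the longer-running check described in step 1, the resident memory of libvirtd can also be sampled periodically, for example with a small script like the following (a sketch, assuming the monolithic libvirtd daemon and a GNU userland):

#!/bin/sh
# Log libvirtd resident memory (VmRSS) once a minute; stop with Ctrl-C
while true; do
    pid=$(pgrep -o -x libvirtd) || { echo "libvirtd not running" >&2; exit 1; }
    printf '%s ' "$(date -Is)"
    awk '/^VmRSS/ {print $2, $3}' "/proc/$pid/status"
    sleep 60
done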

[ Where problems could occur ]
The functions changed are only called in environments where VPD capable devices exist, and the patch adjusts pointers and the contents of data structures describing VPD capable PCI devices found by libvirt.
Problems would therefore be confined to hosts where VPD capable devices are present, and could show up as garbage data about a device, null pointers where there should be data, or segfaults.

[ Other Info ]
The backport is derived from an upstream fix:
https://github.com/libvirt/libvirt/commit/64d32118540aca3d42bc5ee21c8b780cafe04bfa
This patch is missing from Jammy and Kinetic, but present in Lunar+. The same issue has not been observed in a similar environment running Focal.

Running libvirtd under valgrind produces stacks like the following:

==3411871== 7,559,541 (407,160 direct, 7,152,381 indirect) bytes in 16,965 blocks are definitely lost in loss record 2,846 of 2,846
==3411871== at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3411871== by 0x4D53C50: g_malloc0 (gmem.c:161)
==3411871== by 0x49A2832: virPCIVPDParse (virpcivpd.c:672)
==3411871== by 0x4983BD8: virPCIDeviceGetVPD (virpci.c:2694)
==3411871== by 0x4A2CEB7: UnknownInlinedFun (node_device_conf.c:3032)
==3411871== by 0x4A2CEB7: virNodeDeviceGetPCIDynamicCaps (node_device_conf.c:3065)
==3411871== by 0x4A2D03D: virNodeDeviceUpdateCaps (node_device_conf.c:2636)
==3411871== by 0xFC8CD35: nodeDeviceGetXMLDesc (node_device_driver.c:370)
==3411871== by 0x4B7E9D1: virNodeDeviceGetXMLDesc (libvirt-nodedev.c:275)
==3411871== by 0x15519A: UnknownInlinedFun (remote_daemon_dispatch_stubs.h:15507)
==3411871== by 0x15519A: remoteDispatchNodeDeviceGetXMLDescHelper.lto_priv.0 (remote_daemon_dispatch_stubs.h:15484)
==3411871== by 0x4A59785: UnknownInlinedFun (virnetserverprogram.c:428)
==3411871== by 0x4A59785: virNetServerProgramDispatch (virnetserverprogram.c:302)
==3411871== by 0x4A60067: UnknownInlinedFun (virnetserver.c:140)
==3411871== by 0x4A60067: virNetServerHandleJob (virnetserver.c:160)
==3411871== by 0x499B982: virThreadPoolWorker (virthreadpool.c:164)
==3411871== by 0x499A4D8: virThreadHelper (virthread.c:241)
==3411871== by 0x514CB42: start_thread (pthread_create.c:442)
==3411871== by 0x51DDBB3: clone (clone.S:100)
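
For reference, stacks like the above can be gathered by stopping the service and launching the daemon under valgrind by hand (a sketch; the daemon path and the socket-activated units may differ between releases):

#!/bin/sh
# Stop the service (and its activation sockets) and run libvirtd under valgrind
systemctl stop libvirtd.service libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket
valgrind --leak-check=full --show-leak-kinds=definite \
    --log-file=/tmp/libvirtd-valgrind.log /usr/sbin/libvirtd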

Changed in libvirt (Ubuntu):
importance: Undecided → Medium
status: New → In Progress
Changed in libvirt (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
description: updated
Changed in libvirt (Ubuntu Jammy):
assignee: nobody → Rafael Lopez (rafael.lopez)
description: updated
Revision history for this message
Rafael Lopez (rafael.lopez) wrote :

Attached debdiff for jammy

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "lp-2024114-pcivpd-memleak-jammy.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Revision history for this message
Paride Legovini (paride) wrote (last edit ):

Hello Rafael, this is a well done SRU which is basically ready for sponsoring; however, I have a couple of questions before proceeding:

1. Can you confirm you have access to the right hardware and will perform the SRU verification?

2. The DEP-3 headers of the patch have "Origin: upstream", but the bug description says that "The backport is derived from an upstream fix". In DEP-3 language "backport" means that an upstream patch/commit had to be modified to cleanly apply. Is this a clean cherry-pick (Origin: upstream) or a backport (Origin: backport)?

3. By checking the libvirt versions it looks like this bug is Fix Released in >= Kinetic. Can you please confirm this?

4. Was the fix already tested by building the package in a PPA?

Thanks!

Revision history for this message
Rafael Lopez (rafael.lopez) wrote :

Hi Paride, thanks for the quick response. Please find answers in line below:

> 1. Can you confirm you have access to the right hardware and will perform the SRU verification?
Yes, this is possible.

> 2. The DEP-3 headers of the patch have "Origin: upstream", but the bug description says that "The backport is derived from an upstream fix". In DEP-3 language "backport" means that an upstream patch/commit had to be modified to cleanly apply. Is this a clean cherry-pick (Origin: upstream) or a backport (Origin: backport)?
Sorry for the ambiguity. There was no modification to the actual code; however, the line numbers where code was inserted and removed have some offset due to new code in the file (i.e. quilt pushed correctly with fuzz). I assume this means it is technically a backport? Let me know and I will update the patch accordingly.
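
For reference, a DEP-3 header for a backported change might look like the following (illustrative only; the actual headers in the debdiff may differ):

Description: Avoid memleak in virNodeDeviceGetPCIVPDDynamicCap
Origin: backport, https://github.com/libvirt/libvirt/commit/64d32118540aca3d42bc5ee21c8b780cafe04bfa
Bug-Ubuntu: https://bugs.launchpad.net/bugs/2024114
Last-Update: 2023-06-20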

> 3. By checking the libvirt versions it looks like this bug is Fix Released in >= Kinetic. Can you please confirm this?
Upstream, the fix was backported to 8.10+. I double checked and the fix is not present in kinetic, but is in lunar. (Note, focal does not exhibit this issue). I will update the bug and prepare a patch for kinetic accordingly.

> 4. Was the fix already tested by building the package in a PPA?
Yes this has been tested using a PPA and fixes the particular leak we observed.

Changed in libvirt (Ubuntu Kinetic):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Rafael Lopez (rafael.lopez)
description: updated
description: updated
Jeremy Bícha (jbicha)
Changed in libvirt (Ubuntu):
status: In Progress → Fix Released
Changed in libvirt (Ubuntu Kinetic):
status: In Progress → Triaged
Revision history for this message
Jeremy Bícha (jbicha) wrote :

Because this issue is fixed in the stable Lunar release, it is not required that it also be fixed in Kinetic. See the exceptions at https://wiki.ubuntu.com/StableReleaseUpdates#Newer_Releases

I have uploaded your Jammy debdiff to the unapproved queue for Jammy. It must be manually reviewed by a member of the SRU Team before it will be available as a proposed update.

https://launchpad.net/ubuntu/jammy/+queue?queue_state=1

I am unsubscribing ubuntu-sponsors now. Please feel free to resubscribe if you have something else that needs to be sponsored.

tags: added: se-sponsor-halves
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

https://bugzilla.redhat.com/show_bug.cgi?id=2143235#c4 has a nice reproducer which could be used in the test case, if you can confirm it works as advertised.

Basically:

1. Run "virsh nodedev-list" 1000 times and check the memory occupied by the virtnodedevd service; in that report, the memory occupied increased from 13.9M to 24.0M after 8 minutes.

#!/bin/sh
systemctl start virtnodedevd
systemctl status virtnodedevd
i=0
while [ $i -ne 1000 ]
do
    virsh nodedev-list
    i=$(($i+1))
    echo "$i"
done
systemctl status virtnodedevd

and watch the "Memory:" field grow (or not, if the fix is there).

Changed in libvirt (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Andreas Hasenack (ahasenack) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted libvirt into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/8.0.0-1ubuntu7.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

If you still want to provide a kinetic fix (and verify it), I can sponsor it. If not, please switch the kinetic task to "wontfix".

Revision history for this message
Rafael Lopez (rafael.lopez) wrote :

Hi Andreas, I will check to see if we can do that simple test for verification, though it uses 'virtnodedevd', which is not included in our libvirt packages. It looks like for Jammy at least we still build that functionality into the main daemon, so we may be able to run the same checks against 'systemctl status libvirtd'.
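
A quick way to check whether a host runs the modular virtnodedevd daemon or the monolithic libvirtd (a sketch; unit names depend on how libvirt is packaged):

#!/bin/sh
# Prefer the modular node device daemon if its unit exists, else fall back to libvirtd
if systemctl list-unit-files | grep -q '^virtnodedevd\.service'; then
    systemctl status virtnodedevd
else
    systemctl status libvirtd
fi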

Changed in libvirt (Ubuntu Kinetic):
status: Triaged → Won't Fix
description: updated
Revision history for this message
Rafael Lopez (rafael.lopez) wrote (last edit ):

Verification of package in -proposed is complete.

Using the current version (8.0.0-1ubuntu7.5), we ran the test and confirmed the bug:
- Restart libvirtd
- Initial consumption
ubuntu@machine-1:~$ systemctl status libvirtd
...
Memory: 21.6M

- Ran loop test
i=0; while [ $i -ne 1000 ]; do virsh nodedev-list; i=$(($i+1)); echo "$i"; done

- Post consumption
ubuntu@machine-1:~$ systemctl status libvirtd
...
Memory: 31.5M

- Increase was ~10M.
---------------------------------------------
Upgraded to libvirt from -proposed (8.0.0-1ubuntu7.6) to verify the fix.

root@machine-1:~# dpkg -l | grep -i libvirt
ii libvirt-clients 8.0.0-1ubuntu7.6 amd64 Programs for the libvirt library
ii libvirt-daemon 8.0.0-1ubuntu7.6 amd64 Virtualization daemon
ii libvirt-daemon-config-network 8.0.0-1ubuntu7.6 all Libvirt daemon configuration files (default network)
ii libvirt-daemon-config-nwfilter 8.0.0-1ubuntu7.6 all Libvirt daemon configuration files (default network filters)
ii libvirt-daemon-driver-qemu 8.0.0-1ubuntu7.6 amd64 Virtualization daemon QEMU connection driver
ii libvirt-daemon-system 8.0.0-1ubuntu7.6 amd64 Libvirt daemon configuration files
ii libvirt-daemon-system-systemd 8.0.0-1ubuntu7.6 all Libvirt daemon configuration files (systemd)
ii libvirt0:amd64 8.0.0-1ubuntu7.6 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 3:25.2.0-0ubuntu1 all OpenStack Compute - compute node libvirt support
ii python3-libvirt 8.0.0-1build1 amd64 libvirt Python 3 bindings

- Restart libvirtd
- Initial consumption
root@machine-1:~# systemctl status libvirtd
...
Memory: 23.7M

- Ran loop test
i=0; while [ $i -ne 1000 ]; do virsh nodedev-list; i=$(($i+1)); echo "$i"; done

- Post consumption
root@machine-1:~# systemctl status libvirtd
...
Memory: 25.4M

- Increase was ~1.7M

tags: added: verification-done-jammy
removed: verification-needed verification-needed-jammy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 8.0.0-1ubuntu7.6

---------------
libvirt (8.0.0-1ubuntu7.6) jammy; urgency=medium

  * d/p/u/lp-2024114-Avoid-memleak-in-virNodeDeviceGetPCIVPDDynamicCap.patch:
    fix memory leak PCI devices with VPD data (LP: #2024114)

 -- Rafael Lopez <email address hidden> Tue, 20 Jun 2023 11:54:15 +1000

Changed in libvirt (Ubuntu Jammy):
status: Fix Committed → Fix Released
Revision history for this message
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
