Activity log for bug #2024114

Date Who What changed Old value New value Message
2023-06-16 00:53:11 Rafael Lopez bug added bug
2023-06-16 00:53:39 Rafael Lopez libvirt (Ubuntu): importance Undecided Medium
2023-06-16 00:53:54 Rafael Lopez libvirt (Ubuntu): status New In Progress
2023-06-16 00:55:04 Rafael Lopez nominated for series Ubuntu Jammy
2023-06-16 00:55:04 Rafael Lopez bug task added libvirt (Ubuntu Jammy)
2023-06-16 00:55:12 Rafael Lopez libvirt (Ubuntu Jammy): status New In Progress
2023-06-16 00:55:14 Rafael Lopez libvirt (Ubuntu Jammy): importance Undecided Medium
2023-06-16 00:55:47 Rafael Lopez description updated (the old value differs from the new value below only in lacking "Only seen on Jammy so far."):

Memory grows over time, likely due to a memory leak in PCI data collection. Can only reproduce on hardware environments; may be particular to specific PCI devices that supply VPD data. Only seen on Jammy so far.

Valgrind stacks after a couple of hours:

==3411871== 7,559,541 (407,160 direct, 7,152,381 indirect) bytes in 16,965 blocks are definitely lost in loss record 2,846 of 2,846
==3411871==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3411871==    by 0x4D53C50: g_malloc0 (gmem.c:161)
==3411871==    by 0x49A2832: virPCIVPDParse (virpcivpd.c:672)
==3411871==    by 0x4983BD8: virPCIDeviceGetVPD (virpci.c:2694)
==3411871==    by 0x4A2CEB7: UnknownInlinedFun (node_device_conf.c:3032)
==3411871==    by 0x4A2CEB7: virNodeDeviceGetPCIDynamicCaps (node_device_conf.c:3065)
==3411871==    by 0x4A2D03D: virNodeDeviceUpdateCaps (node_device_conf.c:2636)
==3411871==    by 0xFC8CD35: nodeDeviceGetXMLDesc (node_device_driver.c:370)
==3411871==    by 0x4B7E9D1: virNodeDeviceGetXMLDesc (libvirt-nodedev.c:275)
==3411871==    by 0x15519A: UnknownInlinedFun (remote_daemon_dispatch_stubs.h:15507)
==3411871==    by 0x15519A: remoteDispatchNodeDeviceGetXMLDescHelper.lto_priv.0 (remote_daemon_dispatch_stubs.h:15484)
==3411871==    by 0x4A59785: UnknownInlinedFun (virnetserverprogram.c:428)
==3411871==    by 0x4A59785: virNetServerProgramDispatch (virnetserverprogram.c:302)
==3411871==    by 0x4A60067: UnknownInlinedFun (virnetserver.c:140)
==3411871==    by 0x4A60067: virNetServerHandleJob (virnetserver.c:160)
==3411871==    by 0x499B982: virThreadPoolWorker (virthreadpool.c:164)
==3411871==    by 0x499A4D8: virThreadHelper (virthread.c:241)
==3411871==    by 0x514CB42: start_thread (pthread_create.c:442)
==3411871==    by 0x51DDBB3: clone (clone.S:100)

==3411871== 1,608,514 (134,160 direct, 1,474,354 indirect) bytes in 5,590 blocks are definitely lost in loss record 2,844 of 2,846
==3411871==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3411871==    by 0x4D53C50: g_malloc0 (gmem.c:161)
==3411871==    by 0x49A2832: virPCIVPDParse (virpcivpd.c:672)
==3411871==    by 0x4983BD8: virPCIDeviceGetVPD (virpci.c:2694)
==3411871==    by 0x4A2CEB7: UnknownInlinedFun (node_device_conf.c:3032)
==3411871==    by 0x4A2CEB7: virNodeDeviceGetPCIDynamicCaps (node_device_conf.c:3065)
==3411871==    by 0x4A2D03D: virNodeDeviceUpdateCaps (node_device_conf.c:2636)
==3411871==    by 0x4A2D075: virNodeDeviceCapsListExport (node_device_conf.c:2707)
==3411871==    by 0xFC8D10F: nodeDeviceListCaps (node_device_driver.c:459)
==3411871==    by 0x4B7EE68: virNodeDeviceListCaps (libvirt-nodedev.c:402)
==3411871==    by 0x1554FE: UnknownInlinedFun (remote_daemon_dispatch_stubs.h:15688)
==3411871==    by 0x1554FE: remoteDispatchNodeDeviceListCapsHelper.lto_priv.0 (remote_daemon_dispatch_stubs.h:15655)
==3411871==    by 0x4A59785: UnknownInlinedFun (virnetserverprogram.c:428)
==3411871==    by 0x4A59785: virNetServerProgramDispatch (virnetserverprogram.c:302)
==3411871==    by 0x4A60067: UnknownInlinedFun (virnetserver.c:140)
==3411871==    by 0x4A60067: virNetServerHandleJob (virnetserver.c:160)
==3411871==    by 0x499B982: virThreadPoolWorker (virthreadpool.c:164)
==3411871==    by 0x499A4D8: virThreadHelper (virthread.c:241)
==3411871==    by 0x514CB42: start_thread (pthread_create.c:442)
==3411871==    by 0x51DDBB3: clone (clone.S:100)

Possibly fixed by:
https://github.com/libvirt/libvirt/commit/64d32118540aca3d42bc5ee21c8b780cafe04bfa.patch
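For reference, a minimal sketch of capturing a report like the above under valgrind. This assumes the daemon binary is at /usr/sbin/libvirtd (the usual Ubuntu path) and that the service can be stopped on the affected host; the log path is illustrative:

    #!/bin/sh
    # Stop the supervised daemon so valgrind can run it in the foreground.
    systemctl stop libvirtd

    # --leak-check=full prints "definitely lost" records like the ones above;
    # leave it running for a couple of hours, then stop it and read the log.
    valgrind --leak-check=full --log-file=/tmp/libvirtd-valgrind.log /usr/sbin/libvirtd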
2023-06-16 01:40:02 Rafael Lopez libvirt (Ubuntu Jammy): assignee Rafael Lopez (rafael.lopez)
2023-06-20 00:02:40 Rafael Lopez description rewritten as an SRU template (old value: the description above); new value:

[ Impact ]

Memory leak causing growing memory footprints in long-running libvirt processes. In a fairly busy OpenStack environment, this showed steady linear growth up to ~15GB after a couple of months. This would impact many OpenStack deployments and anyone else using libvirt with particular (VPD-capable) PCI devices, forcing them to restart libvirt regularly to reset its memory consumption. This memory leak has only been observed so far in a hardware (metal) environment with Mellanox devices, but ostensibly occurs wherever a VPD-capable device exists.

[ Test Plan ]

It is only possible to reproduce this on certain hardware, seemingly hosts that have PCI cards that present VPD (Vital Product Data). For example, this was noticed on a host where libvirt was obtaining data from a Mellanox card that presented VPD data. You can tell whether a PCI device presents VPD data by looking at the sysfs entry /sys/bus/pci/devices/{address}/vpd, or from `lshw` if you see 'vpd' in the list of capabilities, for example:

   *-network:0
        description: Ethernet interface
        product: MT2892 Family [ConnectX-6 Dx]
        vendor: Mellanox Technologies
        ...snip...
        capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
        ...snip...

It is easy to confirm by running libvirt under valgrind, which will show a stack like the first one quoted above (calloc via virPCIVPDParse, "definitely lost").

Knowing the server has a VPD-capable device, monitoring the memory consumption over time can show whether the issue is present, as well as when it is fixed. Before the fix there is clear linear growth, which should flatten out after applying the patch.

[ Where problems could occur ]

The functions changed are only called in environments where VPD devices exist, and the patch adjusts pointers and contents of data structures related to VPD-capable PCI devices found by libvirt. Problems would therefore be confined to environments where VPD-capable devices are present, and could show up as garbage data about a device, a null pointer where there should be data, or segfaults.

[ Other Info ]

The backport is derived from an upstream fix:
https://github.com/libvirt/libvirt/commit/64d32118540aca3d42bc5ee21c8b780cafe04bfa
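As a quick script for the sysfs check described in the Test Plan above (a sketch; it relies only on the /sys/bus/pci/devices/{address}/vpd layout already mentioned there):

    #!/bin/sh
    # Print every PCI device that exposes a VPD attribute; any hit means the
    # VPD-parsing code path in libvirt can be exercised on this host.
    for d in /sys/bus/pci/devices/*; do
        [ -e "$d/vpd" ] && echo "VPD-capable device: ${d##*/}"
    done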
2023-06-20 02:06:34 Rafael Lopez attachment added lp-2024114-pcivpd-memleak-jammy.debdiff https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/2024114/+attachment/5680882/+files/lp-2024114-pcivpd-memleak-jammy.debdiff
2023-06-20 02:08:20 Rafael Lopez bug added subscriber sts-sponsors (DEACTIVATED; use se-sponsors)
2023-06-20 02:08:26 Rafael Lopez removed subscriber sts-sponsors (DEACTIVATED; use se-sponsors)
2023-06-20 02:09:01 Rafael Lopez bug added subscriber Support Engineering Sponsors
2023-06-20 04:14:58 Ubuntu Foundations Team Bug Bot tags patch
2023-06-20 04:15:02 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors
2023-06-20 23:53:58 Rafael Lopez nominated for series Ubuntu Kinetic
2023-06-20 23:53:58 Rafael Lopez bug task added libvirt (Ubuntu Kinetic)
2023-06-20 23:54:05 Rafael Lopez libvirt (Ubuntu Kinetic): status New In Progress
2023-06-20 23:54:08 Rafael Lopez libvirt (Ubuntu Kinetic): importance Undecided Medium
2023-06-20 23:54:12 Rafael Lopez libvirt (Ubuntu Kinetic): assignee Rafael Lopez (rafael.lopez)
2023-06-20 23:56:24 Rafael Lopez description updated: otherwise unchanged, the [ Other Info ] section now reads:

[ Other Info ]

The backport is derived from an upstream fix:
https://github.com/libvirt/libvirt/commit/64d32118540aca3d42bc5ee21c8b780cafe04bfa
This commit is missing from Jammy and Kinetic, but present in Lunar+. The same issue has not been observed in a similar environment running Focal.
2023-06-20 23:56:41 Rafael Lopez description updated: in [ Other Info ], "This commit is missing from Jammy and Kinetic" now reads "This patch is missing from Jammy and Kinetic".
2023-06-21 13:06:51 Junien F bug added subscriber The Canonical Sysadmins
2023-06-21 18:34:19 Jeremy Bícha libvirt (Ubuntu): status In Progress Fix Released
2023-06-21 18:34:59 Jeremy Bícha libvirt (Ubuntu Kinetic): status In Progress Triaged
2023-06-21 18:42:13 Jeremy Bícha removed subscriber Ubuntu Sponsors
2023-06-21 18:42:27 Jeremy Bícha bug added subscriber Jeremy Bícha
2023-06-22 12:33:10 Heitor Alves de Siqueira removed subscriber Support Engineering Sponsors
2023-06-22 12:33:20 Heitor Alves de Siqueira tags patch → patch se-sponsor-halves
2023-06-30 18:11:51 Andreas Hasenack bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=2143235
2023-06-30 18:12:30 Andreas Hasenack libvirt (Ubuntu Jammy): status In Progress Fix Committed
2023-06-30 18:12:32 Andreas Hasenack bug added subscriber Ubuntu Stable Release Updates Team
2023-06-30 18:12:36 Andreas Hasenack bug added subscriber SRU Verification
2023-06-30 18:12:40 Andreas Hasenack tags patch se-sponsor-halves → patch se-sponsor-halves verification-needed verification-needed-jammy
2023-07-04 02:51:17 Rafael Lopez libvirt (Ubuntu Kinetic): status Triaged Won't Fix
2023-07-11 23:46:06 Rafael Lopez description updated: the [ Test Plan ] now lists two numbered tests (its introduction and lshw example are unchanged), and the valgrind note moved to the end of the description; the changed portions read:

1. Knowing the server has a VPD-capable device, monitoring the memory consumption over time can show whether the issue is present, as well as when it is fixed. Before the fix there is clear linear growth, which should flatten out after applying the patch.

2. Another simple test: run "virsh nodedev-list" 1000 times and check the memory occupied by the libvirtd service:

   #!/bin/sh
   systemctl start libvirtd
   systemctl status libvirtd
   # Query the node device list 1000 times; each call makes libvirtd
   # re-read node device data, and before the fix the parsed VPD leaks.
   i=0
   while [ $i -ne 1000 ]; do
       virsh nodedev-list
       i=$(($i+1))
       echo "$i"
   done
   systemctl status libvirtd

   ...and watch the "Memory:" field grow (or not, if the fix is there).

Running libvirt under valgrind will show stacks like the following:

==3411871== 7,559,541 (407,160 direct, 7,152,381 indirect) bytes in 16,965 blocks are definitely lost in loss record 2,846 of 2,846
==3411871==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==3411871==    by 0x4D53C50: g_malloc0 (gmem.c:161)
==3411871==    by 0x49A2832: virPCIVPDParse (virpcivpd.c:672)
...snip...
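To complement test 1 above, memory consumption over time could be sampled with a sketch along these lines (the ps/pidof sampling approach is illustrative, not part of the bug's own test plan):

    #!/bin/sh
    # Sample libvirtd's resident set size once a minute; on an affected host
    # the value grows roughly linearly while nodedev queries keep arriving.
    while true; do
        rss_kb=$(ps -o rss= -p "$(pidof libvirtd)")
        echo "$(date +%FT%T) libvirtd RSS: ${rss_kb} kB"
        sleep 60
    done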
2023-07-12 00:16:30 Rafael Lopez tags patch se-sponsor-halves verification-needed verification-needed-jammy → patch se-sponsor-halves verification-done-jammy
2023-07-12 10:39:04 Robie Basak removed subscriber Ubuntu Stable Release Updates Team
2023-07-12 10:39:03 Launchpad Janitor libvirt (Ubuntu Jammy): status Fix Committed Fix Released