libvirtd restart due to assertion failure in libnl

Bug #1277157 reported by Lee T. Schermerhorn
This bug affects 1 person
Affects           Status        Importance  Assigned to    Milestone
netcf (Ubuntu)    Fix Released  Undecided   Unassigned
  Precise         Fix Released  High        Chris J Arges
  Quantal         Fix Released  High        Chris J Arges

Bug Description

SRU Justification:
[Impact]
 * When starting and stopping a large number of domains, libvirtd occasionally restarts due to an assertion failure in libnl. This happens because netcf provides caches in netlink_init (via libnl) that are not thread safe.

[Test Case]
 * Define a large number of domains (32), each with 4 vCPUs and 8 GB of memory. In a loop, start these domains, allow them to boot, and destroy them in parallel.

[Regression Potential]
 * This is an upstream patch and this code is already present in saucy and beyond.

--

While running a multiple-domain start/destroy loop, libvirtd 1.1.1 occasionally restarts due to an assertion failure in libnl.

This affects both precise/quantal versions of netcf.

Distribution version: Ubuntu 12.04.2 LTS
Kernel: 3.5.0-41-generic #64 SMP Mon Dec 9 20:35:04 UTC 2013 x86_64
Libvirt: libvirt_1.1.1-0ubuntu8.2~cloud1 from cloud repository.
Qemu-kvm: qemu-kvm_1.0+noroms-0ubuntu14.12 [qemu 1.5 not yet tested]
libnl: libnl-3-200_3.2.3-2ubuntu2
Platform: HP ProLiant SL390s G7 x86_64; 2 socket x 6 cores/socket x 2 HT/core; 96GB

How to reproduce:

1. Define a number of test domains. E.g., test-nn, nn from 01 .. NN. I use 32

Test domains have 4 vCPUs and 8 GB of memory, and run an Ubuntu 11.04 image. The network is the libvirt default virtual network. Domain MAC addresses are based on the test number nn, and DHCP serves fixed IP addresses based on the MAC address.

2. In a loop:

2a. start the NN domains serially -- waiting for the "virsh start" command to complete before starting next domain.
2b. sleep 20sec -- give domains some time to boot. Test doesn't check that domains have completed booting.
2c. destroy the NN domains "in parallel" -- with "virsh destroy test-$nn &"
2d. sleep 15secs.

Sleep times are more or less arbitrary. The next pass of starts usually begins before the previous pass of destroys completes, but I've never seen an error such as "domain <name> already running".
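
For anyone wanting to repeat the test, the loop in steps 2a-2d can be sketched roughly as below. This is not the reporter's actual script. The function names, the environment variables (VIRSH, PREFIX, NN, BOOT_SLEEP, SETTLE_SLEEP) and the optional pass limit are assumptions made for illustration. Only the "virsh start" and "virsh destroy ... &" calls, the domain count and the sleep times come from the description. The domains are assumed to be defined already.

```shell
#!/bin/bash
# Sketch of the start/destroy loop (steps 2a-2d). All names and the pass
# limit are illustrative assumptions; the domains must already be defined.

VIRSH=${VIRSH:-virsh}               # command used to talk to libvirtd
PREFIX=${PREFIX:-test-}             # domains are named test-01 .. test-NN
NN=${NN:-32}                        # number of test domains
BOOT_SLEEP=${BOOT_SLEEP:-20}        # step 2b: time given to the guests to boot
SETTLE_SLEEP=${SETTLE_SLEEP:-15}    # step 2d: pause before the next pass

# Print one domain name per line: test-01, test-02, ...
domain_names() {
    local i
    for i in $(seq 1 "$NN"); do
        printf '%s%02d\n' "$PREFIX" "$i"
    done
}

# Step 2a: start the domains serially; each command completes before the next.
start_all() {
    local dom
    for dom in $(domain_names); do
        $VIRSH start "$dom"
    done
}

# Step 2c: destroy the domains "in parallel" by backgrounding each command.
# There is deliberately no "wait": the next pass may begin before all
# destroys have completed, as noted in the description.
destroy_all() {
    local dom
    for dom in $(domain_names); do
        $VIRSH destroy "$dom" &
    done
}

# Run the loop. $1 is the number of passes; 0 or omitted means run until
# interrupted.
run_loop() {
    local limit=${1:-0} pass=0
    while [ "$limit" -eq 0 ] || [ "$pass" -lt "$limit" ]; do
        pass=$((pass + 1))
        echo "pass $pass"
        start_all
        sleep "$BOOT_SLEEP"
        destroy_all
        sleep "$SETTLE_SLEEP"
    done
}
```

After sourcing the file, "run_loop" runs until interrupted and "run_loop 10" stops after ten passes. Setting VIRSH=echo gives a dry run that only prints the commands.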

What I expected to happen: start-destroy loop runs indefinitely without error.

What happened instead: Eventually, I start seeing errors like:

error: Failed to destroy domain lnvtest-31
error: End of file while reading data: Input/output error
error: One or more references were leaked after disconnect from the hypervisor

and

error: failed to get domain 'lnvtest-12'
error: Domain not found: no domain with matching name 'lnvtest-12'

Checking the libvirt debug log, I see that libvirtd restarted during this iteration, dumping its internal log buffer in the process.

gdb traceback shows:

#3 0x00007f845cafb192 in __GI___assert_fail (assertion=0x7f845c4b89bd "0", file=0x7f845c4b8538 "/build/buildd/libnl3-3.2.3/./lib/object.c",
    line=185, function=0x7f845c4b8668 "nl_object_put") at assert.c:103
#4 0x00007f845c4b4dea in nl_object_put () from /lib/libnl-3.so.200
#5 0x00007f845c4afb92 in nl_cache_remove () from /lib/libnl-3.so.200
#6 0x00007f845c4b4b07 in nl_object_free () from /lib/libnl-3.so.200
#7 0x00007f845c4b4b15 in nl_object_free () from /lib/libnl-3.so.200
#8 0x00007f845c4afb92 in nl_cache_remove () from /lib/libnl-3.so.200
#9 0x00007f845c4afd0b in nl_cache_clear () from /lib/libnl-3.so.200
#10 0x00007f845c4afd3e in nl_cache_free () from /lib/libnl-3.so.200
#11 0x00007f84517f3096 in ?? () from /usr/lib/libnetcf.so.1
#12 0x00007f84517f4220 in ?? () from /usr/lib/libnetcf.so.1
#13 0x00007f84517ef53f in ncf_close () from /usr/lib/libnetcf.so.1
#14 0x00007f8451a0be6f in netcfInterfaceClose (conn=0x7f83f00e7ad0) at /tmp/buildd/libvirt-1.1.1/./src/interface/interface_backend_netcf.c:197
#15 0x00007f845d19d224 in virConnectDispose (obj=0x7f83f00e7ad0) at /tmp/buildd/libvirt-1.1.1/./src/datatypes.c:149
#16 0x00007f845d1246bb in virObjectUnref (anyobj=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./src/util/virobject.c:262
#17 0x00007f845d1a6e2f in virConnectClose (conn=0x7f83f00e7ad0) at /tmp/buildd/libvirt-1.1.1/./src/libvirt.c:1510
#18 0x00007f845db46581 in remoteClientFreeFunc (data=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./daemon/remote.c:683
#19 0x00007f845d20f362 in virNetServerClientDispose (obj=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./src/rpc/virnetserverclient.c:911
#20 0x00007f845d1246bb in virObjectUnref (anyobj=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./src/util/virobject.c:262
#21 0x00007f845d21781d in virNetSocketEventFree (opaque=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./src/rpc/virnetsocket.c:1714
#22 0x00007f845d108b29 in virEventPollCleanupHandles () at /tmp/buildd/libvirt-1.1.1/./src/util/vireventpoll.c:580
#23 0x00007f845d1096e3 in virEventPollRunOnce () at /tmp/buildd/libvirt-1.1.1/./src/util/vireventpoll.c:616
#24 0x00007f845d1086ad in virEventRunDefaultImpl () at /tmp/buildd/libvirt-1.1.1/./src/util/virevent.c:273
#25 0x00007f845d20ecfd in virNetServerRun (srv=0x7f845e818510) at /tmp/buildd/libvirt-1.1.1/./src/rpc/virnetserver.c:1096
#26 0x00007f845db2445e in main (argc=<optimized out>, argv=<optimized out>) at /tmp/buildd/libvirt-1.1.1/./daemon/libvirtd.c

BUG() at line 185 in nl_object_put(); reference count going negative:

171 /**
172  * Release a reference from an object
173  * @arg obj object to release reference from
174  */
175 void nl_object_put(struct nl_object *obj)
176 {
177     if (!obj)
178         return;
179
180     obj->ce_refcnt--;
181     NL_DBG(4, "Returned object reference %p, %d remaining\n",
182            obj, obj->ce_refcnt);
183
184     if (obj->ce_refcnt < 0)
185         BUG();
186
187     if (obj->ce_refcnt <= 0)
188         nl_object_free(obj);
189 }

I hope this is sufficient information.

Chris J Arges (arges)
Changed in libnl3 (Ubuntu):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → High
status: New → In Progress
Chris J Arges (arges) wrote :

This bug is related to: https://bugzilla.redhat.com/show_bug.cgi?id=886454
The patch that addresses this issue is 9aadccd57ef2ce3769475f52bc1c30cd689c8085 in netcf. This patch is present in saucy versions of netcf and beyond.

Changed in libnl3 (Ubuntu Precise):
assignee: nobody → Chris J Arges (arges)
Changed in libnl3 (Ubuntu Quantal):
assignee: nobody → Chris J Arges (arges)
Changed in libnl3 (Ubuntu Precise):
importance: Undecided → High
Changed in libnl3 (Ubuntu Quantal):
importance: Undecided → High
Changed in libnl3 (Ubuntu Precise):
status: New → In Progress
Changed in libnl3 (Ubuntu Quantal):
status: New → In Progress
Changed in libnl3 (Ubuntu):
status: In Progress → Fix Released
importance: High → Undecided
assignee: Chris J Arges (arges) → nobody
description: updated
no longer affects: libnl3 (Ubuntu)
Changed in netcf (Ubuntu):
status: New → Fix Released
Changed in netcf (Ubuntu Precise):
assignee: nobody → Chris J Arges (arges)
Changed in netcf (Ubuntu Quantal):
assignee: nobody → Chris J Arges (arges)
no longer affects: libnl3 (Ubuntu Precise)
no longer affects: libnl3 (Ubuntu Quantal)
Changed in netcf (Ubuntu Precise):
status: New → In Progress
Changed in netcf (Ubuntu Quantal):
status: New → In Progress
Changed in netcf (Ubuntu Precise):
importance: Undecided → High
Changed in netcf (Ubuntu Quantal):
importance: Undecided → High
Chris J Arges (arges)
description: updated
Chris J Arges (arges)
description: updated
Chris J Arges (arges) wrote :

Uploading fixes for P/Q.

Brian Murray (brian-murray) wrote : Please test proposed package

Hello Lee, or anyone else affected,

Accepted netcf into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/netcf/0.2.0-1ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in netcf (Ubuntu Quantal):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in netcf (Ubuntu Precise):
status: In Progress → Fix Committed
Brian Murray (brian-murray) wrote :

Hello Lee, or anyone else affected,

Accepted netcf into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/netcf/0.1.9-2ubuntu3.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Lee T. Schermerhorn (lee-schermerhorn) wrote :

Brian: I tested the 0.2.0-1ubuntu1.2 version with the same test that uncovered the problem and it ran fine overnight; ~580 iterations when it would fail in O(10) without the patched libnetcf1.

I've already tested a hot fix version of the 0.1.9-2ubuntu3.2, but I'll pull the proposed package and test that as well.

Thank you.

Rolf Leggewie (r0lf)
tags: added: verification-done
removed: verification-needed
tags: added: verification-done-quantal verification-needed
removed: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netcf - 0.2.0-1ubuntu1.2

---------------
netcf (0.2.0-1ubuntu1.2) quantal; urgency=low

  * netlink-Do-not-provide-caches-not-needed-and-only-co.patch: remove extra
    caches as they trigger an assertion error in libnl. (LP: #1277157)
 -- Chris J Arges <email address hidden> Fri, 14 Feb 2014 10:04:29 -0600

Changed in netcf (Ubuntu Quantal):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote :

Precise still needs verification.

Lee T. Schermerhorn (lee-schermerhorn) wrote :

Brian: I have lost my test system [reimaged], so I can't test the precise version at the scale that I did previously. I'm not sure when or if I'll set that environment up again. I have a smaller version on a workstation and I'll test there, but it will be with a fraction of the guests that I used earlier.

Lee T. Schermerhorn (lee-schermerhorn) wrote :

Update: testing on a workstation [start/destroy loop with 8 VMs], I was able to hit the assertion error after ~250 iterations. With the new libnetcf1 from precise-proposed [0.1.9-2ubuntu3.2], the test ran for over 1760 iterations without error.

Chris J Arges (arges) wrote :

@Lee
Thanks for testing this. Marking it verified in precise.

tags: added: verification-done-precise
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package netcf - 0.1.9-2ubuntu3.2

---------------
netcf (0.1.9-2ubuntu3.2) precise; urgency=low

  * netlink-Do-not-provide-caches-not-needed-and-only-co.patch: remove extra
    caches as they trigger an assertion error in libnl. (LP: #1277157)
 -- Chris J Arges <email address hidden> Tue, 11 Feb 2014 12:42:58 -0600

Changed in netcf (Ubuntu Precise):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for netcf has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
