libvirt-bin nwfilter deadlock

Bug #1753604 reported by Pooja Ghumre
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Fix Released
Undecided
Unassigned
Xenial
Fix Released
Undecided
Unassigned

Bug Description

[Impact]

 * Fix a potential deadlock

 * The trigger is essentially a race, so it is rare and repro unreliable

[Test Case]

 * This is the hard part for this issue, essentially changing NWFilters
   through libvirt. The reporter hits it on removing a guest, but it is
   hard to make a recipe by that.

 * We have to fall back to general regression checks with an extra eye on
   NWFilter handling

[Regression Potential]

 * The only bad part is the hard repro, everything else seems to make this
   low risk
   - the patch is small
   - the patch got no follow on fix in 2 years (so it is very stable)
   - it unlocks earlier (so deadlocks are less likely)
   - there is no data issue (being unlocked early) as explained in the
     patch

[Other Info]

 * n/a

---

When deleting a VM, libvirt-bin intermittently hits a deadlock in nwfilter instantiation code.
This is fixed in libvirt upstream, but the below commit is not cherrypicked in ubuntu yet:
https://github.com/libvirt/libvirt/commit/2f5e24ba00d93c25b87561cb1f5259de175d70f9

Thread 5 (Thread 0x7f2ce1c4b700 (LWP 13866)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2cee30ce42 in __GI___pthread_mutex_lock (mutex=0x7f2c980008d0) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007f2cc7d7a62f in virNWFilterLockIface () from /usr/lib/libvirt/connection-driver/libvirt_driver_nwfilter.so
#3 0x00007f2cc7d6ebf5 in ?? () from /usr/lib/libvirt/connection-driver/libvirt_driver_nwfilter.so
#4 0x00007f2cc7d6f84c in virNWFilterTeardownFilter () from /usr/lib/libvirt/connection-driver/libvirt_driver_nwfilter.so
---Type <return> to continue, or q <return> to quit---
#5 0x00007f2ceea32f0c in virDomainConfVMNWFilterTeardown () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#6 0x00007f2cc73d5a86 in qemuProcessStop () from /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so
#7 0x00007f2cc7420577 in ?? () from /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so
#8 0x00007f2ceea857cf in virDomainDestroy () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#9 0x000055c4ba3fe9db in ?? ()
#10 0x00007f2ceeaf8f69 in virNetServerProgramDispatch () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#11 0x00007f2ceeaf4478 in ?? () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#12 0x00007f2cee9eb566 in ?? () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#13 0x00007f2cee9eaae8 in ?? () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#14 0x00007f2cee30a6ba in start_thread (arg=0x7f2ce1c4b700) at pthread_create.c:333
#15 0x00007f2cee04041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 17 (Thread 0x7f2cc2801700 (LWP 8866)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2cee30ce42 in __GI___pthread_mutex_lock (mutex=0x7f2cc7f8d280) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007f2cc7d6f6d3 in virNWFilterInstantiateFilterLate () from /usr/lib/libvirt/connection-driver/libvirt_driver_nwfilter.so
#3 0x00007f2cc7d7a9b2 in ?? () from /usr/lib/libvirt/connection-driver/libvirt_driver_nwfilter.so
#4 0x00007f2cee9eab12 in ?? () from /usr/lib/x86_64-linux-gnu/libvirt.so.0
#5 0x00007f2cee30a6ba in start_thread (arg=0x7f2cc2801700) at pthread_create.c:333
#6 0x00007f2cee04041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Pooja Ghumre (pooja-9)
tags: removed: pat
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the report Pooja,
this is fixed in v1.3.5

By that this in anything newer than Xenial via being upstream.

I need to take a look at backporting the suggested change to Xenial, but that will probably not happen this week as I'm traveling.

Changed in libvirt (Ubuntu Xenial):
status: New → Confirmed
status: Confirmed → Triaged
Changed in libvirt (Ubuntu):
status: New → Fix Released
Revision history for this message
Pooja Ghumre (pooja-9) wrote :

Thanks Christian!

Right, we were building libvirt from upstream 1.3.5 source code for ubuntu14 earlier, however that same build procedure didn't seem to work for ubuntu16. Hence the need for backporting it to xenial/trusty, so that we can use the libvirt packages available from ubuntu repo directly.

Appreciate your help on this!

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I tried loop-killing VMs also with some extra filters, but there is no reliable "this is how you trigger it" for this case.
As initially reported it "intermittently" fails.

That will make the SRU template a bit thin on test steps.
Never the less the fix is correct and makes sense.
So running the usual regressions will be most of what we can do.

On the good side, between the version in xenial and the fix there is no other related locking change to consider on the backport and the change itself is limited to just nwfilters.

@Pooja - can you surely-reproduce this in any way, so that we could use your verification as the check for this eventually?

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

SRU Template added.

Fix checked by regressions tests against ppa, ready for SRU considerarion.
- Xenial: libvirt_1.3.1-1ubuntu10.21_source.changes (This also includes a fix for 1688508)

Note: test builds in https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3224

Regression tests started on all arches

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Start/Stop-Looped a bunch of guests with extra nwfilters, but as I mentioned in the SRU Template it is inherently racy and so far never triggered for me. So this isn't a full verification, but more a sanity check.

Regression tests good as well, moving it to x-unapproved for the SRU Team to consider.
=> libvirt_1.3.1-1ubuntu10.21_source.changes

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Pooja, or anyone else affected,

Accepted libvirt into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.3.1-1ubuntu10.21 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-xenial
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (4.6 KiB)

As I said this is hard to test explicitly.
The issue is between instantiating and destroying nwfilters.
So to stress the area of the code I created a test:

1. run a guest (bionic in my case, but it doesn't matter)

2. prep hotpluggable interfaces with deep nwfilter rules
   Clean-traffic references some subrules and is available by default, so use
   # cat net-1.xml
    <interface type='network'>
      <mac address='de:ad:be:ef:00:01'/>
      <source network='default' bridge='virbr0'/>
      <model type='virtio'/>
      <filterref filter='clean-traffic'/>
    </interface>
   Spread out this rule by
   # for i in $(seq 2 9); do sed -e "s/01/0$i/g" net-1.xml > net-$i.xml ; don

3. attach all of them and destroy the guest
  $ for i in $(seq 1 9); do virsh attach-device b-test net-$i.xml; done; echo "date"; virsh destroy b-test;

Note: I had concurrent attach/detach loops, but that was (independent to the bug here) too much as it only led to timeouts on cleanup, so it was not helping to test this.

This locks it up in:
# cat /proc/12531/wchan
futex_wait_queue_me

BT looks just as the bug described:
Thread 10 (Thread 0x7fee6a9ef700 (LWP 12540)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fee798b5e42 in __GI___pthread_mutex_lock (mutex=0x7fee00000a30) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007fee79f93c65 in virMutexLock (m=<optimized out>) at ../../../src/util/virthread.c:89
#3 0x00007fee5331e62f in virNWFilterLockIface (ifname=ifname@entry=0x7fee440068a0 "vnet1")
    at ../../../src/nwfilter/nwfilter_learnipaddr.c:180
#4 0x00007fee53312bf5 in _virNWFilterTeardownFilter (ifname=0x7fee440068a0 "vnet1")
    at ../../../src/nwfilter/nwfilter_gentech_driver.c:1086
#5 0x00007fee5331384c in virNWFilterTeardownFilter (net=0x7fee44000b50)
    at ../../../src/nwfilter/nwfilter_gentech_driver.c:1104
#6 0x00007fee79fdbf0c in virDomainConfVMNWFilterTeardown (vm=vm@entry=0x7fee300009f0)
    at ../../../src/conf/domain_nwfilter.c:64
#7 0x00007fee52979a86 in qemuProcessStop (driver=driver@entry=0x7fee483a53c0, vm=0x7fee300009f0,
    reason=reason@entry=VIR_DOMAIN_SHUTOFF_DESTROYED, flags=0) at ../../../src/qemu/qemu_process.c:5290
#8 0x00007fee529c4577 in qemuDomainDestroyFlags (dom=<optimized out>, flags=0)
---Type <return> to continue, or q <return> to quit---
    at ../../../src/qemu/qemu_driver.c:2228
#9 0x00007fee7a02e7cf in virDomainDestroy (domain=domain@entry=0x7fee28000c50) at ../../../src/libvirt-domain.c:479

Thread 18 (Thread 0x7fee1cff9700 (LWP 14653)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fee798b5dbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fee7a455740 <ruleLock>)
    at ../nptl/pthread_mutex_lock.c:80
#2 0x00007fee79f93c65 in virMutexLock (m=m@entry=0x7fee7a455740 <ruleLock>) at ../../../src/util/virthread.c:89
#3 0x00007fee79f5991d in virFirewallApply (firewall=firewall@entry=0x7fee00000ac0)
    at ../../../src/util/virfirewall.c:933
#4 0x00007fee5331e4a5 in ebtablesApplyDropAllRules (ifname=0x7fee44000de8 "vnet1")
    at ../../../src/nwfilter/nwfilter_ebiptables_driver.c:3107
#5 0x00007fee5331eb2a in learnIPAddress...

Read more...

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Pooja Ghumre (pooja-9) wrote :

Thanks Christian and Łukasz for fixing this issue! I will update our tests to use this new package and let you know if the issue persists.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.21

---------------
libvirt (1.3.1-1ubuntu10.21) xenial; urgency=medium

  * d/p/ubuntu/lp1688508-fix-variable-scope-in-in-check_guests_shutdown.patch:
    backport further upstream fixes that were identified on verification.
    Together with the former change this fixes (LP: #1688508)
  * d/p/ubuntu/lp1753604-nwfilter-fix-lock-order-deadlock.patch:
    fix intermittent deadlock in NWFilter handling (LP: #1753604)

libvirt (1.3.1-1ubuntu10.20) xenial; urgency=medium

  * d/p/ubuntu/lp1688508-tools-avoid-text-spilling-into-variables.patch:
    avoid hanging on shutdown (LP: #1688508)

 -- Christian Ehrhardt <email address hidden> Wed, 04 Apr 2018 10:46:12 +0200

Changed in libvirt (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.