Deadlock when detaching network interface

Bug #1818880 reported by Heitor Alves de Siqueira
22
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
High
Unassigned
Mitaka
Fix Released
High
Unassigned
Ocata
Fix Released
High
Unassigned
Pike
Fix Released
High
Unassigned
Queens
Fix Released
High
Unassigned
Rocky
Won't Fix
High
Unassigned
Stein
Fix Released
High
Unassigned
qemu (Ubuntu)
Fix Released
Undecided
Unassigned
Xenial
Fix Released
Undecided
Heitor Alves de Siqueira
Bionic
Fix Released
Undecided
Unassigned
Cosmic
Fix Released
Undecided
Unassigned
Disco
Fix Released
Undecided
Unassigned

Bug Description

[Impact]
Qemu guests hang indefinitely

[Description]
When running a Qemu guest with VirtIO network interfaces, detaching an interface that's currently being used can result in a deadlock. The guest instance will hang and become unresponsive to commands, and the only option is to kill -9 the instance.
The reason for this is a dealock between a monitor and an RCU thread, which will fight over the BQL (qemu_global_mutex) and the critical RCU section locks. The monitor thread will acquire the BQL for detaching the network interface, and fire up a helper thread to deal with detaching the network adapter. That new thread needs to wait on the RCU thread to complete the deletion, but the RCU thread wants the BQL to commit its transactions.
This bug is already fixed upstream (73c6e4013b4c rcu: completely disable pthread_atfork callbacks as soon as possible) and included for other series (see below), so we don't need to backport it to Bionic onwards.

Upstream commit: https://git.qemu.org/?p=qemu.git;a=commit;h=73c6e4013b4c

$ git describe --contains 73c6e4013b4c
v2.10.0-rc2~1^2~8

$ rmadison qemu
===> qemu | 1:2.5+dfsg-5ubuntu10.34 | xenial-updates/universe | amd64, ...
     qemu | 1:2.11+dfsg-1ubuntu7 | bionic/universe | amd64, ...
     qemu | 1:2.12+dfsg-3ubuntu8 | cosmic/universe | amd64, ...
     qemu | 1:3.1+dfsg-2ubuntu2 | disco/universe | amd64, ...

[Test Case]
Being a racing condition, this is a tricky bug to reproduce consistently. We've had reports of users running into this with OpenStack deployments and Windows Server guests, and the scenario is usually like this:
1) Deploy a 16vCPU Windows Server 2012 R2 guest with a virtio network interface
2) Stress the network interface with e.g. Windows HLK test suite or similar
3) Repeatedly attach/detach the network adapter that's in use
It usually takes more than ~4000 attach/detach cycles to trigger the bug.

[Regression Potential]
Regressions for this might arise from the fact that the fix changes RCU lock code. Since this patch has been upstream and in other series for a while, it's unlikely that it would regressions in RCU code specifically. Other code that makes use of the RCU locks (MMIO and some monitor events) will be thoroughly tested for any regressions with use-case scenarios and scripted runs.

Changed in qemu:
assignee: nobody → Heitor R. Alves de Siqueira (halves)
status: New → Confirmed
status: Confirmed → Fix Released
Changed in qemu (Ubuntu):
assignee: nobody → Heitor R. Alves de Siqueira (halves)
status: New → Confirmed
Changed in qemu (Ubuntu Cosmic):
assignee: nobody → Heitor R. Alves de Siqueira (halves)
Changed in qemu (Ubuntu Bionic):
assignee: nobody → Heitor R. Alves de Siqueira (halves)
Changed in qemu (Ubuntu Xenial):
assignee: nobody → Heitor R. Alves de Siqueira (halves)
Changed in qemu:
assignee: Heitor R. Alves de Siqueira (halves) → nobody
Changed in qemu (Ubuntu Disco):
status: Confirmed → Fix Released
Changed in qemu (Ubuntu Cosmic):
status: New → Fix Released
Changed in qemu (Ubuntu Bionic):
status: New → Fix Released
Changed in qemu (Ubuntu Xenial):
status: New → Confirmed
Changed in cloud-archive:
status: New → Confirmed
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :
description: updated
tags: added: sts-sponsor
Changed in qemu (Ubuntu Disco):
assignee: Heitor R. Alves de Siqueira (halves) → nobody
Changed in qemu (Ubuntu Cosmic):
assignee: Heitor R. Alves de Siqueira (halves) → nobody
Changed in qemu (Ubuntu Bionic):
assignee: Heitor R. Alves de Siqueira (halves) → nobody
Dan Streetman (ddstreet)
tags: added: sts-sponsor-ddstreet
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

Patch v2:
Added missing DEP3 info and corrected pkg version

Revision history for this message
Heitor Alves de Siqueira (halves) wrote :
Thomas Huth (th-huth)
no longer affects: qemu
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Ok, for such complicated-to-reproduce issues I can understand it might be hard to actually perform verification. In this case besides running the listed test case (which might or might not actually trigger the problem), could you also do some additional smoketesting/dogfooding of the package once it's in -proposed to make sure we didn't regress? (things like running it with different guests and playing around) Thanks!

Changed in qemu (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-xenial
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Heitor, or anyone else affected,

Accepted qemu into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.35 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Dan Streetman (ddstreet)
tags: removed: sts-sponsor sts-sponsor-ddstreet
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

Hi,

As this bug is very hard to trigger, I've been running some tests with qemu
1:2.5+dfsg-5ubuntu10.35 to see if we didn't regress or break anything with the
new patch. My test setup is as follows:

1) Create new QEMU guest with uvt-kvm or virt-install
    # uvt-kvm create xenial release=xenial --cpu 8
2) Install iperf and latest stress-ng from git. Upstream stress-ng is desired to have a consistent load on each guest
    # apt install iperf
    # git clone git://kernel.ubuntu.com/cking/stress-ng
    # cd stress-ng
    # make clean
    # make
3) Start an iperf server instance on the host
    root@host:~# iperf -s
4) Stress the guest instance with the iperf-retry.sh script and stress-ng (run those commands in different screen windows)
    # ./stress-ng --cpu 4 --hdd 4 --io 4 --vm 4
    # ./stress-ng --class network --all 2
    # ./iperf-retry.sh
5) Attach and detach network adapter using the hotplug.sh script
    root@host:~# ./hotplug.sh xenial 52:54:00:19:7a:21

I've tested this with different guests, including Xenial, Bionic and Disco.
Guest instances performed more than ~2000 hotplug cycles each:
    root@host:~# ./hotplug.sh xenial 52:54:00:19:7a:21
    Detach #1
    Interface detached successfully

    Detach #2
    Interface detached successfully

    ...
    Detach #2168
    Interface detached successfully

The QEMU guests work correctly through and after the tests, and it doesn't
look like we ran into any regressions due to the RCU patch.

Thanks,
Heitor

Revision history for this message
Heitor Alves de Siqueira (halves) wrote :
tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Thanks, this should be enough in that case. Releasing.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.35

---------------
qemu (1:2.5+dfsg-5ubuntu10.35) xenial; urgency=medium

  * Fix deadlock when detaching network interface (LP: #1818880)
    Fixed by upstream patch:
    - d/p/lp-1818880-rcu-disable-atfork.patch: rcu: completely disable
      pthread_atfork callbacks as soon as possible

 -- <email address hidden> (Heitor R. Alves de Siqueira) Fri, 01 Mar 2019 15:59:01 -0300

Changed in qemu (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

We don't have qemu in the rocky cloud archive so this is fixed via bionic.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The Stein cloud archive has qemu 1:3.1+dfsg-2ubuntu3~cloud0 which corresponds to the current version in disco, so I'm marking this as fix released for Stein.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've just accepted qemu version 1:2.11+dfsg-1ubuntu7.12~cloud0 into queens-proposed however the changelog doesn't mention (LP: #1818880) anywhere. Seeing that bionic is marked as fix released I assume queens can be marked as fix released.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Heiter, please if fixes are needed for ocata or pike can you provide us with patches? We need to be careful not to skip cloud archive releases (ie. if this is fixed in xenial but not the ocata and pike cloud archive).

Revision history for this message
Edward Hope-Morley (hopem) wrote :

@corey.bryant I just spoke to @halves and he said that the series targets above Bionic are an oversight since this patch landed in anything newer than 2.11 (i.e. bionic version). We do also need this for Trusty-Mitaka though so I have added that as a UCA target. I'll let @halves reply about O/P.

Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

@@corey.bryant I've checked Pike and it has the fix included already, so I'm attaching a debdiff for Ocata only.

@hopem About Trusty-Mitaka, it seems that it automatically pulled the patch from Xenial's release so it looks good too.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Heitor, or anyone else affected,

Accepted qemu into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Heitor, or anyone else affected,

Accepted qemu into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

Hi,

I've done some regression testing on qemu from mitaka-proposed, following the procedure from comment #6:

$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20131111.c3d1e78-2ubuntu1.1 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:amd64 1:2.5+dfsg-5ubuntu10.36~cloud0 amd64 extra block backend modules for qemu-
system and qemu-utils
ii qemu-kvm 1:2.5+dfsg-5ubuntu10.36~cloud0 amd64 QEMU Full virtualization
ii qemu-system-common 1:2.5+dfsg-5ubuntu10.36~cloud0 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-x86 1:2.5+dfsg-5ubuntu10.36~cloud0 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.5+dfsg-5ubuntu10.36~cloud0 amd64 QEMU utilities

$ /hotplug.sh xenial 52:54:00:b3:ef:bc
Detach #1
Interface detached successfully

Detach #2
Interface detached successfully

...
Detach #2617
Interface detached successfully

This was performed while iperf and stress-ng were running. The guest works correctly through and after the tests.

Thanks,
Heitor

tags: added: verification-mitaka-done
removed: verification-mitaka-needed
Revision history for this message
Heitor Alves de Siqueira (halves) wrote :

Hi,

I've done some regression testing on qemu from ocata-proposed, following the procedure from comment #6:

$ dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1.2 all PXE boot firmware - ROM images for qemu
ii qemu-block-extra:amd64 1:2.8+dfsg-3ubuntu2.9~cloud4 amd64 extra block backend modules for qemu-system and qemu-
utils
ii qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud4 amd64 QEMU Full virtualization
ii qemu-system-common 1:2.8+dfsg-3ubuntu2.9~cloud4 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-x86 1:2.8+dfsg-3ubuntu2.9~cloud4 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.8+dfsg-3ubuntu2.9~cloud4 amd64 QEMU utilities

$ /hotplug.sh xenial 52:54:00:bb:23:77
Detach #1
Interface detached successfully

Detach #2
Interface detached successfully

...
Detach #2722
Interface detached successfully

This was performed while iperf and stress-ng were running. The guest works correctly through and after the tests.

Thanks,
Heitor

tags: added: verification-ocata-done
removed: verification-ocata-needed
Revision history for this message
James Page (james-page) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Page (james-page) wrote :

This bug was fixed in the package qemu - 1:2.8+dfsg-3ubuntu2.9~cloud4
---------------

 qemu (1:2.8+dfsg-3ubuntu2.9~cloud4) xenial-ocata; urgency=medium
 .
   * Fix deadlock when detaching network interface (LP: #1818880)
     Fixed by upstream patch:
     - d/p/lp-1818880-rcu-disable-atfork.patch: rcu: completely disable
       pthread_atfork callbacks as soon as possible

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.