[UBUNTU 20.04] Lost virtio host --> guest notifications cause devices to cease normal operation

Bug #1894942 reported by bugproxy on 2020-09-09
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
High
Skipper Bug Screeners
qemu (Ubuntu)
High
Canonical Server Team
Xenial
Low
Unassigned
Bionic
Medium
Unassigned
Focal
Medium
Unassigned
Groovy
High
Canonical Server Team

Bug Description

[Impact]

 * Host -> Guest notifications can be lost and kill I/O due to that,
   see below at the original bug report for more details.

 * Backport the fix that ensures that the generated code has to re-load
   variables properly avoiding the issue.

[Test Case]

 * Set up iperf in the host and run the server "iperf -s"
 * get a guest using driver=qemu like:
    <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <driver name='qemu'/>
    <interface/>
 * In the guest run a loop of iperf runs connecting to the
   server on the host.
    #!/bin/bash
    for i in $(seq 1 1000);
    do
      echo Try $i
      iperf -c 192.168.122.1 || break
    done
 * Depending on the HW model, the machine saturation and such it seems
   the above test either is rather reproducible or not-at-all.
   That is bad, but we haven't found a much better repro, gladly IBM
   who reported this issue (and created the fix) can recreate this on
   their end and are willing to do so again for the SRU verification.

[Regression Potential]

 * The changed code path is s390x only and there on the virtio-ccw
   handling. Therefore regressions - if any - would be isolated to s390x
   only and there manifest on virtio-ccw based I/O.

[Other Info]

 * n/a

----

Problem Description:

When irqfds are not used setting of the adapter interruption host-->guest notifier bit is accomplished by the QEMU function virtio_set_ind_atomic().

The atomic_cmpxchg() loop in virtio_set_ind_atomic() is broken because we occasionally end up with old and _old having different values (a legit compiler can generate code that accessed *ind_addr again to pick up a value for _old instead of using the value of old that was already fetched according to the rules of the abstract machine). This means the underlying CS instruction may use a different old (_old) than the one we intended to use if atomic_cmpxchg() performed the xchg part.

The direct consequence of the problem is that host --> guest notifications can get lost. The indirect consequence is that queues may get stuck and the devices may cease operate normally. We stumbled on debugging a choked virtio-net interface (one that used the qemu driver and not vhost). But it can affect other virtio-ccw devices as well.

If irqfds are used for host->guest notifications, then we are safe because notifier bit manipulation is done in the kernel (and it's done correctly).

The problem described above is fixed upstream by commit.

1a8242f7c3 ("virtio-ccw: fix virtio_set_ind_atomic")

All upstream versions since v2.0.0 are (potentially) affected.

The same mistake was made in QEMU in another place, and is fixed by:

45175361f1 ("s390x/pci: fix set_ind_atomic")

We can file a separate BZ for it if necessary.

Related branches

Revision history for this message
bugproxy (bugproxy) wrote : Debug only patch that simulates PV guest IO on a non-PV machine

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-184605 severity-high targetmilestone-inin2004
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → qemu (Ubuntu)
Frank Heimes (fheimes) on 2020-09-09
Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Changed in qemu (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Server Team (canonical-server)
Changed in ubuntu-z-systems:
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
the patches LGTM and I'll make it part of the Ubuntu builds.

One question thou: on one hand "occasionally end up with" sounds like this is almost impossible to explicitly test, but then the attached debug patch suggests there might be a way to trigger this through protvirt.
But even if that is the case backporting it to all kind of older versions there is no protvirt.
So I wanted to ask:
a) is there any way to reliably trigger this for A/B testing of the fix as far back as qemu 2.5?
b) how real is the danger to hit and consequences of this in the pre-protvirt (=<Focal) era?

Changed in qemu (Ubuntu Xenial):
status: New → Incomplete
Changed in qemu (Ubuntu Bionic):
status: New → Incomplete
Changed in qemu (Ubuntu Focal):
status: New → Triaged
importance: Undecided → Medium
Changed in qemu (Ubuntu):
importance: Undecided → High
status: New → In Progress
Frank Heimes (fheimes) on 2020-09-09
Changed in ubuntu-z-systems:
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-09-09 02:50 EDT-------
> Hi,
> the patches LGTM and I'll make it part of the Ubuntu builds.
>
> One question thou: on one hand "occasionally end up with" sounds like this
> is almost impossible to explicitly test, but then the attached debug patch
> suggests there might be a way to trigger this through protvirt.
> But even if that is the case backporting it to all kind of older versions
> there is no protvirt.
> So I wanted to ask:
> a) is there any way to reliably trigger this for A/B testing of the fix as
> far back as qemu 2.5?
> b) how real is the danger to hit and consequences of this in the
> pre-protvirt (=<Focal) era?

This is independent of protvirt.
uperf network workload with driver=qemu in libvirt for the network interface did trigger after some time.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

At least for Groovy that seems pretty trivial.
PPA: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4257/+packages
MP: https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/390440

For consideration on SRUs for other releases please answer the questions of comment #2

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the quick answer!
Any particular uperf profile that was found to be more likely/reliably to trigger it?
I assume with driver=qemu you actually mean something like:
  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <driver name='qemu'
Which is the only driver=qemu that makes sense for [1] I think.
If you have a libvirt XML of the guest that was used somewhere could you please just attach it.

[1]: https://libvirt.org/formatdomain.html#network-interfaces

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-09-09 09:44 EDT-------
(In reply to comment #67)
> Thanks for the quick answer!
> Any particular uperf profile that was found to be more likely/reliably to
> trigger it?
> I assume with driver=qemu you actually mean something like:
> <interface type='network'>
> <source network='default'/>
> <model type='virtio'/>
> <driver name='qemu'
> Which is the only driver=qemu that makes sense for [1] I think.

You are right, that is what we mean.

> If you have a libvirt XML of the guest that was used somewhere could you
> please just attach it.

Will have a look.

>
> [1]: https://libvirt.org/formatdomain.html#network-interfaces

(In reply to comment #67)
> Thanks for the quick answer!
> Any particular uperf profile that was found to be more likely/reliably to
> trigger it?

Actually I used iperf in a loop to trigger this (from the guest)

for i in $(seq 1 100) ;
do
echo $i
iperf -c 192.168.122.1 || break
done

Hope that helps.

Revision history for this message
bugproxy (bugproxy) wrote : The domainxml of the gurst I used to debug

------- Comment on attachment From <email address hidden> 2020-09-09 10:03 EDT-------

This is a domainxml I used. It specifies a custom built qemu (installed under /usr/local/bin), so among others that is likely a subject to be changed before any re-use.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Halil,
I have uploaded it to Groovy already.
For SRU considerations let us wait what the outcome of 1887930 will be - since if approved - they'd make a good pair for an SRU upload.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:5.0-5ubuntu8

---------------
qemu (1:5.0-5ubuntu8) groovy; urgency=medium

  * d/p/u/lp-1887930-*: Enable Channel Path Handling for vfio-ccw (LP: #1887930)

qemu (1:5.0-5ubuntu7) groovy; urgency=medium

  * d/p/u/lp-1894942-*: fix virtio-ccw host/guest notification (LP: #1894942)

 -- Christian Ehrhardt <email address hidden> Mon, 14 Sep 2020 08:23:49 +0200

Changed in qemu (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Checking SRU compatibility:

#1 Source
Focal: some (minor) backport noise
Bionic: the same backport of Focal applies with offsets.
Xenial: some more nois (tracepoint missing), but still ok

Not as-is, but pretty close - should be ok

#2 Builds
I have started builds of those in
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4270

#3 testing
The builds in the PPA succeeded now.
I'd ask IBM to verify with their tests to check if they can
a) trigger the issue on those releases
and
b) if the PPA then is fixing for them as expected.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-10-12 21:16 EDT-------
(In reply to comment #72)
> Checking SRU compatibility:
>
> #1 Source
> Focal: some (minor) backport noise
> Bionic: the same backport of Focal applies with offsets.
> Xenial: some more nois (tracepoint missing), but still ok
>
> Not as-is, but pretty close - should be ok
>
> #2 Builds
> I have started builds of those in
> https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4270
>
> #3 testing
> The builds in the PPA succeeded now.
> I'd ask IBM to verify with their tests to check if they can
> a) trigger the issue on those releases
> and
> b) if the PPA then is fixing for them as expected.

Both focal and xenial look good: the issue can be triggered with an up to date distro qemu and can not be triggered with the PPA qemu.

I've ran ran into some unrelated issues with bionic, I will try to test there as well after the impediments are out of the way.

Let me also note, that with xenial I see some recoverable connection drops (with PPA). Probably a different issue.

Changed in qemu (Ubuntu Bionic):
status: Incomplete → Triaged
Changed in qemu (Ubuntu Xenial):
status: Incomplete → Triaged
importance: Undecided → Low
Changed in qemu (Ubuntu Bionic):
importance: Undecided → Medium
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Halil!

I was not able to trigger this on our machine in a thousand runs on bionic and focal, but it might be machine/model and/or overall machine business dependent.

Therefore I'm extra-glad you gave this a try on your end as well!
Due to that I'll also have to rely on you for the actual final verification in -proposed then.

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-10-13 12:41 EDT-------
(In reply to comment #75)
> (In reply to comment #72)
> > Checking SRU compatibility:
> >
> > #1 Source
> > Focal: some (minor) backport noise
> > Bionic: the same backport of Focal applies with offsets.
> > Xenial: some more nois (tracepoint missing), but still ok
> >
> > Not as-is, but pretty close - should be ok
> >
> > #2 Builds
> > I have started builds of those in
> > https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4270
> >
> > #3 testing
> > The builds in the PPA succeeded now.
> > I'd ask IBM to verify with their tests to check if they can
> > a) trigger the issue on those releases
> > and
> > b) if the PPA then is fixing for them as expected.
>
> Both focal and xenial look good: the issue can be triggered with an up to
> date distro qemu and can not be triggered with the PPA qemu.
>
> I've ran ran into some unrelated issues with bionic, I will try to test
> there as well after the impediments are out of the way.
>
> Let me also note, that with xenial I see some recoverable connection drops
> (with PPA). Probably a different issue.

With regards to xenial I jumped to conclusion to quick. The drops in connectivity were these easily recoverable ones, so I've looked at the generated assembly, and I don't see the extra load that is the cause of trouble. I would still prefer to have this fixed, because the compiler is free to generate broken code, it just does not in this particular case, but it is up to you. Sorry for the noise.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-10-14 21:19 EDT-------
(In reply to comment #75)
> (In reply to comment #72)
> > Checking SRU compatibility:
> >
> > #1 Source
> > Focal: some (minor) backport noise
> > Bionic: the same backport of Focal applies with offsets.
> > Xenial: some more nois (tracepoint missing), but still ok
> >
> > Not as-is, but pretty close - should be ok
> >
> > #2 Builds
> > I have started builds of those in
> > https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4270
> >
> > #3 testing
> > The builds in the PPA succeeded now.
> > I'd ask IBM to verify with their tests to check if they can
> > a) trigger the issue on those releases
> > and
> > b) if the PPA then is fixing for them as expected.
>
> Both focal and xenial look good: the issue can be triggered with an up to
> date distro qemu and can not be triggered with the PPA qemu.
>
> I've ran ran into some unrelated issues with bionic, I will try to test
> there as well after the impediments are out of the way.
>

I've managed to test bionic as well. It is pretty much same as xenial. That means while the source code is broken, in a sense that the compiler is allowed to generate broken code, the generated code does not exhibit the issue in question.

As with xenial I do see intermittent connection problems with iperf3, but the recv queue does not get stuck for good (like in focal, because of the lost interrupt).

The presumably fixes the code source I can not see any positive or negative impact with regards to the intermittent and recoverable connectivity problems.

Changed in qemu (Ubuntu Xenial):
status: Triaged → In Progress
Changed in qemu (Ubuntu Bionic):
status: Triaged → In Progress
Changed in qemu (Ubuntu Focal):
status: Triaged → In Progress
Frank Heimes (fheimes) on 2020-10-19
Changed in ubuntu-z-systems:
status: Triaged → In Progress
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: Review is complete and this was uploaded to the X/B/F-unapproved queue entering the SRU process.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted qemu into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.2-3ubuntu6.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Changed in qemu (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-1ubuntu7.33 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Brian Murray (brian-murray) wrote :

Hello bugproxy, or anyone else affected,

Accepted qemu into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.47 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.33)

All autopkgtests for the newly accepted qemu (1:2.11+dfsg-1ubuntu7.33) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

libvirt/4.0.0-1ubuntu8.17 (i386)
systemd/237-3ubuntu10.42 (ppc64el)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Frank Heimes (fheimes) on 2020-10-28
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I found no regressions running this on some general workloads which is good, but as mentioned in the SRU I can't reproduce the issue on our s390x box.
Therefore it is up to IBM (the reporter) to verify the builds in -proposed in regard to the actual bug fix.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-11-03 06:04 EDT-------
I was also not able to reproduce this with the latest updates from the focal-proposed repository.

Both guests I set up had also used the qemu driver.

<interface type='network'>
<mac address='52:54:00:42:af:52'/>
<source network='default' portid='da4aacfa-671f-49bc-b4f3-6ee26c356628' bridge='virbr0'/>
<target dev='vnet0'/>
<model type='virtio'/>
<driver name='qemu' iommu='on'/>
<alias name='net0'/>
<address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
</interface>

libvirt version: 6.0.0
package: 0ubuntu8.4
qemu version: 4.2.1Debian 1:4.2-3ubuntu6.8
kernel: 5.4.0-52-generic

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks, could you please also test Xenial and Bionic?

Frank Heimes (fheimes) on 2020-11-03
tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Chris Halse Rogers (raof) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-3ubuntu6.8

---------------
qemu (1:4.2-3ubuntu6.8) focal; urgency=medium

  * d/p/u/lp-1894942-*: fix virtio-ccw host/guest notification (LP: #1894942)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Sep 2020 15:35:30 +0200

Changed in qemu (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-11-04 07:54 EDT-------
(In reply to comment #86)
> Thanks, could you please also test Xenial and Bionic?

I've just verified bionic. The results are pretty much the same as the results with the ppa version I tested before.

I will try to get the Xenial verification done too before the end of day.

Revision history for this message
Frank Heimes (fheimes) wrote :

Thanks Halil for testing bionic (I've updated the tag accordingly),
and also thanks for having a look at xenial ...

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-11-04 13:19 EDT-------
(In reply to comment #92)
> (In reply to comment #86)
> > Thanks, could you please also test Xenial and Bionic?
>
> I will try to get the Xenial verification done too before the end of day.
>

I've just verified Xenial. Surprise, surprise the results are pretty much the same as the results with the ppa version I tested before ;)

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Thanks!
Updated verification tags.

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.33

---------------
qemu (1:2.11+dfsg-1ubuntu7.33) bionic; urgency=medium

  * d/p/u/lp-1894942-*: fix virtio-ccw host/guest notification (LP: #1894942)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Sep 2020 15:39:32 +0200

Changed in qemu (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.47

---------------
qemu (1:2.5+dfsg-5ubuntu10.47) xenial; urgency=medium

  * d/p/u/lp-1894942-*: fix virtio-ccw host/guest notification (LP: #1894942)

 -- Christian Ehrhardt <email address hidden> Mon, 21 Sep 2020 15:50:56 +0200

Changed in qemu (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-11-05 05:44 EDT-------
IBM Bugzilla status-> closed, Fix Released with all requested Distros

Frank Heimes (fheimes) on 2020-11-05
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers