Problem Description:
When irqfds are not used, the adapter interruption host->guest notifier bit is set by the QEMU function virtio_set_ind_atomic().
The atomic_cmpxchg() loop in virtio_set_ind_atomic() is broken because old and _old occasionally end up with different values: a legitimate compiler can generate code that accesses *ind_addr again to pick up a value for _old instead of reusing the value of old that, according to the rules of the abstract machine, was already fetched. This means that, when atomic_cmpxchg() does perform the exchange, the underlying CS instruction may have used a different old value (_old) than the one we intended.
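For illustration, a compare-and-swap loop that avoids this class of problem can be sketched as follows. This is not the QEMU code and not the upstream patch: it assumes C11 <stdatomic.h> instead of QEMU's atomic_cmpxchg(), and set_ind_atomic_sketch() is a made-up name. The point is that the expected value is read once and afterwards only refreshed from what the compare-and-swap itself observed, so the value used to compute the new byte and the value compared against cannot diverge.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not QEMU code): atomically OR to_be_set into *ind
 * and return the value the byte had immediately before the update.
 */
static uint8_t set_ind_atomic_sketch(_Atomic uint8_t *ind, uint8_t to_be_set)
{
    uint8_t expected;

    assert(ind != NULL);

    /* Read the indicator byte exactly once before entering the loop. */
    expected = atomic_load(ind);

    /*
     * If the compare-and-swap fails, it writes the value it actually found
     * into 'expected'. The next attempt therefore computes the new value
     * from, and compares against, that same observed value.
     */
    while (!atomic_compare_exchange_weak(ind, &expected,
                                         (uint8_t)(expected | to_be_set))) {
        /* Retry with the refreshed 'expected'. */
    }

    /* On success 'expected' still holds the pre-update value. */
    return expected;
}
```

With a byte that starts as 0x01, calling set_ind_atomic_sketch(&ind, 0x04) returns 0x01 and leaves 0x05 in ind; bits that were already set are never dropped.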
The direct consequence of the problem is that host->guest notifications can get lost. The indirect consequence is that queues may get stuck and devices may cease to operate normally. We stumbled on this while debugging a stalled virtio-net interface (one using the QEMU driver rather than vhost), but other virtio-ccw devices can be affected as well.
If irqfds are used for host->guest notifications, then we are safe because notifier bit manipulation is done in the kernel (and it's done correctly).
The problem described above is fixed upstream by commit:
1a8242f7c3 ("virtio-ccw: fix virtio_set_ind_atomic")
All upstream versions since v2.0.0 are (potentially) affected.
The same mistake was made in another place in QEMU, where it is fixed by:
45175361f1 ("s390x/pci: fix set_ind_atomic")
We can file a separate BZ for it if necessary.