Guest crashed when detaching the ovs interface device

Bug #1812822 reported by Xiao Feng Ren on 2019-01-22
This bug affects 1 person
Affects                    Importance   Assigned to
Ubuntu on IBM z Systems    Medium       Unassigned
linux (Ubuntu)             Undecided    Unassigned
qemu (Ubuntu)              Undecided    Unassigned

Bug Description

 When detaching an openvswitch interface device with virsh detach-device, if the port has already been deleted from the OVS bridge and the interface device has been removed from the host, virsh detach-device fails with "error: Unable to read from monitor: Connection reset by peer", qemu terminates, and the log shows "TUNSETVNETLE ioctl() failed: File descriptor in bad state".

[Background] This error was originally found in the OpenStack KVM CI tempest tests. Investigation showed it was introduced by an os-vif patch that deletes the OVS port and deletes the interface before detaching the device. The commit can be found via https://bugs.launchpad.net/os-vif/+bug/1801072

Reproduced:

   root@xxxx:~#  ovs-vsctl del-port br0 tap9273235a-dd
   root@xxxx:~#  ip link del tap9273235a-dd

The interface device tap9273235a-dd has been removed from the host (verified with ifconfig and ovs-vsctl show) but can still be seen in the guest (log on to the guest and run ip a; it is in the down state).

root@xxxx:~# virsh detach-device kvm net.xml
error: Failed to detach device from net.xml
error: Unable to read from monitor: Connection reset by peer

qemu has terminated, and the log in /var/log/libvirt/qemu/kvm.log shows:
TUNSETVNETLE ioctl() failed: File descriptor in bad state.
2019-01-18 08:16:11.304+0000: shutting down, reason=crashed

It seems qemu tried to operate on this interface, but it had already been deleted; qemu could not access the file descriptor and raised the error.
But I don't think the guest should crash outright on a file descriptor error.

Environment:
Ubuntu 16.04.5 LTS
Linux (EC12) 4.4.0-141-generic
QEMU emulator version 2.11.1 (Debian 1:2.11+dfsg-1ubuntu7.5~cloud0)
libvirtd (libvirt) 4.0.0

net.xml

  <interface type='bridge'>
    <mac address='52:54:00:fb:5c:46'/>
    <source bridge='br0'/>
    <virtualport type='openvswitch'>
      <parameters interfaceid='9273234d-9ad4-4ecf-8869-d63ac17a0e6d'/>
    </virtualport>
    <target dev='tap9273235a-dd'/>
    <model type='virtio'/>
    <mtu size='1450'/>
    <alias name='net1'/>
    <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0005'/>
  </interface>

kvm.xml

<domain type='kvm' id='31'>
  <name>kvm</name>
  <uuid>59f71b47-16e4-401d-9d33-30bc1605a84a</uuid>
  <memory unit='KiB'>524288</memory>
  <currentMemory unit='KiB'>524288</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-bionic'>hvm</type>
    <boot dev='hd'/>
  </os>
  <cpu>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-s390x</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/root/xenial-minimal.qcow2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </disk>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='sclp' port='0'/>
      <alias name='console0'/>
    </console>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </memballoon>
    <panic model='s390'/>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-59f71b47-16e4-401d-9d33-30bc1605a84a</label>
    <imagelabel>libvirt-59f71b47-16e4-401d-9d33-30bc1605a84a</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+0</label>
    <imagelabel>+0:+0</imagelabel>
  </seclabel>
</domain>

tags: added: s390x
removed: detach device
bugproxy (bugproxy) on 2019-01-22
tags: added: architecture-s39064 bugnameltc-174882 severity-high targetmilestone-inin16045

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1812822

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Frank Heimes (frank-heimes) wrote :

As per comment #1, please share/attach the relevant logs.

Xiao Feng Ren (renxiaof) wrote :

I couldn't use the apport-collect 1812822 command:

root@****:~# apport-collect 1812822
ERROR: The python-launchpadlib package is not installed. This functionality is not available.

root@****:~# apt install python-launchpadlib -y
Reading package lists... Done
Building dependency tree
Reading state information... Done
python-launchpadlib is already the newest version (1.10.3-3ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.

root@****:~# dpkg -l python-launchpadlib
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=================-=============-=============-========================================
ii python-launchpadl 1.10.3-3ubunt all Launchpad web services client library

Here's the guest log under /var/log/libvirt/qemu/kvm.log:

2019-01-31 10:52:55.667+0000: starting up libvirt version: 4.0.0, package: 1ubuntu8.5~cloud0 (Openstack Ubuntu Testing Bot <email address hidden> Fri, 07 Sep 2018 04:25:04 +0000), qemu version: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.5~cloud0), hostname: zfwcec178.boeblingen.de.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-system-s390x -name guest=kvm,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-36-kvm/master-key.aes -machine s390-ccw-virtio-bionic,accel=kvm,usb=off,dump-guest-core=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 59f71b47-16e4-401d-9d33-30bc1605a84a -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-36-kvm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -drive file=/root/xenial-minimal.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev pty,id=charconsole0 -device sclpconsole,chardev=charconsole0,id=console0 -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0001 -msg timestamp=on
2019-01-31 10:52:55.667+0000: Domain id=36 is tainted: high-privileges
2019-01-31T10:52:55.749009Z qemu-system-s390x: -chardev pty,id=charconsole0: char device redirected to /dev/pts/1 (label charconsole0)
TUNSETVNETLE ioctl() failed: File descriptor in bad state.
2019-01-31 10:57:29.694+0000: shutting down, reason=crashed

affects: qemu-kvm (Ubuntu) → qemu (Ubuntu)

I used a recent version of the software stack from Disco:
- qemu 3.1
- libvirt 5.0
- openvswitch 2.11

With that I had a guest with an OVS device like this:
    <interface type='bridge'>
      <mac address='52:54:00:22:57:fd'/>
      <source network='ovsbr0' bridge='ovsbr0'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='f44ac4e9-fe46-48b8-920c-7ba13dd024ba'/>
      </virtualport>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>

Not too different from yours, I'd think.
The OVS setup is trivial, just having this one interface at the moment.

$ sudo ovs-vsctl show
596674ef-e4cd-471f-9708-9caa5737961c
    Bridge "ovsbr0"
        Port "eno49"
            Interface "eno49"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
        Port "vnet1"
            Interface "vnet1"
    ovs_version: "2.11.0"

$ ip link show dev vnet1
93: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fe:54:00:22:57:fd brd ff:ff:ff:ff:ff:ff

I have started a second guest on the same vswitch (to check traffic from the first guest later on).

Now let's delete that port:
$ sudo ovs-vsctl del-port ovsbr0 vnet1
$ sudo ovs-vsctl show
596674ef-e4cd-471f-9708-9caa5737961c
    Bridge "ovsbr0"
        Port "vnet3"
            Interface "vnet3"
        Port "eno49"
            Interface "eno49"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.11.0"

OK, the OVS device is gone.
Obviously traffic on that interface is dead now, but the guest is still alive and happy.

The host dev is still there:
$ ip link show dev vnet1
93: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fe:54:00:22:57:fd brd ff:ff:ff:ff:ff:ff
Removing that as well, as suggested:

$ sudo ip link del vnet1
$ ip link show dev vnet1
Device "vnet1" does not exist.

The guest is still up and running, while traffic still won't work, for obvious reasons.
Now let's trigger the hot-unplug of the device:

$ virsh detach-device guest-openvswitch-1 net.xml
Device detached successfully

The guest is still happy and alive.
It lost the device (since we detached it) but that is ok and intentional.

To some extent this feels a bit like:
- https://bugzilla.redhat.com/show_bug.cgi?id=1242383
- https://bugzilla.redhat.com/show_bug.cgi?id=1151306

All of those were closed as "invalid host config -> won't fix", so we can't find the fix there.
But something changed to make it work fine in my case; we need to find out what.

We now need to find out what the difference is:
a) Your test case is slightly different, and you can trigger the issue on the same SW levels that work for me; then we need to report it upstream, as those are the very latest versions.
b) Your test passes once you use the more recent SW levels; in that case we need to drill down into your crash and identify the fix (which must be somewhere between qemu 2.11 and 3.1) to consider backporting it.
c) The issue could be arch-dependent (I tested on x86), but we would find that further down the road if you report that (a) happens. After all, TUNSETVNETLE sets big/little-endian operation for Linux tap/macvtap devices, so it could be s390x-only after all.

I can't re-deploy my system to the Bionic-level components that you are using at the moment, and that would also only answer (b), not (a).
Therefore, to differentiate between the cases above, I'd like to ask you to re-run your test on Ubuntu 19.04 with Proposed enabled [1], as the new openvswitch is still in proposed for now.

Report back if you can still trigger the issue; in that case I'll most likely encourage you to report it upstream, and I'd then participate in the discussion there, probably building test PPAs for you as needed.

Also report back if this SW stack works for you as well; in that case I'd wonder if you get an actual crash dump in /var/crash that would help pinpoint where in the qemu code to look for
  TUNSETVNETLE ioctl() failed: File descriptor in bad state.
I'd assume net/tap-linux.c in tap_fd_set_vnet_le, but let's be sure.

[1]: https://wiki.ubuntu.com/Testing/EnableProposed

Changed in qemu (Ubuntu):
status: New → Incomplete
Changed in ubuntu-z-systems:
status: New → Incomplete
Changed in ubuntu-z-systems:
importance: Undecided → Medium

------- Comment From <email address hidden> 2019-02-14 08:22 EDT-------
Can this be reproduced with the upstream qemu? If yes, can you also report this to the qemu-s390x mailing list?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-02-15 03:09 EDT-------
(In reply to comment #12)
> Can this be reproduced with the upstream qemu? If yes, can you also report
> this to the qemu-s390x mailing list?

Have tested and reproduced this bug with the latest SW version:

qemu-system-s390x : QEMU emulator version 3.1.0 (Debian 1:3.1+dfsg-2ubuntu1)
libvirtd : libvirtd (libvirt) 5.0.0
openvswitch: ovs-vsctl (Open vSwitch) 2.11.0 DB Schema 7.16.1

Distributor: Ubuntu Disco Dingo (development branch)
Linux server 4.19.0-12-generic #13-Ubuntu SMP

I have reported this problem to the qemu-s390x mailing list.

Xiao Feng Ren (renxiaof) wrote :

I couldn't get the crash file on the Disco system even though I set up apport, but I did get qemu_system-s390x.crash under /var/crash on the original Ubuntu 16.04 test system (qemu 2.11.1).
