Comment 2 for bug 1977485

Phil Evans (philthinkhuge) wrote :

I have been working with our consultants and can supply a bit more detail here. The important thing to note is that *older* servers that have been up for at least a couple of months, of which there are many, have a networking node of type "bridge" in their XML, whereas newer servers have a networking node of type "ethernet". I believe this is because of a switch from OVS networking to OVN, and I suspect this difference is at the root of the issue.

We are on OVN at the moment, and new VMs are created with "ethernet" interfaces. When I remove an older VM's port through the API, no errors come back, the port is removed in the database, and everything seems to be OK. However, if you look at the instance XML, the interface is still there:

   </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='bridge'>
      <mac address='00:16:3c:6d:ab:b5'/>
      <source bridge='br-int'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='85595837-a555-4404-a0fc-9e65bb4be84a'/>
      </virtualport>
      <target dev='tap85595837-a5'/>
      <model type='virtio'/>
      <mtu size='1500'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/59'/>
      <log file='/var/lib/nova/instances/d471c93e-d4ac-46a8-9381-a78f2cf5b3f5/console.log' append='off'/>
      <target type='isa-serial' port='0'>

As noted in the original comment, the logs do show that it had trouble finding the interface to delete it from the instance.

If you then add back a port with the same MAC address, you now end up with both a "bridge" and an "ethernet" interface:

   <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='bridge'>
      <mac address='00:16:3c:6d:ab:b5'/>
      <source bridge='br-int'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='85595837-a555-4404-a0fc-9e65bb4be84a'/>
      </virtualport>
      <target dev='tap85595837-a5'/>
      <model type='virtio'/>
      <mtu size='1500'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='ethernet'>
      <mac address='00:16:3c:6d:ab:b5'/>
      <target dev='tape9957b4a-9a'/>
      <model type='virtio'/>
      <mtu size='1500'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/59'/>
      <log file='/var/lib/nova/instances/...

At this point, if you try to remove that port through the API, again there are no errors and it appears to have been removed successfully. However, the port is still attached both in the instance XML *and* according to the API, and it is now impossible to remove, no matter how many times you try.

All these problems are solved by rebooting the server, which of course rewrites the XML from scratch, so that it once again matches what it should be.

Our current workaround is to always detach the port directly using virsh on the compute node before doing it through the OpenStack API. Interestingly, as long as you do this *before* trying to remove it via the API, the interface is removed successfully from the instance. However, if you remove it via the API first and then try to remove it via virsh, virsh shows no errors but the interface does not get removed. I surmise this is because by that point the tap interface on the host has already been removed, and I presume libvirt doesn't like that.
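The comment does not give the exact virsh command used. As a minimal sketch, assuming the workaround uses `virsh detach-interface` with the interface type and MAC address taken from the instance XML, the command line could be assembled like this. The helper name `build_detach_command` is hypothetical, and using the instance UUID from the console log path as the domain identifier is an assumption. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_detach_command(domain, iface_type, mac):
    """Build (but do not run) a virsh command to detach one interface.

    iface_type must match the type recorded in the live domain XML
    ("bridge" for the older instances described above).
    """
    return [
        "virsh", "detach-interface", domain,
        "--type", iface_type,
        "--mac", mac,
        "--live",  # act on the running domain
    ]


# Assumed example values: the UUID comes from the console.log path in the
# XML excerpt, the MAC from the "bridge" interface shown there.
cmd = build_detach_command(
    "d471c93e-d4ac-46a8-9381-a78f2cf5b3f5", "bridge", "00:16:3c:6d:ab:b5"
)
print(shlex.join(cmd))
```

Per the ordering described above, this would need to be run on the compute node before the port is removed through the API.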

Here is my theory: OpenStack assumes that *all* instances have an interface of type "ethernet" in the XML, even though older VMs have "bridge" type interfaces. Because of this, I wonder whether OpenStack fails to see the interface even though it is there, complains that it can't find it, and then presumably ignores the failure because it wanted to remove the interface anyway. So now it thinks the interface is gone, as it would expect, but it is not.

Things then end up in a confused state where OpenStack thinks the instance has no interface when it actually still does, and that causes all the subsequent issues.

If that is the case, then OpenStack should not assume that the interface type is always "ethernet", and should instead look for an interface of *any* type with the appropriate MAC address.
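To illustrate the suggestion, here is a minimal sketch of a lookup that matches on MAC address alone and ignores the interface type. It is not how Nova actually locates interfaces; the function name `find_interfaces_by_mac` and the trimmed sample XML (based on the excerpt above) are only for illustration.

```python
import xml.etree.ElementTree as ET


def find_interfaces_by_mac(domain_xml, mac):
    """Return every <interface> element with the given MAC, whatever its type."""
    root = ET.fromstring(domain_xml)
    wanted = mac.lower()
    matches = []
    for iface in root.iter("interface"):
        mac_el = iface.find("mac")
        # Compare case-insensitively; the type attribute is deliberately ignored.
        if mac_el is not None and mac_el.get("address", "").lower() == wanted:
            matches.append(iface)
    return matches


# Trimmed-down version of the domain XML shown above (illustrative only).
SAMPLE = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='00:16:3c:6d:ab:b5'/>
      <source bridge='br-int'/>
      <target dev='tap85595837-a5'/>
    </interface>
    <interface type='ethernet'>
      <mac address='00:16:3c:6d:ab:b5'/>
      <target dev='tape9957b4a-9a'/>
    </interface>
  </devices>
</domain>
"""

for iface in find_interfaces_by_mac(SAMPLE, "00:16:3C:6D:AB:B5"):
    print(iface.get("type"), iface.find("target").get("dev"))
```

With the sample above, both the stale "bridge" interface and the new "ethernet" interface are found, which is the duplicate state described earlier.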

This is a guess on my part, as I am not at all familiar with the inner workings of OpenStack, but that sequence of events makes sense to me.