Booting SR-IOV VMs one-by-one sometimes fails

Bug #1737128 reported by Oleg Bondarev
Affects: Mirantis OpenStack
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev

Bug Description

When booting SR-IOV VMs one-by-one, the VM boot sometimes fails with a VirtualInterfaceCreateException error.
The cause of this error is unknown and it appears only occasionally.
Out of 100 VM boots it appeared 3 times when we were using the i40e PF driver and the i40evf VF driver on a Fortville NIC.
However, when the i40e PF driver is used with the vfio-pci VF driver, the problem appears more frequently: 3 times out of 30 tries.
While trying to identify the problem, it appears that communication between nova and neutron might be failing.
Currently we are using i40e driver version:
version: 2.1.26
---
vfio-pci:
version: 0.2
srcversion: E7FB8B08C7E3C0BC5FE7954
depends: vfio,irqbypass,vfio_virqfd
intree: Y
vermagic: 4.4.0-91-generic SMP mod_unload modversions
---
Could you please check whether you see any problems when SR-IOV VMs are booted and deleted one after another at least 30-40 times?

This is from the nova-compute.log on the compute where the VM startup fails:

2017-11-28T10:00:34.529424+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:34.528 22532 WARNING nova.virt.libvirt.driver [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Timeout waiting for vif plugging callback for instance ec65f71a-e94a-4508-bc26-722dddd3cbf3
2017-11-28T10:00:35.137933+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.137 22532 INFO nova.virt.libvirt.driver [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Deleting instance files /var/lib/nova/instances/ec65f71a-e94a-4508-bc26-722dddd3cbf3_del
2017-11-28T10:00:35.138655+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.138 22532 INFO nova.virt.libvirt.driver [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Deletion of /var/lib/nova/instances/ec65f71a-e94a-4508-bc26-722dddd3cbf3_del complete
2017-11-28T10:00:35.252593+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Instance failed to spawn
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Traceback (most recent call last):
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2239, in _build_resources
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] yield resources
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2085, in _build_and_run_instance
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] block_device_info=block_device_info)
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2827, in spawn
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] block_device_info=block_device_info)
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5244, in _create_domain_and_network
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] raise exception.VirtualInterfaceCreateException()
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] VirtualInterfaceCreateException: Virtual Interface creation failed
2017-11-28 10:00:35.251 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3]
2017-11-28T10:00:35.253282+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.252 22532 INFO nova.compute.manager [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Terminating instance
2017-11-28T10:00:35.259724+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.259 22532 INFO nova.virt.libvirt.driver [-] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] During wait destroy, instance disappeared.
2017-11-28T10:00:35.375574+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.374 22532 INFO nova.virt.libvirt.driver [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Deletion of /var/lib/nova/instances/ec65f71a-e94a-4508-bc26-722dddd3cbf3_del complete
2017-11-28T10:00:35.478930+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.478 22532 INFO nova.compute.manager [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Took 0.22 seconds to destroy the instance on the hypervisor.
2017-11-28T10:00:35.662721+01:00 compute-0-4.domain.tld nova-compute[22532]: 2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [req-197c9a8a-8cef-4a03-bee5-387ca28f1d52 35f64d3cbaf8423aa29dd1a9cdd51f5b 575ce3dbc5844d4dbc0cba96eb02155c - - -] [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Failed to allocate network(s)
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] Traceback (most recent call last):
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2085, in _build_and_run_instance
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] block_device_info=block_device_info)
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2827, in spawn
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] block_device_info=block_device_info)
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5244, in _create_domain_and_network
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] raise exception.VirtualInterfaceCreateException()
2017-11-28 10:00:35.661 22532 ERROR nova.compute.manager [instance: ec65f71a-e94a-4508-bc26-722dddd3cbf3] VirtualInterfaceCreateException: Virtual Interface creation failed

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Customer analysis:

We have done some analysis and the issue points towards the neutron-sriov-agent's behavior when the 'ip link show' command's output is incorrect.

We are already working with Intel to find out why this behavior occurs with the i40e driver; however, the sriov-agent should be more robust and should not fail in this scenario.

Let me describe what we have observed:

A,
The sriov-agent reports the following warning and after this it never reports back to neutron and nova that vif-plugging was successful, which causes the VirtualInterfaceCreateException in nova-compute.

2017-11-28T13:52:05.162410+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] Cannot find vfs [7] in device eth5

B,
We figured out that the reason for this warning is that, during vif-plugging, the 'ip link show' command's output is missing the affected VF for roughly 300 ms, probably while the VF is being enabled.

2017-11-28T13:52:05.162193+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] 7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2140 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:a0:3d:4a brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 0e:66:2b:22:c4:ad, spoof checking on, link-state auto
    vf 1 MAC 0e:70:49:f2:47:05, spoof checking on, link-state auto
    vf 2 MAC 0e:27:e0:5f:ec:84, spoof checking on, link-state auto
    vf 3 MAC 0e:67:9c:28:af:1c, spoof checking on, link-state auto
    vf 4 MAC 0e:32:4b:5c:ba:3a, spoof checking on, link-state auto
    vf 5 MAC 0e:8d:57:4e:13:01, spoof checking on, link-state auto
    vf 6 MAC 0e:4d:f3:59:90:b9, spoof checking on, link-state auto
2017-11-28T13:52:05.162410+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] Cannot find vfs [7] in device eth5
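
The warning is raised when the agent cannot map the expected VF index to a 'vf N ...' line in the 'ip link show' output. Below is a minimal standalone sketch of that kind of parsing; the helper name and regular expression are illustrative only, not the actual pci_lib code.

import re

# Matches lines like "vf 7 MAC fa:16:3e:92:c8:ce, vlan 550, spoof checking on, ..."
VF_LINE_RE = re.compile(r"^\s*vf\s+(?P<index>\d+)\s+MAC\s+(?P<mac>[0-9a-fA-F:]+)")

def parse_vf_macs(ip_link_output):
    """Return a {vf_index: mac} dict parsed from 'ip link show <dev>' output."""
    macs = {}
    for line in ip_link_output.splitlines():
        match = VF_LINE_RE.match(line)
        if match:
            macs[int(match.group("index"))] = match.group("mac")
    return macs

# In the excerpt above vf 7 is missing, so a lookup for index 7 fails and the
# agent ends up logging "Cannot find vfs [7] in device eth5".
sample = (
    "7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2140 qdisc mq state UP\n"
    "    link/ether 3c:fd:fe:a0:3d:4a brd ff:ff:ff:ff:ff:ff\n"
    "    vf 6 MAC 0e:4d:f3:59:90:b9, spoof checking on, link-state auto\n"
)
print(7 in parse_vf_macs(sample))  # False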

To test the idea I made a quick fix in pci_lib.py's get_assigned_macs function: it retries the 'ip link show' call if get_vf_link_show comes back empty.

Then we ran the test of 100 VM boots and 100/100 were successful; before the fix the issue happened in at least 15 out of 100 cases.
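
For illustration, here is a rough sketch of that retry idea, not the actual patch applied to get_assigned_macs. parse_vf_macs() is the parsing helper sketched above and run_ip_link_show() is a hypothetical stand-in for the real 'ip link show' invocation in pci_lib.py.

import subprocess
import time

def run_ip_link_show(device):
    # Hypothetical helper: call the real 'ip link show <device>' command.
    return subprocess.check_output(["ip", "link", "show", device]).decode()

def get_vf_mac_with_retry(device, vf_index, retries=5, delay=0.1):
    """Retry 'ip link show' while the requested VF is missing from its output.

    Intended to paper over the ~300 ms window in which the driver omits a
    freshly enabled VF from the command output.
    """
    for _ in range(retries):
        vf_macs = parse_vf_macs(run_ip_link_show(device))
        if vf_index in vf_macs:
            return vf_macs[vf_index]
        time.sleep(delay)
    raise RuntimeError("VF %d still missing from %s after %d retries"
                       % (vf_index, device, retries))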

CONCLUSION:
We would like the neutron-sriov-agent to be fixed so that it does not merely log a warning in a situation that can ultimately make the VM boot fail. If it logs such a warning, it should at least RE-TRY rather than hang without any error until the vif-plug timeout occurs. We are aware that the driver should be investigated, but the sriov-agent should be intelligent enough to retry and continue when the issue is not critical.

tags: added: sla1
Revision history for this message
Oleg Bondarev (obondarev) wrote :

So what happens is that initially (at the start of a new agent loop iteration) the agent sees a new device being added:

2017-11-28T13:52:04.959947+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:04.959 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] LOTR: 7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2140 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:a0:3d:4a brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 0e:66:2b:22:c4:ad, spoof checking on, link-state auto
    vf 1 MAC 0e:70:49:f2:47:05, spoof checking on, link-state auto
    vf 2 MAC 0e:27:e0:5f:ec:84, spoof checking on, link-state auto
    vf 3 MAC 0e:67:9c:28:af:1c, spoof checking on, link-state auto
    vf 4 MAC 0e:32:4b:5c:ba:3a, spoof checking on, link-state auto
    vf 5 MAC 0e:8d:57:4e:13:01, spoof checking on, link-state auto
    vf 6 MAC 0e:4d:f3:59:90:b9, spoof checking on, link-state auto
    vf 7 MAC fa:16:3e:92:c8:ce, vlan 550, spoof checking on, link-state enable

2017-11-28T13:52:05.003570+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.003 194171 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] Agent loop found changes! {'current': <...> 'added': set([(u'fa:16:3e:92:c8:ce', '0000:3b:0a.7')]), 'updated': set([(u'fa:16:3e:92:c8:ce', u'0000:3b:0a.7')])}

But when the agent starts processing the added devices, the new device disappears:

2017-11-28T13:52:05.162193+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] LOTR: 7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2140 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:a0:3d:4a brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 0e:66:2b:22:c4:ad, spoof checking on, link-state auto
    vf 1 MAC 0e:70:49:f2:47:05, spoof checking on, link-state auto
    vf 2 MAC 0e:27:e0:5f:ec:84, spoof checking on, link-state auto
    vf 3 MAC 0e:67:9c:28:af:1c, spoof checking on, link-state auto
    vf 4 MAC 0e:32:4b:5c:ba:3a, spoof checking on, link-state auto
    vf 5 MAC 0e:8d:57:4e:13:01, spoof checking on, link-state auto
    vf 6 MAC 0e:4d:f3:59:90:b9, spoof checking on, link-state auto
2017-11-28T13:52:05.162410+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] Cannot find vfs [7] in device eth5

2017-11-28T13:52:05.162770+01:00 compute-0-1.domain.tld neutron-sriov-nic-agent[194171]: 2017-11-28 13:52:05.162 194171 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-72a5e67c-50cc-43e7-a686-8b68f81410fb - - - - -] No device with MAC fa:16:3e:92:c8:ce defined on agent.

In such a case I don't think it would be correct to add a retry, as we can't handle every driver's incorrect behavior by adding random retries to the agent code.
I'd suggest setting the "resync" flag in this case, so the agent treats it as an exc...
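
For illustration only, a conceptual sketch of that resync-based approach is below; the method names are hypothetical assumptions, not the actual sriov_nic_agent code (the real change is in the review referenced in the following comments).

def treat_added_devices(agent, devices):
    # 'devices' is a set of (mac, pci_slot) pairs that the previous scan
    # reported as newly added.
    resync = False
    for mac, pci_slot in devices:
        if not agent.device_still_present(mac, pci_slot):  # hypothetical check
            # The device reported as added has meanwhile vanished from
            # 'ip link show'; instead of silently skipping it, request a full
            # resync so it is picked up again on the next loop iteration.
            resync = True
            continue
        agent.plug_device(mac, pci_slot)  # hypothetical per-device handling
    return resync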


Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/37426

Changed in mos:
status: Confirmed → In Progress
Changed in mos:
milestone: none → 9.2-mu-4
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/37426
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 51b41fda8dc8f7bdfa4a67c1da13052d60a3cc4d
Author: Oleg Bondarev <email address hidden>
Date: Mon Dec 11 13:19:34 2017

Do agent resync in case device disappears

Sometimes due to NIC driver incorrect behavior VFs
might be missing in 'ip link show' output.
This may lead to VM boot failure as agent just skips such missing
devices.
This patch makes agent do resync in case new added device 'disappears'
during processing.

Related-PROD: PROD-16478
Closes-Bug: #1737128

Change-Id: I8a97a08d04699a49c0de740773e48e93d709cf2a

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Vladimir Jigulin (vjigulin) wrote :

Closing on 9.2-mu-4 based on the customer's confirmation that the fix is working.

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (mcp/1.0/mitaka)

Fix proposed to branch: mcp/1.0/mitaka
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/38062
