We're seeing an intermittent issue with the Pike release where a VM with SRIOV NIC gets stuck in the "rebuild_spawning" task state after a "nova evacuate" operation (which rebuilds the instance on a different compute
node).
The specific issue we're seeing is that occasionally neutron seems to get "stuck". Nova is down in nova.virt.libvirt.driver.LibvirtDriver._create_domain_and_network(), and five minutes later
it hits the eventlet.timeout.Timeout exception clause at https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L5451
We think that maybe there's a race condition between nova/libvirt setting the MAC for the VF and the VF actually being usable/visible to the system. Closest mention of this upstream is this:
https://www.spinics.net/lists/kvm/msg165297.html
To test this we modified get_assigned_macs() at https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/mech_sriov/agent/pci_lib.py#L75 to wait 2 seconds if vf_lines
is empty,then call _get_vf_link_show() and get vf_lines again (max 3 tries).
After a few instance evacuates, the retry code was exercised after message "Cannot find vfs" was produced, and 1 retry was executed, and this time vf_lines had data in it.
Details:
The initial nova log looks like this:
2018-03-17 10:19:41.344 58584 WARNING nova.compute.manager [req-5283f1b9-23d3-472c-80df-4b8d73f8523b 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-0ac89834-0220-48ef-8559-e48f65fe4efb for instance
2018-03-17 10:19:42.770 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 0ac89834-0220-48ef-8559-e48f65fe4efb with attributes {'binding:profile': {}, 'binding:host_id': 'compute-1'}
2018-03-17 10:19:44.032 58584 WARNING nova.compute.manager [req-a1f9f8fb-cdec-40c1-be2c-0b052305e331 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
2018-03-17 10:19:44.048 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 59ce6131-07ea-4b6c-ac62-9c27e99149b8 with attributes {'binding:profile': {u'physical_network': u'group0-data0b', u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'pci_vendor_info': u'8086:154c'}, 'binding:host_id': 'compute-1'}
2018-03-17 10:19:44.438 58584 WARNING nova.compute.manager [req-7478c956-46cd-4651-9ea7-e477f2a99747 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-plugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
2018-03-17 10:19:45.475 58584 INFO nova.compute.manager [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Detaching volume d4cb0ea9-a689-4203-8e38-40f16a448915
2018-03-17 10:19:46.091 58584 WARNING nova.compute.manager [req-b1d1c529-d32f-4d99-8b45-c2bea4ada9b3 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
Then the neutron log shows "Trigger reload_allocations" a few times, and the last one looks like:
2018-03-17 10:19:45.032 56804 INFO neutron.agent.dhcp.agent [req-5240cf46-0dff-4344-9d57-9e8f133559b0 1188dcf2254945e493726a4245e1a676 4a7e8af978b043c1a0a26eafd52e0301 - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=compute-1, binding:profile=pci_request_id=b6ccb9b3-1551-4be7-812f-6108786b2b0a, pci_slot=0000:81:0f.7, pci_vendor_info=8086:154c, physical_network=group0-data0b, binding:vif_details=port_filter=False, vlan=2389, binding:vif_type=hw_veb, binding:vnic_type=direct, created_at=2018-03-17T10:18:29Z, description=, device_id=983e3130-cb8f-4715-9e22-4d373fe4bb77, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], id=59ce6131-07ea-4b6c-ac62-9c27e99149b8, mac_address=fa:16:3e:8a:c2:d2, name=, network_id=a201980f-83e9-4fdd-9933-2fdfd58b8089, project_id=37fad54aace64d13b8d03bfa83563d59, revision_number=11, security_groups=[], status=DOWN, tags=[], tenant_id=37fad54aace64d13b8d03bfa83563d59, updated_at=2018-03-17T10:19:44Z, wrs-binding:mac_filtering=False, wrs-binding:mtu=9216, wrs-binding:vif_model=pci-sriov
2018-03-17 10:19:53.860 57003 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-4b2db1f7-06a8-42c5-9dab-1bbf7a937da3 - - - - -] Port fa:16:3e:8a:c2:d2 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'59ce6131-07ea-4b6c-ac62-9c27e99149b8', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-03-16T20:58:50Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'78e01109154c49aab9960640407d96af', u'status': u'ACTIVE', u'subnets': [u'a7062041-37ab-468d-a1eb-f33d3b3621ee'], u'wrs-tm:qos': u'b1ec2429-3158-4716-8c5e-1848aac9c5fb', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2389, u'name': u'internal0-net0-1', u'admin_state_up': True, u'tenant_id': u'78e01109154c49aab9960640407d96af', u'created_at': u'2018-03-16T20:58:36Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:8a:c2:d2', u'profile': {u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'last_ip': u'10.1.1.254', u'id': u'55b74d0a-0e02-4026-8964-5e254d8d2566', u'first_ip': u'10.1.1.17'}], u'standard_attr_id': 88, u'ipv6_address_mode': None, u'id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:36.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 2, u'id': 49, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:50.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 0, u'id': 88, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'78e01109154c49aab9960640407d96af', u'subnetpool': None, u'routes': [], u'cidr': u'10.1.1.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0-1', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'78e01109154c49aab9960640407d96af', u'id': u'66bc32fc-99ca-468b-a1de-d55c5c79b3bc', u'target_tenant': u'*', u'object_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2389}], u'network_qos_policy_id': None, u'segmentation_id': 2389, u'port_qos_policy': None, u'device': u'fa:16:3e:8a:c2:d2', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'tenant_id': u'37fad54aace64d13b8d03bfa83563d59', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'}
Then the nova log shows libvirt timing out after 5 minutes:
2018-03-17 10:24:54.058 58584 WARNING nova.virt.libvirt.driver [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Timeout waiting for vif plugging callback for instance with vm_state error and task_state rebuild_spawning.
Repeating the experiment with debug logs enabled for neutron gave the following:
2018-04-23 18:14:19.561 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Agent loop found changes! {'current': set([('00:00:00:00:00:00', '0000:81:0f.5'), ('00:00:00:00:00:00', '0000:81:0f.4'), ('00:00:00:00:00:00', '0000:81:0e.6'), ('00:00:00:00:00:00', '0000:81:0e.7'), ('00:00:00:00:00:00', '0000:81:0f.6'), ('00:00:00:00:00:00', '0000:81:0e.4'), ('00:00:00:00:00:00', '0000:81:0e.5'), ('00:00:00:00:00:00', '0000:81:0e.2'), ('00:00:00:00:00:00', '0000:81:0e.3'), ('00:00:00:00:00:00', '0000:81:0f.1'), ('fa:16:3e:d6:01:a7', '0000:81:0f.7'), ('00:00:00:00:00:00', '0000:81:0e.0'), ('00:00:00:00:00:00', '0000:81:0f.0'), ('00:00:00:00:00:00', '0000:81:0e.1'), ('00:00:00:00:00:00', '0000:81:0f.3'), ('00:00:00:00:00:00', '0000:81:0f.2')]), 'removed': set([('00:00:00:00:00:00', '0000:81:0f.7')]), 'added': set([('fa:16:3e:d6:01:a7', '0000:81:0f.7')]), 'updated': set([])} daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:388
2018-04-23 18:14:19.562 73750 INFO neutron.agent.securitygroups_rpc [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Skipping method prepare_devices_filter as firewall is disabled or configured as NoopFirewallDriver.
2018-04-23 18:14:19.828 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC address fa:16:3e:d6:01:a7 is added treat_devices_added_updated /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:294
2018-04-23 18:14:19.829 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port fa:16:3e:d6:01:a7 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'f90e58b4-bbc7-4eb2-991b-46406d3d6470', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-04-23T16:48:51Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'status': u'ACTIVE', u'subnets': [u'f452f5c3-a686-45cd-8098-0d092bd5e21e'], u'wrs-tm:qos': u'7d2f03f3-b9c6-4183-9859-878bbb8141e7', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2285, u'name': u'internal0-net0', u'admin_state_up': True, u'tenant_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'created_at': u'2018-04-23T16:48:46Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'ip_address': u'10.0.0.132'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:d6:01:a7', u'profile': {u'pci_request_id': u'89135f77-a4df-4744-b228-d396b7c0e876', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'last_ip': u'10.0.0.254', u'id': u'86242517-d9dc-4c59-a3d6-17881ee0d42c', u'first_ip': u'10.0.0.128'}], u'standard_attr_id': 11, u'ipv6_address_mode': None, u'id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:46.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 2, u'id': 9, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:51.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 0, u'id': 11, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'subnetpool': None, u'routes': [], u'cidr': u'10.0.0.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'id': u'9c699b99-a98c-48ce-bdee-19e6e8483429', u'target_tenant': u'*', u'object_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2285}], u'network_qos_policy_id': None, u'segmentation_id': 2285, u'port_qos_policy': None, u'device': u'fa:16:3e:d6:01:a7', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'tenant_id': u'34a91443256b4db1aa2ba9d305079d41', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'}
2018-04-23 18:14:19.830 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92
2018-04-23 18:14:19.850 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show', 'ens787f3'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92
2018-04-23 18:14:19.872 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Cannot find vfs [15] in device ens787f3
2018-04-23 18:14:19.873 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.eswitch_manager [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] device pci mismatch: fa:16:3e:d6:01:a7 - 0000:81:0f.7
2018-04-23 18:14:19.874 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] No device with MAC fa:16:3e:d6:01:a7 defined on agent.
2018-04-23 18:14:19.880 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Removing device with MAC address 00:00:00:00:00:00 and PCI slot 0000:81:0f.7
2018-04-23 18:14:19.881 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] port_id to device with MAC 00:00:00:00:00:00 not found
2018-04-23 18:14:19.964 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC 00:00:00:00:00:00 and PCI slot 0000:81:0f.7 updated.
It seems like the VF's MAC address was programmed okay, but we don't see the VF being removed even though there's a "Cannot find vfs" log.
There's a host kernel log that makes us wonder if there's some sort of delay on this hardware:
2018-04-23T18:14:19.377 compute-1 kernel: info [ 418.459004] i40e 0000:81:00.3: Setting MAC fa:16:3e:d6:01:a7 on VF 15
2018-04-23T18:14:19.472 compute-1 kernel: info [ 418.553081] i40e 0000:81:00.3: Reload the VF driver to make this change effective.
2018-04-23T18:14:19.472 compute-1 kernel: info [ 418.553138] i40e 0000:81:00.3: Setting VLAN 2285, QOS 0x0 on VF 15
2018-04-23T18:14:19.851 compute-1 kernel: info [ 418.930504] vfio-pci 0000:81:0f.7: enabling device (0000 -> 0002)
2018-04-23T18:14:19.869 compute-1 kernel: err [ 418.944122] i40e 0000:81:00.3: VF 15 still in reset. Try again.
Hi Chris,
IIUC, this is induced by the hardware not setting itself up fast enough, so the sriov neutron agent doesn't find the MAC and decides "Removing device with MAC address 00:00:00:00:00:00 and PCI slot 0000:81:0f.7". Should the retry you tested be the approach to handle this hw slowness?