[SR-IOV] An instance with 2 SR-IOV VF interfaces fails to boot if a compute has 2 SR-IOV NICs in same physnet

Bug #1576185 reported by Mikhail Chernik
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Mirantis OpenStack | Status tracked in 10.0.x | | |
10.0.x | Fix Committed | High | Elena Ezhova |

Bug Description

Environment:

MOS 9.0 ISO 232
1 controller + 2 computes
compute node-1: , 2x10G 82599 NICs. 2nd port is used for SR-IOV, 24 VFs, physnet2
compute node-3: , 2x10G 82599 NICs. Both ports are used for SR-IOV, 24 VFs per port, same physnet (physnet2)

nova-compute.log: http://paste.openstack.org/show/495655/

Expected result:
Instance is in ACTIVE state on any compute node

Actual result:
Instance is in ERROR state after timeout on compute node with 2 SR-IOV NICs in same physnet

Steps to reproduce:
Run this script on a freshly deployed environment:
http://paste.openstack.org/show/495652/

Diagnostic snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2016-04-28_11-31-24.tar.xz

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Can't reproduce on 9.0 iso #250. Also I've found the following errors in logs from your diagnostic snapshot:

node-3: neutron-sriov-agent.log

2016-04-28 11:15:20.704 33100 ERROR neutron.agent.linux.utils [req-9f230b78-31b1-483c-bdd4-8a29aedf9679 - - - - -] Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Operation not supported
2016-04-28 11:15:20.705 33100 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-9f230b78-31b1-483c-bdd4-8a29aedf9679 - - - - -] Device fa:16:3e:ae:13:e5 does not support state change
2016-04-28 11:15:20.840 33100 DEBUG oslo_concurrency.lockutils [req-9f230b78-31b1-483c-bdd4-8a29aedf9679 - - - - -] Lock "qos-port" acquired by "neutron.agent.l2.extensions.qos.handle_port" :: waited 0.000s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2016-04-28 11:15:20.841 33100 INFO neutron.agent.l2.extensions.qos [req-9f230b78-31b1-483c-bdd4-8a29aedf9679 - - - - -] QoS extension did have no information about the port ec35f707-7e9e-4d93-850c-de765411f5f1 that we were trying to reset

Did you enable Neutron QoS on your environment?

Changed in mos:
milestone: 9.0 → 10.0
no longer affects: mos/9.0.x
no longer affects: mos/10.0.x
no longer affects: mos/9.0.x
Changed in mos:
milestone: 10.0 → 9.0
Revision history for this message
Atsuko Ito (yottatsa) wrote :

Looks like nova-compute missed neutron-server notification about port binding.

Spawning:

2016-04-28 11:14:05.336 3707 INFO nova.virt.libvirt.driver Creating image
2016-04-28 11:14:11.124 3707 INFO nova.compute.manager VM Started (Lifecycle Event)
2016-04-28 11:14:11.181 3707 INFO nova.compute.manager VM Paused (Lifecycle Event)
2016-04-28 11:14:11.277 3707 INFO nova.compute.manager During sync_power_state the instance has a pending task (spawning). Skip.

Here we're waiting for port binding, and neutron does it perfectly:

2016-04-28 11:14:12.332 23427 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'ec35f707-7e9e-4d93-850c-de765411f5f1', u'name': u'network-vif-plugged', u'server_uuid': u'aa1af326-ff02-471c-a546-a4efdc871f63', u'code': 200}
2016-04-28 11:14:12.459 23438 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'2be1b1be-f700-49da-b150-b82a3ae983cd', u'name': u'network-vif-plugged', u'server_uuid': u'aa1af326-ff02-471c-a546-a4efdc871f63', u'code': 200}
2016-04-28 11:15:22.997 23375 INFO neutron.notifiers.nova [-] Nova event response: {u'status': u'completed', u'tag': u'ec35f707-7e9e-4d93-850c-de765411f5f1', u'name': u'network-vif-plugged', u'server_uuid': u'aa1af326-ff02-471c-a546-a4efdc871f63', u'code': 200}

2016-04-28 11:14:12.313 14444 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:ec35f707-7e9e-4d93-850c-de765411f5f1 for instance aa1af326-ff02-471c-a546-a4efdc871f63
2016-04-28 11:14:12.451 14467 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:2be1b1be-f700-49da-b150-b82a3ae983cd for instance aa1af326-ff02-471c-a546-a4efdc871f63
2016-04-28 11:15:22.989 14467 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:ec35f707-7e9e-4d93-850c-de765411f5f1 for instance aa1af326-ff02-471c-a546-a4efdc871f63

But nova-compute did not actually get it and raised a timeout in nova.virt.libvirt.driver.LibvirtDriver#_create_domain_and_network:

2016-04-28 11:19:11.124 3707 WARNING nova.virt.libvirt.driver [req-6e606db9-d0cf-4ea5-82aa-8bdb4d7ed63d ff83fb6171cf411db4a05126c9663ccf 2161feb2f5c5425f815e468f30894f4b - - -] [instance: aa1af326-ff02-471c-a546-a4efdc871f63] Timeout waiting for vif plugging callback for instance aa1af326-ff02-471c-a546-a4efdc871f63
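
To make the timeout above easier to reason about, here is a minimal sketch of "wait for a set of vif-plugged events, give up after a timeout". It is not nova's implementation (nova has its own event machinery); the class name `VifPlugWaiter` and the third port ID are placeholders made up for illustration. The two real port IDs and the order of events are taken from the neutron log above.

```python
import threading


class VifPlugWaiter:
    """Hypothetical helper: tracks ports whose vif-plugged event is pending."""

    def __init__(self, port_ids):
        self._pending = set(port_ids)
        self._lock = threading.Lock()
        self._done = threading.Event()
        if not self._pending:
            self._done.set()

    def notify(self, port_id):
        # A duplicate event for an already-plugged port changes nothing.
        with self._lock:
            self._pending.discard(port_id)
            if not self._pending:
                self._done.set()

    def wait(self, timeout):
        """Block up to `timeout` seconds; return the ports still pending."""
        self._done.wait(timeout)
        with self._lock:
            return set(self._pending)


# Three ports are expected; the third ID is a placeholder, not from the logs.
waiter = VifPlugWaiter([
    "ec35f707-7e9e-4d93-850c-de765411f5f1",
    "2be1b1be-f700-49da-b150-b82a3ae983cd",
    "hypothetical-third-port",
])

# Events in the order they appear in the neutron log: one port is seen twice.
for tag in (
    "ec35f707-7e9e-4d93-850c-de765411f5f1",
    "2be1b1be-f700-49da-b150-b82a3ae983cd",
    "ec35f707-7e9e-4d93-850c-de765411f5f1",
):
    waiter.notify(tag)

missing = waiter.wait(timeout=0.1)
print("still waiting for:", missing)
```

Under these assumptions, three delivered events still leave one port pending, so the waiter times out even though the event count looks complete.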

Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
importance: Undecided → High
status: New → Confirmed
tags: added: area-neutron area-nova
Revision history for this message
Atsuko Ito (yottatsa) wrote :

Need debug from nova.compute.manager.ComputeManager#external_instance_event:
            LOG.debug('Received event %(event)s',
                      {'event': event.key},
                      instance=instance)

Could somebody post a log?

Revision history for this message
Atsuko Ito (yottatsa) wrote :

Events arrived at nova-compute (http://paste.openstack.org/show/495690/) but the instance was not unpaused. Clearly a nova-compute bug.

Revision history for this message
Mikhail Chernik (mchernik) wrote :
Revision history for this message
Timofey Durakov (tdurakov) wrote :

According to the logs above, nova waits for 3 events from neutron, one for the private port and 2 for the SR-IOV ports: http://xsnippet.org/361662/
Events were received for 1 private and 1 SR-IOV port only: http://xsnippet.org/361663/ which eventually causes the failure. The same picture is visible from the nova-api side, so there is nothing to fix in nova's internal communication.

Changed in mos:
assignee: MOS Nova (mos-nova) → MOS Neutron (mos-neutron)
Revision history for this message
Atsuko Ito (yottatsa) wrote :

Please also check comment #2: all three events were clearly passed to nova-api, so the situation is not clear.

Revision history for this message
Atsuko Ito (yottatsa) wrote :

Sorry, I was wrong. There was event duplication.
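
The duplication can be checked mechanically. The sketch below counts distinct port IDs in the three nova-api lines quoted in comment #2; the log text is copied from that comment, while the regular expression and the function name `count_events` are my own and only assume the "Creating event <name>:<port> for instance <uuid>" message shape seen there.

```python
import re
from collections import Counter

# nova-api lines copied from comment #2.
NOVA_API_LOG = """\
2016-04-28 11:14:12.313 14444 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:ec35f707-7e9e-4d93-850c-de765411f5f1 for instance aa1af326-ff02-471c-a546-a4efdc871f63
2016-04-28 11:14:12.451 14467 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:2be1b1be-f700-49da-b150-b82a3ae983cd for instance aa1af326-ff02-471c-a546-a4efdc871f63
2016-04-28 11:15:22.989 14467 INFO nova.api.openstack.compute.server_external_events Creating event network-vif-plugged:ec35f707-7e9e-4d93-850c-de765411f5f1 for instance aa1af326-ff02-471c-a546-a4efdc871f63
"""

EVENT_RE = re.compile(
    r"Creating event (?P<name>[\w-]+):(?P<tag>[0-9a-f-]{36}) "
    r"for instance (?P<instance>[0-9a-f-]{36})"
)


def count_events(log_text):
    """Return a Counter mapping port ID -> number of events seen for it."""
    return Counter(m.group("tag") for m in EVENT_RE.finditer(log_text))


counts = count_events(NOVA_API_LOG)
duplicates = {tag: n for tag, n in counts.items() if n > 1}
print("distinct ports:", len(counts))
print("duplicated:", duplicates)
```

Three events were logged, but they cover only two distinct ports; port ec35f707-7e9e-4d93-850c-de765411f5f1 appears twice.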

Revision history for this message
Atsuko Ito (yottatsa) wrote :

Looks like only one port is updated in neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent when the instance is booted.

Revision history for this message
Atsuko Ito (yottatsa) wrote :

On the node that has two SR-IOV NICs, apply the patch and restart the sriov-agent.

Revision history for this message
Atsuko Ito (yottatsa) wrote :

Two physical functions are used for an instance with two SR-IOV ports, and port lookup seems broken.

Hypervisor: assignable PCI devices ... "address": "0000:03:15.6", "parent_addr": "0000:03:00.0", "address": "0000:03:15.7", ... "parent_addr": "0000:03:00.1"

Related bug https://bugs.launchpad.net/neutron/+bug/1558626
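
One way such a lookup could break, shown as a hedged sketch rather than as Neutron's actual code: if the physnet-to-NIC mapping keeps only one physical function per physnet, the second NIC in the same physnet is silently dropped and its VFs are never handled. The function names below are hypothetical, and identifying the PFs by the parent_addr values from the hypervisor output is an assumption made for illustration.

```python
from collections import defaultdict

# Both physical functions (parent_addr values from the hypervisor output)
# are mapped to the same physnet.
MAPPINGS = [
    ("physnet2", "0000:03:00.0"),
    ("physnet2", "0000:03:00.1"),
]


def parse_unique(pairs):
    """One PF per physnet: a later entry overwrites an earlier one."""
    mapping = {}
    for physnet, pf in pairs:
        mapping[physnet] = [pf]
    return mapping


def parse_multi(pairs):
    """Several PFs per physnet: every entry is kept."""
    mapping = defaultdict(list)
    for physnet, pf in pairs:
        mapping[physnet].append(pf)
    return dict(mapping)


def pf_is_managed(mapping, physnet, pf):
    """True if the given PF is known for the physnet."""
    return pf in mapping.get(physnet, [])


unique = parse_unique(MAPPINGS)
multi = parse_multi(MAPPINGS)

for pf in ("0000:03:00.0", "0000:03:00.1"):
    print(pf,
          "unique:", pf_is_managed(unique, "physnet2", pf),
          "multi:", pf_is_managed(multi, "physnet2", pf))
```

With the one-per-physnet mapping, a VF whose parent is 0000:03:00.0 is not found, which would match the symptom of only one of the two SR-IOV ports being updated.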

Revision history for this message
Atsuko Ito (yottatsa) wrote :

https://review.openstack.org/310927 Fix SR-IOV binding when two NICs mapped to one physnet

Changed in mos:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Elena Ezhova (eezhova)
Changed in mos:
assignee: Oleg Bondarev (obondarev) → Elena Ezhova (eezhova)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Is it really in progress? Where can we find the code on review?

Revision history for this message
Elena Ezhova (eezhova) wrote :

There is a patch by Vladimir Eremin on review in upstream, see comment #12.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Change author: Vladimir Eremin <email address hidden>
Review: https://review.fuel-infra.org/20365

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (stable/mitaka)

Change abandoned by Elena Ezhova <email address hidden> on branch: stable/mitaka
Review: https://review.fuel-infra.org/20365

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (9.0/mitaka)

Fix proposed to branch: 9.0/mitaka
Change author: Vladimir Eremin <email address hidden>
Review: https://review.fuel-infra.org/20366

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (9.0/mitaka)

Change abandoned by Alexander Ignatov <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/20366
Reason: Not needed, it will be merged soon as part of stable/mitaka sync

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/20797
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 7a3b99f6f9783953c61101fd0d2813cb0ddea09c
Author: Jenkins <email address hidden>
Date: Thu May 19 10:07:27 2016

Merge the tip of origin/stable/mitaka into origin/9.0/mitaka

ca690cc DVR: Fix TypeError in arp update with allowed_address_pairs
41e0fcd DVR: Handle unbound allowed_address_pair port with FIP
30a849e Fix SR-IOV binding when two NICs mapped to one physnet
65bb2d5 Fix test failure against latest oslo.* from master
3ab2ada Add exponential back-off RPC client
8825166 Use correct session in update_allocation_pools
a88b41c Don't log warning for missing resource_versions

Closes-Bug: #1576185
Closes-Bug: #1575554
Change-Id: I95d3e0fc16624d23ddf442723921b5153d898b0a

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Sergii (sgudz) wrote :

Verified on mos 9.0 iso 459. Fixed

Changed in mos:
status: Fix Committed → Fix Released