cannot schedule ovs sriov offload port to tunneled segment
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | In Progress | Wishlist | Unassigned |
Bug Description
We observed a scheduling failure when using ovs sriov offload (https:/ ) in combination with multisegment networks. The problem seems to affect the case where the port should be bound to a tunneled network segment (a segment that does not have a physnet).
I read that the nova scheduler works the same way for pci sriov passthrough, so I believe the same bug affects pci sriov passthrough as well, though I did not test that.
Due to the special hardware needed for this environment I could not reproduce this in devstack, but I hope we have collected enough information to show the error regardless. We believe we have also identified the relevant lines of code.
The overall setup includes l2gw, connecting the segments in the multisegment network, but I will ignore that here, since l2gw cannot be part of the root cause. Neutron was configured with mechanism_
As I understand the problem:
1) ovs sriov offload port on a single segment neutron network, where the segment is vxlan: works
2) normal port (--vnic-type normal) on a non-offload-capable ovs, on a multisegment neutron network with one vlan and one vxlan segment, where the port should be bound to the vxlan segment: works
3) ovs sriov offload port on a multisegment neutron network with one vlan and one vxlan segment, where the port should be bound to the vxlan segment: does not work
To reproduce:
* create a multisegment network with one vlan and one vxlan segment
* create a port on that network with "--vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' --disable-
* boot a vm with that port
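For what it's worth, the same reproduction expressed via openstacksdk might look roughly like this (an untested sketch; the cloud name, network name, image and flavor are placeholders, since the exact values are truncated or environment-specific above):

# Rough openstacksdk equivalent of the reproduction steps above.
# Untested sketch; names below are placeholders.
import openstack

conn = openstack.connect(cloud='mycloud')  # assumed clouds.yaml entry

# The multisegment network (one vlan and one vxlan segment) is assumed
# to exist already, created as in the commands quoted later.
net = conn.network.find_network('esohtom_ms')  # placeholder name

# ovs sriov offload port: direct vnic with the switchdev capability
# and port security disabled.
port = conn.network.create_port(
    network_id=net.id,
    binding_vnic_type='direct',
    binding_profile={'capabilities': ['switchdev']},
    is_port_security_enabled=False,
)

# Boot a vm with that port.
server = conn.compute.create_server(
    name='c3_ms_1',
    image_id=conn.compute.find_image('BAT-image').id,
    flavor_id=conn.compute.find_flavor('flavor_1').id,
    networks=[{'port': port.id}],
)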
On the compute host on which we expect the scheduling and boot to succeed we have configuration like:
[pci]
passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
According to https:/
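To spell out how such a whitelist entry interacts with a PCI request, here is an illustrative simplification (my own sketch, not nova's actual matching code; function and variable names are made up):

# Illustrative simplification of whitelist/request matching.
# Not nova's real code; names are made up for this sketch.

WHITELIST = [
    {'devname': 'data2', 'physical_network': None},  # "null" in the config
    {'devname': 'data3', 'physical_network': None},
]

def device_satisfies_request(device, requested_physnet):
    # A device whitelisted with "physical_network": null carries no physnet,
    # so it can only satisfy a request that does not ask for one.
    return device['physical_network'] == requested_physnet

# A request without a physnet (tunneled segment) matches:
assert device_satisfies_request(WHITELIST[0], None)
# A request carrying a physnet (e.g. from a vlan segment) does not:
assert not device_satisfies_request(WHITELIST[0], 'DC259-CEE3-DCGW-NET')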
The vm boot fails with:
$ openstack server show c3_ms_1
...
| fault | {'code': 500, 'created': '2022-07-
...
In the scheduler logs we see that the scheduler uses a spec with a physnet. But the pci passthrough capability is on a device without a physnet.
controlhost3:
<180>2022-
We observed the bug originally on stable/victoria and found these source code lines:
https:/
Here, for vnic_type=direct ports, we unconditionally add a physnet to the spec.
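A hedged paraphrase of that piece, as I read it (heavily simplified, not the verbatim nova source; names are illustrative):

# Simplified paraphrase of the spec construction for vnic_type=direct
# ports; not the verbatim nova source.
def build_pci_request_spec(physnet):
    # The 'physical_network' tag is always added, with whatever value was
    # resolved for the network -- possibly None for a tunneled network,
    # or the vlan segment's physnet for a multisegment network.
    return {'physical_network': physnet}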
From victoria to master the only recent change in that piece is this:
https:/
That change seems irrelevant to this bug, so I believe the bug can be reproduced on master too.
Since I have not reproduced this bug myself, I include below my colleague Angelo Nappo's email to me, containing the exact commands and configs. However, I hope the above already provides all the relevant information, with all downstream-specific details eliminated:
The use case in the legacy NFVI solution
The system is based on VXLAN+SDN. If the VM has to reach any host outside the data center:
* a MS network is created
* an l2gw connection is created, so the segments are “joined together” in the switch fabric that does the vlan-to-vxlan transformation
For example:
openstack network create --provider-
openstack network segment create --network-type vxlan --network esohtom_
openstack network segment create --physical-network DC259-CEE3-DCGW-NET --segment 638 --network-type vlan --network esohtom_
ceeinfra@
+------
| ID | Name | Network | Network Type | Segment |
+------
| 1adda535-
| 29b3505c-
| 506c76db-
| 578d3f85-
| 78b21ff9-
| 7b112fcd-
| 8ca5d660-
| 977147b3-
| 9b6369fe-
| cd6f295e-
+------
Note: The second vxlan segment (2700) was created to see if it makes any difference, but it is for sure not needed and not wanted.
openstack subnet create --network esohtom_
openstack port create --vnic-type normal --disable-
neutron l2-gateway-
neutron l2-gateway-
neutron l2-gateway-
ceeinfra@
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | LCnpM4wskqoL |
| config_drive | |
| created | 2022-07-
| flavor | flavor_1 (e3a09880-
| hostId | |
| id | 5fab4bbc-
| image | BAT-image (a2e36de0-
| key_name | None |
| name | c1_ms_1 |
| progress | 0 |
| project_id | cf7024f0f2bd46a
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2022-07-
| user_id | fcd3b2713191485
| volumes_attached | |
+------
ceeinfra@
+------
| ID | Name | Status | Networks | Image | Flavor |
+------
| 5fab4bbc-
| 5cb48a3e-
| e343d12a-
| cd492514-
| 2806ff70-
| 58e47d2d-
| 2ba6d7ce-
| 2e434248-
| 8990cec4-
| 59e4a587-
| a0446d12-
| 7aed1ccd-
| f1ff093b-
| 16c439ba-
| 7d5472ee-
+------
This VM can actually ping the DC gateway. The compute where the VM is running uses the legacy vswitch (OVS-dpdk) with no offload capabilities.
controlhost1:
[ml2]
type_drivers = vxlan,vlan,flat
tenant_
mechanism_drivers = sriovnicswitch,
extension_drivers = port_security,qos
path_mtu = 2140
physical_
[ml2_type_vlan]
network_vlan_ranges = DC259-CEE3-
[ml2_type_flat]
flat_networks = DC259-CEE3-
[ml2_type_vxlan]
vni_ranges = 2001:2999
[securitygroup]
firewall_driver = neutron.
[ml2_sdi]
[ml2_bsp]
[ml2_odl]
controlhost1:
[ml2_odl]
url = redacted
username = redacted
password = redacted
enable_dhcp_service = False
enable_full_sync = false
port_binding_
enable_
odl_features = "operational-
# scheduler config
[filter_scheduler]
enabled_filters = AggregateMultiT
The use case in the NFVI solution including smartNIC (OVS kernel datapath offload to smart VFs)
Now I create the neutron port representing a smart VF in the MS network, for which I have the necessary capabilities on compute3 and compute4:
ceeinfra@
+------
| Field | Value |
+------
| admin_state_up | UP |
| allowed_
| binding_host_id | |
| binding_profile | capabilities=
| binding_vif_details | |
| binding_vif_type | unbound |
| binding_vnic_type | direct |
| created_at | 2022-07-
| data_plane_status | None |
| description | |
| device_id | |
| device_owner | |
| dns_assignment | None |
| dns_domain | None |
| dns_name | None |
| extra_dhcp_opts | |
| fixed_ips | ip_address=
| id | e79a0e66-
| ip_allocation | immediate |
| mac_address | fa:16:3e:99:4a:3e |
| name | SVF_MS_1 |
| network_id | 10084fdd-
| numa_affinity_
| port_security_
| project_id | cf7024f0f2bd46a
| propagate_
| qos_network_
| qos_policy_id | None |
| resource_request | None |
| revision_number | 1 |
| security_group_ids | |
| status | DOWN |
| tags | |
| trunk_details | None |
| updated_at | 2022-07-
+------
On those computes the nova-compute configuration includes:
[pci]
passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
Please note that according to https:/
This works perfectly for server creation with a smart VF on single segment neutron networks of vxlan type.
Now I try to create the server with the smart VF on compute3 on the MS network:
ceeinfra@
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | DR6AcaeH85nt |
| config_drive | |
| created | 2022-07-
| flavor | flavor_1 (e3a09880-
| hostId | |
| id | 09f3f8bb-
| image | BAT-image (a2e36de0-
| key_name | None |
| name | c3_ms_1 |
| progress | 0 |
| project_id | cf7024f0f2bd46a
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2022-07-
| user_id | fcd3b2713191485
| volumes_attached | |
+------
ceeinfra@
...
| fault | {'code': 500, 'created': '2022-07-
...
The scheduler cannot find the needed PCI devices with the needed capabilities on compute3… but what PCI devices is it looking for?
controlhost3:
<180>2022-
It cannot find any physical network associated with the vxlan segments, so it moves on to the next segment, which is a vlan segment that does have a physical network. But DC259-CEE3-DCGW-NET is for sure not able to provide the required PCI capabilities according to the nova-compute passthrough_whitelist.
edit, 2022-08-08: redacted some internal details
The physnet in the InstancePciRequest used in the scheduling comes from _get_physnet_tunneled_info [1].
In case of #1) [ovs sriov offload port on single segment neutron network] the code simply reads net.get('provider:physical_network'), and that can be None for a vxlan network. So the scheduler can match that to the whitelist on the compute: passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
In case of #3) [ovs sriov offload port on multisegment neutron network, one vlan, one vxlan segment] the code [1] searches through all the segments of the network to find one with a non-None physnet value. In this case that will be the vlan segment. Then the scheduler will see a mismatch between the physnet of the vlan segment in the InstancePciRequest and the whitelist on the compute with None physnet.
I would say the code in [1] has never supported ovs sriov offload with multisegment networks.
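To make the two code paths concrete, here is a heavily simplified sketch of the logic in [1] as I understand it (my own paraphrase; function and variable names are illustrative, not the verbatim nova source):

# Simplified paraphrase of the physnet resolution in [1]; illustrative only.

VXLAN_SEGMENT = {'provider:network_type': 'vxlan',
                 'provider:physical_network': None}
VLAN_SEGMENT = {'provider:network_type': 'vlan',
                'provider:physical_network': 'DC259-CEE3-DCGW-NET'}

def resolve_physnet(net, segments=None):
    # Returns the physnet that ends up in the InstancePciRequest spec.
    if segments is None:
        # Case #1: single segment network -- the value may well be None.
        return net.get('provider:physical_network')
    # Case #3: multisegment network -- the first non-None physnet wins,
    # even though the port will actually bind to the tunneled segment.
    for segment in segments:
        physnet = segment.get('provider:physical_network')
        if physnet is not None:
            return physnet
    return None

# Case #1: vxlan-only network -> physnet None -> matches the whitelist
# entries with "physical_network": null -> scheduling succeeds.
assert resolve_physnet(VXLAN_SEGMENT) is None

# Case #3: multisegment network -> the vlan segment's physnet is picked
# -> no whitelisted device carries that physnet -> scheduling fails.
assert resolve_physnet({}, [VXLAN_SEGMENT, VLAN_SEGMENT]) == 'DC259-CEE3-DCGW-NET'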
[1] https://opendev.org/openstack/nova/src/commit/8097c2b2153ff952a266395d4e351fc39f914c6b/nova/network/neutron.py#L1971-L2012