cannot schedule ovs sriov offload port to tunneled segment
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | In Progress | Wishlist | Unassigned |
Bug Description
We observed a scheduling failure when using ovs sriov offload (https:/ ) in combination with multisegment networks. The problem seems to affect the case where the port should be bound to a tunneled network segment (a segment that does not have a physnet).
I read that the nova scheduler works the same way for pci sriov passthrough, so I believe the same bug affects pci sriov passthrough as well, though I did not test that.
Due to the special hardware needed for this environment I could not reproduce this in devstack, but I hope we have collected enough information to show the error regardless. We believe we have also identified the relevant lines of code.
The overall setup includes l2gw, connecting the segments in the multisegment network, but I will ignore that here, since l2gw cannot be part of the root cause. Neutron was configured with mechanism_
As I understand the problem:
1) ovs sriov offload port on a single segment neutron network, where the segment is vxlan: works
2) normal port (--vnic-type normal) on a non-offload-capable ovs, on a multisegment neutron network with one vlan and one vxlan segment, where the port should be bound to the vxlan segment: works
3) ovs sriov offload port on a multisegment neutron network with one vlan and one vxlan segment, where the port should be bound to the vxlan segment: does not work
To reproduce:
* create a multisegment network with one vlan and one vxlan segment
* create a port on that network with "--vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' --disable-
* boot a vm with that port
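For what it's worth, the same reproduction expressed via openstacksdk might look roughly like this (an untested sketch; the cloud name, network name, image and flavor are placeholders, since the exact values are truncated or environment-specific above):

# Rough openstacksdk equivalent of the reproduction steps above.
# Untested sketch; names below are placeholders.
import openstack

conn = openstack.connect(cloud='mycloud')  # assumed clouds.yaml entry

# The multisegment network (one vlan and one vxlan segment) is assumed
# to exist already, created as in the commands quoted later.
net = conn.network.find_network('esohtom_ms')  # placeholder name

# ovs sriov offload port: direct vnic with the switchdev capability
# and port security disabled.
port = conn.network.create_port(
    network_id=net.id,
    binding_vnic_type='direct',
    binding_profile={'capabilities': ['switchdev']},
    is_port_security_enabled=False,
)

# Boot a vm with that port.
server = conn.compute.create_server(
    name='c3_ms_1',
    image_id=conn.compute.find_image('BAT-image').id,
    flavor_id=conn.compute.find_flavor('flavor_1').id,
    networks=[{'port': port.id}],
)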
On the compute host on which we expect the scheduling and boot to succeed we have configuration like:
[pci]
passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
According to https:/
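To spell out how such a whitelist entry interacts with a PCI request, here is an illustrative simplification (my own sketch, not nova's actual matching code; function and variable names are made up):

# Illustrative simplification of whitelist/request matching.
# Not nova's real code; names are made up for this sketch.

WHITELIST = [
    {'devname': 'data2', 'physical_network': None},  # "null" in the config
    {'devname': 'data3', 'physical_network': None},
]

def device_satisfies_request(device, requested_physnet):
    # A device whitelisted with "physical_network": null carries no physnet,
    # so it can only satisfy a request that does not ask for one.
    return device['physical_network'] == requested_physnet

# A request without a physnet (tunneled segment) matches:
assert device_satisfies_request(WHITELIST[0], None)
# A request carrying a physnet (e.g. from a vlan segment) does not:
assert not device_satisfies_request(WHITELIST[0], 'DC259-CEE3-DCGW-NET')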
The vm boot fails with:
$ openstack server show c3_ms_1
...
| fault | {'code': 500, 'created': '2022-07-
...
In the scheduler logs we see that the scheduler uses a spec with a physnet. But the pci passthrough capability is on a device without a physnet.
controlhost3:
<180>2022-
We observed the bug originally on stable/victoria and found these source code lines:
https:/
Here, for vnic_type=direct ports, we unconditionally add a physnet to the spec.
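A hedged paraphrase of that piece, as I read it (heavily simplified, not the verbatim nova source; names are illustrative):

# Simplified paraphrase of the spec construction for vnic_type=direct
# ports; not the verbatim nova source.
def build_pci_request_spec(physnet):
    # The 'physical_network' tag is always added, with whatever value was
    # resolved for the network -- possibly None for a tunneled network,
    # or the vlan segment's physnet for a multisegment network.
    return {'physical_network': physnet}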
From victoria to master the only recent change in that piece is this:
https:/
That change seems irrelevant to this bug, so I believe the bug can be reproduced on master too.
Since I have not reproduced this bug myself, I include below my colleague Angelo Nappo's email to me, containing the exact commands and configs. However, I hope the above already provides all the relevant information, with all downstream-specific details eliminated:
The use case in the legacy NFVI solution
The system is based on VXLAN+SDN. If the VM has to reach any host outside the data center:
* a MS network is created
* an l2gw connection is created, so the segments are “joined together” in the switch fabric that does the vlan-to-vxlan transformation
For example:
openstack network create --provider-
openstack network segment create --network-type vxlan --network esohtom_
openstack network segment create --physical-network DC259-CEE3-DCGW-NET --segment 638 --network-type vlan --network esohtom_
ceeinfra@
+------
| ID | Name | Network | Network Type | Segment |
+------
| 1adda535-
| 29b3505c-
| 506c76db-
| 578d3f85-
| 78b21ff9-
| 7b112fcd-
| 8ca5d660-
| 977147b3-
| 9b6369fe-
| cd6f295e-
+------
Note: The second vxlan segment (2700) was created to see if it makes any difference, but it is for sure not needed and not wanted.
openstack subnet create --network esohtom_
openstack port create --vnic-type normal --disable-
neutron l2-gateway-
neutron l2-gateway-
neutron l2-gateway-
ceeinfra@
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | LCnpM4wskqoL |
| config_drive | |
| created | 2022-07-
| flavor | flavor_1 (e3a09880-
| hostId | |
| id | 5fab4bbc-
| image | BAT-image (a2e36de0-
| key_name | None |
| name | c1_ms_1 |
| progress | 0 |
| project_id | cf7024f0f2bd46a
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2022-07-
| user_id | fcd3b2713191485
| volumes_attached | |
+------
ceeinfra@
+------
| ID | Name | Status | Networks | Image | Flavor |
+------
| 5fab4bbc-
| 5cb48a3e-
| e343d12a-
| cd492514-
| 2806ff70-
| 58e47d2d-
| 2ba6d7ce-
| 2e434248-
| 8990cec4-
| 59e4a587-
| a0446d12-
| 7aed1ccd-
| f1ff093b-
| 16c439ba-
| 7d5472ee-
+------
This VM can actually ping the DC gateway. The compute where the VM is running uses the legacy vswitch (OVS-dpdk) with no offload capabilities.
controlhost1:
[ml2]
type_drivers = vxlan,vlan,flat
tenant_
mechanism_drivers = sriovnicswitch,
extension_drivers = port_security,qos
path_mtu = 2140
physical_
[ml2_type_vlan]
network_vlan_ranges = DC259-CEE3-
[ml2_type_flat]
flat_networks = DC259-CEE3-
[ml2_type_vxlan]
vni_ranges = 2001:2999
[securitygroup]
firewall_driver = neutron.
[ml2_sdi]
[ml2_bsp]
[ml2_odl]
controlhost1:
[ml2_odl]
url = redacted
username = redacted
password = redacted
enable_dhcp_service = False
enable_full_sync = false
port_binding_
enable_
odl_features = "operational-
# scheduler config
[filter_scheduler]
enabled_filters = AggregateMultiT
The use case in the NFVI solution including smartNIC (OVS kernel datapath offload to smart VFs)
Now I create the neutron port representing a smart VF in the MS network, for which I have the necessary capabilities on compute3 and compute4:
ceeinfra@
+------
| Field | Value |
+------
| admin_state_up | UP |
| allowed_
| binding_host_id | |
| binding_profile | capabilities=
| binding_vif_details | |
| binding_vif_type | unbound |
| binding_vnic_type | direct |
| created_at | 2022-07-
| data_plane_status | None |
| description | |
| device_id | |
| device_owner | |
| dns_assignment | None |
| dns_domain | None |
| dns_name | None |
| extra_dhcp_opts | |
| fixed_ips | ip_address=
| id | e79a0e66-
| ip_allocation | immediate |
| mac_address | fa:16:3e:99:4a:3e |
| name | SVF_MS_1 |
| network_id | 10084fdd-
| numa_affinity_
| port_security_
| project_id | cf7024f0f2bd46a
| propagate_
| qos_network_
| qos_policy_id | None |
| resource_request | None |
| revision_number | 1 |
| security_group_ids | |
| status | DOWN |
| tags | |
| trunk_details | None |
| updated_at | 2022-07-
+------
On those computes the nova-compute configuration includes:
[pci]
passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
Please note that according to https:/
This works perfectly for server creation with a smart VF on single segment neutron networks of vxlan type.
Now I try to create the server with the smart VF on compute3 on the MS network:
ceeinfra@
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | building |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | DR6AcaeH85nt |
| config_drive | |
| created | 2022-07-
| flavor | flavor_1 (e3a09880-
| hostId | |
| id | 09f3f8bb-
| image | BAT-image (a2e36de0-
| key_name | None |
| name | c3_ms_1 |
| progress | 0 |
| project_id | cf7024f0f2bd46a
| properties | |
| security_groups | name='default' |
| status | BUILD |
| updated | 2022-07-
| user_id | fcd3b2713191485
| volumes_attached | |
+------
ceeinfra@
...
| fault | {'code': 500, 'created': '2022-07-
...
The scheduler cannot find the needed PCI devices with the needed capabilities on compute3… but what PCI devices is it looking for?
controlhost3:
<180>2022-
It cannot find any physical network associated with the vxlan segments, so it moves on to the next segment, which is a vlan segment that does have a physical network. But DC259-CEE3-DCGW-NET is for sure not able to provide the required PCI capabilities according to the nova-compute passthrough_whitelist.
edit, 2022-08-08: redacted some internal details
The physnet in the InstancePciRequest used in the scheduling comes from _get_physnet_tunneled_info [1].
In case of #1) [ovs sriov offload port on single segment neutron network] the code simply reads net.get('provider:physical_network'), and that can be None for a vxlan network. So the scheduler can match that to the whitelist on the compute: passthrough_whitelist = [{"devname": "data2", "physical_network": null}, {"devname": "data3", "physical_network": null}]
In case of #3) [ovs sriov offload port on multisegment neutron network, one vlan, one vxlan segment] the code [1] searches through all the segments of the network to find one with a non-None physnet value. In this case that will be the vlan segment. Then the scheduler will see a mismatch between the physnet of the vlan segment in the InstancePciRequest and the whitelist on the compute with None physnet.
I would say the code in [1] has never supported ovs sriov offload with multisegment networks.
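To make the two code paths concrete, here is a heavily simplified sketch of the logic in [1] as I understand it (my own paraphrase; function and variable names are illustrative, not the verbatim nova source):

# Simplified paraphrase of the physnet resolution in [1]; illustrative only.

VXLAN_SEGMENT = {'provider:network_type': 'vxlan',
                 'provider:physical_network': None}
VLAN_SEGMENT = {'provider:network_type': 'vlan',
                'provider:physical_network': 'DC259-CEE3-DCGW-NET'}

def resolve_physnet(net, segments=None):
    # Returns the physnet that ends up in the InstancePciRequest spec.
    if segments is None:
        # Case #1: single segment network -- the value may well be None.
        return net.get('provider:physical_network')
    # Case #3: multisegment network -- the first non-None physnet wins,
    # even though the port will actually bind to the tunneled segment.
    for segment in segments:
        physnet = segment.get('provider:physical_network')
        if physnet is not None:
            return physnet
    return None

# Case #1: vxlan-only network -> physnet None -> matches the whitelist
# entries with "physical_network": null -> scheduling succeeds.
assert resolve_physnet(VXLAN_SEGMENT) is None

# Case #3: multisegment network -> the vlan segment's physnet is picked
# -> no whitelisted device carries that physnet -> scheduling fails.
assert resolve_physnet({}, [VXLAN_SEGMENT, VLAN_SEGMENT]) == 'DC259-CEE3-DCGW-NET'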
[1] https://opendev.org/openstack/nova/src/commit/8097c2b2153ff952a266395d4e351fc39f914c6b/nova/network/neutron.py#L1971-L2012