Comment 4 for bug 1915282

Revision history for this message
sean mooney (sean-k-mooney) wrote :

i will try to check this again tomorow after i have a clear head but if someone else can confrim that analysis that woudl be helpful.

asuuming this is not a securuity issue today there are a few things we can do to solve this broken behavior.

first on the neutron side we can remove direct-physical form the list of supported vnic types for now. if we want to supprot PF with ovn as external port in the future we can by ensuring they are being bound to a flat network and use vif_type hw_veb instead as the sriov nic agent does
that will fix the nova xml generattion and prevent this issue.

note there are still some tenat isolation issue with allowign flat networks but typically those will only be create by admin and they should know that its not safe to combine vm with direct-physical vnic type with a phyent that has both flat and vlan netowrk as it could easdrop on the vlan ones unless you use a heriacical port bindinding driver on the top of rack swith to prevent that. i.e. make so that the network is not vlan transparent at the top of rack swithc level.

on the nova side we can block whitelisting pci devices with physical_network: null, but add a workaround config option to renable that for those that are abusing our parsing to enable tunneled netowrks with ovs hardware offloads.

we could also be a little more targeted and never allow null for PFs.

in the senario above a VF would have been fine because we would have added the representor netdev to ovs and hardware offloaed ovs would have either process all packets in the software dataplane or offloaded the vxlan encapsulation to hardware.

in the PF case there is noting on the host we can add to ovs for hardware offloaed ovs to manage and we cannot enforece any tenant isolation via nova/libvirt so i dont see how we can ever supprot tunneled netowrk for type PF
vlan networks are alos suspect for PF by the way even though they are allowed by
https://specs.openstack.org/openstack/nova-specs/specs/ocata/implemented/sriov-pf-passthrough-neutron-port-vlan.html

personally i would have vetoed that if given the opertunity since it rely on a hyiracical port binding driver which we as nova have no way to validate to make it safe.
but that was at least discussed and reasoned about before hand even if i disagree with the choice that was made. there are many other issue with PF like the fact that all move operation while they appear to work are broken in at least 1 way i know of so people really should avoid them in production.

any addtionally to filtering or blocking the use of null conventionally via a workaround config option the other thing we need to do is understand how the pci passhtough filter is working or not work in this case and what other un intended side effect happen by not blocking the use of null.

the only reason i am not recommend flat out blocking null in all cases on the nova side is i am told that we have customer using it succesfully or planing to do so in the future so it appare when the stars aline it can actully work even if we dont know how.

the fact that device with phyical_network: null are storead as

{"devname":"<pf netdev name>","physical_network":null}

does mean that our protection form allcoating a gpu or qat device to neutron is working fine.
we uncondtionally ad the physnet request to the pci request spec when generating the form ports
https://github.com/openstack/nova/blob/56ac9b22cfa71daaf7a452fda63bd27811a358c4/nova/network/neutron.py#L2132

but since we have no test covverate for "physical_network":null
any part of the nova codebase where we do if physnet:
like this https://github.com/openstack/nova/blob/56ac9b22cfa71daaf7a452fda63bd27811a358c4/nova/network/neutron.py#L2122-L2123

will not necessary do the right thing.
so before we condier this use case fully supported in nova we would have to add a lot of unit
and functional coverage and hopefully get this tested in a third part ci too.