[OVN] orphaned virtual parent ports break new ports

Bug #2000378 reported by Boris-Barboris
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

Reproducible on stable/yoga.

If the OVN port deletion fails due to a backend (MariaDB or OVN) connection failure, orphaned switch ports are left behind in the OVN NB DB.

oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: DELETE FROM securitygroupportbindings WHERE securitygroupportbindings.port_id = %(port_id)s AND securitygroupportbindings.security_group_id = %(security_group_id)s]
[parameters: {'port_id': '76ff3324-7326-412d-bdc9-df5db5adcf84', 'security_group_id': 'fe1f6c5c-4d49-4ccc-ac2e-20ef23041510'}]

neutron/neutron-server.log:78508:2022-12-12 16:39:15.309 691 ERROR neutron.plugins.ml2.managers [... - default default] Mechanism driver 'ovn' failed in delete_port_postcommit: ovsdbapp.exceptions.TimeoutException: Commands [DelLSwitchPortCommand(lport=76ff3324-7326-412d-bdc9-df5db5adcf84...

Such ports are detected by the maintenance task, but are only reported as warnings in the logs:

neutron/neutron-server.log:76862:2022-12-12 16:35:11.420 712 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance [req-4a4b33c2-85b3-48c1-8b15-ed6d65db3c2d - - - - -] Skip fixing resource 76ff3324-7326-412d-bdc9-df5db5adcf84 (type: ports). Resource does not exist in Neutron database anymore: neutron_lib.exceptions.PortNotFound: Port 76ff3324-7326-412d-bdc9-df5db5adcf84 could not be found.

When Neutron later creates a new port for a Nova instance in the same network and the new port's IP address matches the IP of the orphaned virtual parent, Neutron binds the new port's virtual switch port to the orphan but fails to proceed with the binding algorithm, resulting in a perpetually-DOWN port.

For example, here is the OVN-side body of a new virtual port that failed to bind to the compute node:

addresses : ["fa:16:3e:44:8a:d5 10.0.0.29"]
enabled : true
external_ids : {"neutron:cidrs"="10.0.0.29/24", "neutron:device_id"="2098a135-d6a6-4221-a8e9-2584c170dade", "neutron:device_owner"="compute:nova", "neutron:network_name"=neutron-def3de91-2120-47b5-b9f1-6ed51cf0e604, "neutron:port_name"="", "neutron:project_id"="867ba703d19947629e01d800ecdc01c0", "neutron:revision_number"="3", "neutron:security_group_ids"="2ecda920-36a2-44ff-96fa-a652d1cbd6c1 fe1f6c5c-4d49-4ccc-ac2e-20ef23041510"}
name : "79cba8eb-dd1a-455a-873c-0e04f398c8d0"
options : {mcast_flood_reports="true", requested-chassis=cmpt-av-02, virtual-ip="10.0.0.29", virtual-parents="76ff3324-7326-412d-bdc9-df5db5adcf84"}
port_security : ["fa:16:3e:44:8a:d5 10.0.0.29"]
type : virtual
up : false

It was incorrectly bound to the orphaned parent 76ff3324:

addresses : ["fa:16:3e:f6:cc:6a 10.0.0.29"]
enabled : true
external_ids : {"neutron:cidrs"="10.0.0.29/24", "neutron:device_id"="91e19b3e-1412-4519-b499-06ae794ee0a3", "neutron:device_owner"="", "neutron:network_name"=neutron-def3de91-2120-47b5-b9f1-6ed51cf0e604, "neutron:port_name"="", "neutron:project_id"="867ba703d19947629e01d800ecdc01c0", "neutron:revision_number"="1", "neutron:security_group_ids"="fe1f6c5c-4d49-4ccc-ac2e-20ef23041510"}
name : "76ff3324-7326-412d-bdc9-df5db5adcf84"
options : {mcast_flood_reports="true", requested-chassis=""}
port_security : ["fa:16:3e:f6:cc:6a 10.0.0.29"]
type : ""
up : false

As we can see, the only matching values are the (IP, network_id) pair, which may indicate that the problem lies in the usage of the

def get_virtual_port_parents(self, virtual_ip, port):

function in neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py:303
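
For illustration, a minimal self-contained sketch (not the actual neutron code) of what an (IP, network)-only parent lookup amounts to; the function and data layout below are hypothetical, with values taken from the dumps above:

def parent_candidates(lsps, network, virtual_ip, new_port_id):
    """Return names of LSPs in `network` whose addresses carry virtual_ip."""
    matches = []
    for lsp in lsps:
        if lsp['network'] != network or lsp['name'] == new_port_id:
            continue
        # each addresses entry looks like "fa:16:3e:f6:cc:6a 10.0.0.29"
        ips = {tok for addr in lsp['addresses'] for tok in addr.split()[1:]}
        if virtual_ip in ips:
            matches.append(lsp['name'])  # nothing checks that this LSP still
    return matches                       # exists as a Neutron port

orphan = {'name': '76ff3324-7326-412d-bdc9-df5db5adcf84',
          'network': 'def3de91-2120-47b5-b9f1-6ed51cf0e604',
          'addresses': ['fa:16:3e:f6:cc:6a 10.0.0.29']}

print(parent_candidates([orphan],
                        'def3de91-2120-47b5-b9f1-6ed51cf0e604',
                        '10.0.0.29',
                        '79cba8eb-dd1a-455a-873c-0e04f398c8d0'))
# -> ['76ff3324-7326-412d-bdc9-df5db5adcf84'], so the leftover ends up in the
#    new port's "virtual-parents" option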

Manual workaround:
manually delete the port from the OVN NB DB (ovn-nbctl lsp-del) and remove its row from the Neutron ovn_revision_numbers table.
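
For reference, a hedged sketch of that cleanup as a script, assuming shell access to ovn-nbctl and direct MariaDB access; the credentials and the resource_uuid column name of ovn_revision_numbers are placeholders to verify against your own deployment before running anything like this:

import subprocess
import pymysql

ORPHAN = '76ff3324-7326-412d-bdc9-df5db5adcf84'

# 1) drop the leftover logical switch port from the OVN NB DB
subprocess.run(['ovn-nbctl', 'lsp-del', ORPHAN], check=True)

# 2) drop its row from ovn_revision_numbers so the maintenance task
#    no longer tracks a resource that exists nowhere else
conn = pymysql.connect(host='localhost', user='neutron',
                       password='***', database='neutron')
try:
    with conn.cursor() as cur:
        cur.execute(
            'DELETE FROM ovn_revision_numbers WHERE resource_uuid = %s',
            (ORPHAN,))
    conn.commit()
finally:
    conn.close()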

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Boris:

Can you describe why the port deletion failed in the first place? If the Neutron DB port deletion failed, why is the port not present in the Neutron DB but still present in the OVN DB? Sorry, but I can't see how that could have happened.

In any case, what you have is an OVN LSP leftover that is not removed by the maintenance task. Correct?

Another question: how is "get_virtual_port_parents" related to this issue?

Regards.

Revision history for this message
Boris-Barboris (abaranin) wrote :

Hello!

> If the Neutron DB port deletion failed, why is not present in the Neutron DB but it is in the OVN DB?

It was originally caused by a simulated controller failure: one of the OVSDB and MariaDB instances was hard shut down.
The Neutron DB port deletion did not in fact fail; the Galera cluster just reported a deadlock (and the delete was probably retried successfully). It is the OVN port deletion in the postcommit hook that failed:

ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_client [req-ba69259e-982d-4065-beb0-39ea20e3eea0... [DelLSwitchPortCommand(lport=76ff3324-7326-412d-bdc9-df5db5adcf84 ... exceeded timeout 180 seconds, cause: Result queue is empty

2022-12-12 16:39:15.309 691 ERROR neutron.plugins.ml2.plugin [req-ba69259e-982d-4065-beb0-39ea20e3eea0 28f1ae985c714b38868ace86630f4fa1 867ba703d19947629e01d800ecdc01c0 - default default] mechanism_manager.delete_port_postcommit failed for port 76ff3324-7326-412d-bdc9-df5db5adcf84: neutron.plugins.ml2.common.exceptions.MechanismDriverError

2022-12-12 16:39:15.334 691 INFO neutron.wsgi [req-ba69259e-982d-4065-beb0-39ea20e3eea0 28f1ae985c714b38868ace86630f4fa1 867ba703d19947629e01d800ecdc01c0 - default default] 10.101.56.193,10.101.56.1 "DELETE /v2.0/ports/76ff3324-7326-412d-bdc9-df5db5adcf84 HTTP/1.1" status: 204 len: 0 time: 639.6591825

Note the 640-second request time: multiple OVN timeouts were reached.

> In any case, what you have is an OVN LSP leftover that is not removed by the maintenance task. Correct?

Yes, an OVN LSP of type "".

> Another question is: how "get_virtual_port_parents" is related to this issue?

It seems to me that this is the function responsible for building the parent-child relation between ports, and it only operates on the (IP, network_id) pair, which causes the port binding failure.
It looks like the functionality from 5e72ea104cba1c30d2de36dbbab6e3d23a075929 uses two OVN LSPs for each such Neutron port, one of type "virtual" and one of type "", with the virtual one referencing the other through the "virtual-parents" field. This connection is too lax: leftover LSPs interfere with the lookup done by get_virtual_port_parents.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

OK, I see the problem now: the OVN LSP leftover is "76ff3324-7326-412d-bdc9-df5db5adcf84". When a new LSP is created in the same network with the same IP address, it is created as type "virtual" with the leftover as its virtual parent.

In this case I'll state that the ML2/OVN code does not expect this kind of leftover. This is something wrong in the system that should be solved outside this code. The maintenance task is not fixing this issue either. I would suggest that anyone taking this bug consider removing those OVN DB items that have Neutron information but are no longer referenced in the Neutron DB; this LSP is a good example.
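
To illustrate that suggestion, a minimal self-contained sketch (not existing neutron code, and ignoring races between the two databases) of how such a check could flag the leftover:

def find_orphaned_lsps(ovn_lsps, neutron_port_ids):
    """Return LSP names that carry Neutron metadata but have no Neutron port."""
    orphans = []
    for lsp in ovn_lsps:
        # only consider LSPs created by neutron (they carry "neutron:*"
        # keys in external_ids, as in the dumps above)
        if not any(k.startswith('neutron:') for k in lsp['external_ids']):
            continue
        if lsp['name'] not in neutron_port_ids:
            orphans.append(lsp['name'])
    return orphans

lsps = [{'name': '76ff3324-7326-412d-bdc9-df5db5adcf84',
         'external_ids': {'neutron:device_owner': ''}}]
print(find_orphaned_lsps(lsps, neutron_port_ids=set()))
# -> ['76ff3324-7326-412d-bdc9-df5db5adcf84'], i.e. a candidate for lsp-del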

For now, the only solution I can offer is what you already did: manually clean the OVN DB after the simulated failure.

Changed in neutron:
importance: Undecided → Medium