Exception in concurrent port binding activation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Undecided | Bodo Petermann |
Bug Description
Occasionally VM live-migrations fail in post-migration because the request to activate the port binding on the new host fails with a 500 Internal Server Error.
It appears that nova-compute may issue two activation requests in parallel; one succeeds, and the other returns the error.
Neutron version: yoga, 20.1.0
How to reproduce:
- create a port for a compute instance, with a binding to host host1
- create an additional port binding for host2, i.e. POST /v2.0/ports/
- this creates the new binding with status=INACTIVE
- activate the port binding with two requests in parallel (two concurrent PUT /v2.0/ports/
Actual result:
- one PUT request returns 200
- other PUT request returns 500
In the neutron-server log, the failed request records an exception: "sqlalchemy.
See https:/
Expected result:
- one PUT request returns 200
- other PUT request returns 409 (port binding already active)
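The expected behaviour amounts to an atomic compare-and-set on the binding status: exactly one of the two concurrent requests flips the binding from INACTIVE to ACTIVE, and the other is rejected with a conflict. A minimal Python sketch (illustrative only, not Neutron code) of those semantics:

```python
# Sketch of the EXPECTED outcome for two concurrent activation requests
# against a single INACTIVE port binding (illustrative, not Neutron code).
import threading

class PortBinding:
    def __init__(self):
        self.status = "INACTIVE"
        self._lock = threading.Lock()

    def activate(self):
        # Atomically flip INACTIVE -> ACTIVE; a second caller sees the
        # binding is already active and gets a 409 instead of a crash.
        with self._lock:
            if self.status == "ACTIVE":
                return 409  # conflict: port binding already active
            self.status = "ACTIVE"
            return 200

binding = PortBinding()
results = []
threads = [threading.Thread(target=lambda: results.append(binding.activate()))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [200, 409]
```

Under these semantics neither request ever produces a 500, regardless of which one wins the race.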
Background:
Nova live-migrations may trigger such concurrent activate requests.
In preparation for the live-migration, Nova creates a new port binding for the destination host. When the migration completes, it activates that binding. At least in our setup, that activation may be triggered from two places: (a) when the lifecycle event about the completed migration is handled, and (b) when the migration job monitor actively detects that the migration has completed. If the second request fails, post-live-migration breaks, the whole migration goes into an error state, and it may not finish all of its work.
Related bugzilla: https:/
Changed in neutron:
assignee: nobody → Bodo Petermann (bpetermann)
description: updated
As I understand it, the problem is in Ml2Plugin._commit_port_binding. Before the function is entered, it is checked that the port exists and that the INACTIVE binding for the new host exists. But inside _commit_port_binding the port is read again from the database, and by then the INACTIVE binding no longer exists, presumably because the concurrent activate request turned it ACTIVE in the meantime. If new_binding.status is INACTIVE, cur_context_binding is set to the currently-existing INACTIVE binding; since that binding is gone, cur_context_binding is None and driver_context.PortContext initialisation will crash.
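The race described above can be modelled in a few lines. This is a simplified sketch (not actual Neutron code; names are illustrative) showing how the binding that passed the pre-check vanishes before the second read, leaving the lookup with None:

```python
# Simplified model of the suspected race in _commit_port_binding
# (illustrative only, not Neutron code).
bindings = {"host2": {"status": "INACTIVE"}}

def precheck() -> bool:
    # Caller verifies an INACTIVE binding exists for the target host.
    return bindings.get("host2", {}).get("status") == "INACTIVE"

def concurrent_activation() -> None:
    # The parallel activate request flips the binding in the meantime.
    bindings["host2"]["status"] = "ACTIVE"

def commit_port_binding():
    # Re-read from the "database": no INACTIVE binding is found anymore,
    # so the lookup yields None instead of a binding object.
    return next((b for b in bindings.values()
                 if b["status"] == "INACTIVE"), None)

assert precheck()                       # pre-check passes
concurrent_activation()                 # other request wins the race
assert commit_port_binding() is None    # PortContext init would crash here
```

A fix along these lines would re-check the binding status inside the transaction and return a 409 when the binding is already ACTIVE, instead of dereferencing the missing binding.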