Exception in concurrent port binding activation

Bug #1986003 reported by Bodo Petermann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Undecided
Bodo Petermann

Bug Description

Occasionally VM live-migrations fail in post-migration because the request to activate the port binding on the new host fails with a 500 Internal Server Error.
It appears that nova-compute might try two requests in parallel. One of them succeeds, the other one returns the error.

Neutron version: yoga, 20.1.0

How to reproduce:

- create a port for a compute instance, with a binding to host host1
- create an additional port binding for host2, i.e. POST /v2.0/ports/{port_id}/bindings
- that will create the new binding with status=INACTIVE
- activate the port binding with 2 requests in parallel (2 times PUT /v2.0/ports/{port_id}/bindings/host2/activate)

Actual result:

- one PUT request returns 200
- other PUT request returns 500

In neutron-server log the failed request logs an exception: "sqlalchemy.orm.exc.UnmappedInstanceError: Class 'builtins.NoneType' is not mapped."
See https://paste.opendev.org/show/bFICeriQTlkmVwYQ5nzo/

Expected result:

- one PUT request returns 200
- other PUT request returns 409 (port binding already active)

Background:

Nova live-migrations may trigger such concurrent activate requests.
In preparation of the live-migration nova will create a new port binding for the destination host. When the migration completes it will activate that binding. At least in our setup that activation may be triggered from two places: (a) when the lifecycle event about completed migration is handled and (b) when the migration job monitor actively detects that the migration completed. If the 2nd one fails, the post-live-migration breaks and the whole migration goes into error state and may not finish all its work.

Related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2097160

Revision history for this message
Bodo Petermann (bpetermann) wrote :

As I understand it the problem is in Ml2Plugin._commit_port_binding. Before the function is entered it's checked that the port exists and that the INACTIVE binding for the new host exists. But inside _commit_port_binding the port is read again from the database and now the INACTIVE binding does not exist anymore. Probably because the concurrent activate request turned it to ACTIVE in the meantime.
If new_binding.status is INACTIVE cur_context_binding will be set to the currently-existing INACTIVE binding, but that's not there, so cur_context_binding is None and driver_context.PortContext initialisation will crash.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/853281

Changed in neutron:
status: New → In Progress
Changed in neutron:
assignee: nobody → Bodo Petermann (bpetermann)
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Bodo:

I'm not an expert in the Nova code. As far as I know, when Nova receives the vif-plugged event from the second host (destination host), it activates the second port binding.

When does the second activation takes place? Who is sending this second activation? Why?

Regards.

Revision history for this message
Bodo Petermann (bpetermann) wrote :

There are two code paths where nova will try to activate the same port binding.

1) lifecycle event EVENT_LIFECYCLE_MIGRATION_COMPLETED

handled in nova.compute.manager.ComputeManager.handle_lifecycle_event, in case of libvirt that event is based on one from libvirt (detail=VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED, job info type VIR_DOMAIN_JOB_COMPLETED). So when there's the suspened-migrated event, nova's callback function is called and nova will handle this lifecycle event. That leads to a port binding activation.
From handle_lifecycle_event via self.network_api.migrate_instance_start.

2) migration monitor

The _post_live_migration function of nova.compute.manager.ComputeManager will also try to activate the port binding via migrate_instance_start. _post_live_migration is registered to the driver's live_migration as a callback. In case of libvirt: LibvirtDriver implements a _live_migration_monitor that will periodically (every 0.5 sec) ask for the guest's job status and if it sees a VIR_DOMAIN_JOB_COMPLETED, it will call the post_method callback, i.e. _post_live_migration.

So there are two threads: one handling events from the guest / libvirt and one that checks the migration's job status periodically. Both of them will learn that the job completed and both of them will try to activate the port binding. Depending on the timing, the migration monitor may be late enough, so it will get a Conflict/409 response and then it will continue properly. If it finds the job completed almost at the same time as the lifecycle event, the race in neutron may be triggered and one of the two nova requests will receive a 500 response.

description: updated
Revision history for this message
sean mooney (sean-k-mooney) wrote :

from a nova perspective, i would expect both port bidding to return a 200.
as activating a port binding that is already ready active was intended to be valid.

both code paths are were described in the original nova spec
https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neutron-new-port-binding-api.html#migration-across-mixed-software-versions
with the expectation that neutron would accept the second activation if it was received.

alternative nova could hanel the 500/409 but this can also happen because we use retires when calling neutron so if we have request that can take a long time to complete we can have duplicate request even without the two code paths.

form my perspective the current nova behaviour is working as designed but clearly there is a dsync between nova's perspective of how this should work and neutrons.

we can all agree that this should not cause a 500.

my expectation is that the second activation would return a 200 not a 409 but we could work with either on the nova side if requried.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

this is the neutron side of the spec which is really hard to find since it has been moved and no redirect has been put in place
https://github.com/openstack/neutron-specs/blob/26f039390c64a698914165861981858ac849eda5/specs/pike/portbinding_information_for_nova.rst
https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html

the high level sequence diagram is here
https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#usage-by-nova

the activation of an active binding was defined to return a 4xx 409 would be appropriate

https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#activating-an-inactive-binding

and 5xx was defined for a failed binding altogh a 400 error probably would have been more correct.

that's a little tricky since the content of the request was correct so we should not use 400 but semantically it is invalid because the ml2 driver which is service side was not able to comply but 500 is not ideal since we don't difernciate between 500 (i could not bind for valid reasons i.e. there is no ovs agent on that host and i cant bind because the db is not acceptable)

when we went to implement this in nova we had issue with that and i thought we requested that we do not raise a 4xx at the rocky ptg https://etherpad.opendev.org/p/nova-ptg-rocky#L221 but i don't see the discussion of that. so i guess we instead workaround around this in nova and handled the 409.

given neutron is not returning a 409 in this case that is likely the porblem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/879327

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/879328

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/879329

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/879330

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/879331

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/853281
Committed: https://opendev.org/openstack/neutron/commit/5b4ed5b117f2f418d598af20744f571db581e2ae
Submitter: "Zuul (22348)"
Branch: master

commit 5b4ed5b117f2f418d598af20744f571db581e2ae
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/879327
Committed: https://opendev.org/openstack/neutron/commit/1b6fd61e330513ce93af817e268f9f3e5c1d470e
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 1b6fd61e330513ce93af817e268f9f3e5c1d470e
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3
    (cherry picked from commit 5b4ed5b117f2f418d598af20744f571db581e2ae)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/879328
Committed: https://opendev.org/openstack/neutron/commit/bd0d2ae6f1f5cf6847c79f19ab2b23a22efca1af
Submitter: "Zuul (22348)"
Branch: stable/zed

commit bd0d2ae6f1f5cf6847c79f19ab2b23a22efca1af
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3
    (cherry picked from commit 5b4ed5b117f2f418d598af20744f571db581e2ae)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/879329
Committed: https://opendev.org/openstack/neutron/commit/347486fb513f34bc054cfce5c85749054d1b3438
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 347486fb513f34bc054cfce5c85749054d1b3438
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3
    (cherry picked from commit 5b4ed5b117f2f418d598af20744f571db581e2ae)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/879330
Committed: https://opendev.org/openstack/neutron/commit/fc8960c16a68a68f1056fb2032eecbd430b41697
Submitter: "Zuul (22348)"
Branch: stable/xena

commit fc8960c16a68a68f1056fb2032eecbd430b41697
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3
    (cherry picked from commit 5b4ed5b117f2f418d598af20744f571db581e2ae)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/879331
Committed: https://opendev.org/openstack/neutron/commit/93f176995266ce59455c01265630d99a8c4cd74a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 93f176995266ce59455c01265630d99a8c4cd74a
Author: Bodo Petermann <email address hidden>
Date: Tue Aug 16 14:14:14 2022 +0200

    Fix concurrent port binding activate

    Fix an issue with concurrent requests to activate a port binding.
    If there are two activate requests in parallel, one might set the
    binding on the new host to active and the other request may
    not find the previously INACTIVE row anymore in
    _commit_port_binding and initializing the driver_context.PortContext
    crashed.

    Closes-Bug: #1986003
    Change-Id: I047e33062bc38f36848e0149c6e670cb5828c8e3
    (cherry picked from commit 5b4ed5b117f2f418d598af20744f571db581e2ae)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.7.0

This issue was fixed in the openstack/neutron 19.7.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.0.0.0b2

This issue was fixed in the openstack/neutron 23.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.1

This issue was fixed in the openstack/neutron 22.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.1.1

This issue was fixed in the openstack/neutron 21.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.3.1

This issue was fixed in the openstack/neutron 20.3.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron wallaby-eom

This issue was fixed in the openstack/neutron wallaby-eom release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.