[ml2/ovs]Empty binding_levels=[] cause ovs-agent skipped to process port

Bug #2065577 reported by LIU Yulong
This bug affects 1 person
Affects: neutron
Status: Opinion
Importance: Medium
Assigned to: LIU Yulong
Milestone: (none listed)

Bug Description

In our production environment we noticed VM boot failures that follow this sequence:
1. A port is created for nova (port revision_number 0->1).
2. Nova boots a VM with --nic port-id.
3. Nova schedules the VM to a host and plugs the port.
4. Nova updates the port device_owner (port revision_number 1->2).
5. Nova updates the port host (port revision_number 2->3).
   (Yes, nova calls update_port twice!)
6. Before the real _bind_port_if_needed call, neutron-server pushes the port info cache with binding_levels=[] and revision_number=3.
7. Neutron-server tries to bind the port.
8. The neutron-ovs-agent rpc_loop tries to get the port details.
9. The neutron-ovs-agent info cache RPC returns empty binding_levels=[], so the agent skips processing the port.
10. Neutron-server finishes binding the port and sends the info cache again;
   the port revision_number is still 3, but binding_levels=[<entry>] is no longer empty.
11. Neutron-ovs-agent receives the new info cache notification, but because the revision_number has not changed, the cache is not updated.

The port is never processed after that.
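
The sequence in steps 6 to 11 can be sketched with a toy model. This is only an illustration of the behaviour described in this report, not neutron's actual cache code: `PortUpdate`, `AgentCache`, `record_update` and `can_process` are hypothetical names, and the "ignore updates whose revision_number is not newer" rule is an assumption taken from step 11.

```python
from dataclasses import dataclass, field


@dataclass
class PortUpdate:
    """Simplified stand-in for a pushed port object (hypothetical)."""
    port_id: str
    revision_number: int
    binding_levels: list = field(default_factory=list)


class AgentCache:
    """Toy cache that ignores updates whose revision_number is not newer."""

    def __init__(self):
        self._ports = {}

    def record_update(self, update):
        current = self._ports.get(update.port_id)
        if current is not None and update.revision_number <= current.revision_number:
            # Same or older revision: treated as already known and dropped.
            return False
        self._ports[update.port_id] = update
        return True

    def can_process(self, port_id):
        port = self._ports.get(port_id)
        # The agent skips ports that have no binding levels.
        return port is not None and bool(port.binding_levels)


cache = AgentCache()
# Step 6: pushed before binding, revision 3, no binding levels.
first_accepted = cache.record_update(PortUpdate("port0", 3, []))
# Step 10: pushed after binding, still revision 3, binding levels filled in.
second_accepted = cache.record_update(PortUpdate("port0", 3, ["level-0"]))
```

In this model the first update is stored, the second is dropped because its revision_number is unchanged, and `cache.can_process("port0")` stays false, which matches the stuck port described above.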

LIU Yulong (dragon889)
summary: - Empty binding_levels=[] cause ovs-agent skipped to process port
+ [ml2/ovs]Empty binding_levels=[] cause ovs-agent skipped to process port
description: updated
Bence Romsics (bence-romsics) wrote:

Hi Yulong,

Thanks for the report! I guess this is a bug that occurs infrequently. How frequent is it? Did you observe this on master or on some other version? Do you have a method that reproduces it at will? The cause sounds timing dependent, so maybe inserting a sleep() at a critical place would help?

It seems to me a possible fix would be to ensure the port revision_number gets bumped when the binding_levels change from empty to something (between point 6 and 10). Do you have an idea why the revision_number is not bumped between point 6 and 10? In my environment (where this bug does not occur) if I boot a vm with --nic port-id=port0 then port0's revision_number is 4 when everything is done.
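
A minimal sketch of that idea, assuming the agent only accepts updates with a strictly newer revision_number. The helpers `next_revision` and `is_newer` are hypothetical and exist only for this illustration; they are not neutron functions.

```python
def next_revision(revision_number, old_levels, new_levels):
    """Return the revision to publish after a binding attempt.

    The revision is bumped whenever binding_levels changed, for example
    from empty to populated.
    """
    if old_levels != new_levels:
        return revision_number + 1
    return revision_number


def is_newer(cached_revision, incoming_revision):
    """Assumed agent-side rule: accept only strictly newer revisions."""
    return incoming_revision > cached_revision


# Point 6: revision 3 is pushed with empty binding_levels.
cached_revision = 3
# Point 10: binding finished, so binding_levels went from [] to one entry.
published_revision = next_revision(3, [], ["level-0"])
accepted = is_newer(cached_revision, published_revision)
```

With the bump, the post-binding update carries revision 4 and passes the "newer than cached" check, so the agent would see the populated binding_levels.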

Or would you propose a different fix? Do you want to take this bug?

Changed in neutron:
status: New → Incomplete
LIU Yulong (dragon889) wrote:

Hi Bence,

Creating VMs with pre-created ports reproduces this:
create 10 (or more) ports, then create 5 (or more) VMs with --nic port-id, attaching 2 ports (NICs) to each VM.

This reproduces the issue frequently in our environment.
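
The reproduction steps could be scripted roughly as follows. This is a sketch, not a tested reproducer: it assumes `conn` is an openstacksdk connection, and the function name, resource names and the network, image and flavor IDs are placeholders chosen for this example.

```python
def boot_vms_with_precreated_ports(conn, network_id, image_id, flavor_id,
                                   vm_count=5, ports_per_vm=2):
    """Create all ports first, then boot VMs that each attach several ports."""
    ports = [
        conn.network.create_port(network_id=network_id,
                                 name=f"repro-port-{index}")
        for index in range(vm_count * ports_per_vm)
    ]
    servers = []
    for vm_index in range(vm_count):
        start = vm_index * ports_per_vm
        vm_ports = ports[start:start + ports_per_vm]
        server = conn.compute.create_server(
            name=f"repro-vm-{vm_index}",
            image_id=image_id,
            flavor_id=flavor_id,
            # Equivalent of passing --nic port-id=<id> once per port.
            networks=[{"port": port.id} for port in vm_ports],
        )
        servers.append(server)
    return servers
```

With a real cloud, `conn` would typically come from `openstack.connect()`. Because the issue is timing dependent, a run of this script may or may not hit the race.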

I have not tested this on master.

I have no local fix, but after some code research we found that [1] may be related to this.
It changed the ml2 OVO push RPC from the eventlet pool to Python native threads. Then [2] switched the OVO push to a Python Queue, but native threads are still in use.
In some cases the Python thread scheduler may not run as expected: the DB save happens slightly later than the OVO push RPC, and sometimes the OVO push RPC does not run at all.
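
The suspected ordering problem can be illustrated with a small standalone sketch. It is not neutron code: the dict `db`, the list `pushed` and `push_worker` are made up for this example, and the `threading.Event` deliberately forces the unlucky ordering that, in a real deployment, would depend on the thread scheduler.

```python
import queue
import threading

# Toy "database" row for one port, before the binding result is saved.
db = {"port0": {"revision_number": 3, "binding_levels": []}}
pushed = []  # snapshots the agent would receive
push_queue = queue.Queue()
snapshot_taken = threading.Event()


def push_worker():
    """Native-thread worker that pushes whatever is in the DB right now."""
    port_id = push_queue.get()
    row = db[port_id]
    pushed.append({
        "revision_number": row["revision_number"],
        "binding_levels": list(row["binding_levels"]),
    })
    snapshot_taken.set()


worker = threading.Thread(target=push_worker)
worker.start()

push_queue.put("port0")          # push notification is queued
snapshot_taken.wait(timeout=5)   # force the push to run before the save
db["port0"]["binding_levels"] = ["level-0"]  # binding result saved afterwards
worker.join(timeout=5)
```

Here the pushed snapshot still has empty binding_levels at revision 3, while the stored row already has the binding entry, which is the mismatch described in the bug description.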

[1] https://review.opendev.org/c/openstack/neutron/+/555608/35/neutron/plugins/ml2/ovo_rpc.py
[2] https://review.opendev.org/c/openstack/neutron/+/788510/12/neutron/plugins/ml2/ovo_rpc.py

Still analyzing...

I am reporting this bug to see whether others hit the same issue in their deployments.

Changed in neutron:
status: Incomplete → Opinion
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/919915

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/919917

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/919918

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/919915
Committed: https://opendev.org/openstack/neutron/commit/80577381d9819d0d5be7d89978139ca1b0dc6977
Submitter: "Zuul (22348)"
Branch: master

commit 80577381d9819d0d5be7d89978139ca1b0dc6977
Author: LIU Yulong <email address hidden>
Date: Fri May 17 11:43:13 2024 +0800

    Bump port revision if binding_levels changed

    Related-Bug: #2065577
    Change-Id: I0d04c3a1dae86a2b6b4ba70c0e9595b2527b4f71
