[SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

Bug #2017748 reported by Lucas Alvares Gomes
Affects                 Status        Importance  Assigned to   Milestone
Ubuntu Cloud Archive    New           Undecided   Unassigned
  Yoga                  New           Undecided   Hua Zhang
  Zed                   New           Undecided   Hua Zhang
neutron                 Fix Released  High        Terry Wilson
neutron (Ubuntu)        Fix Released  Undecided   Unassigned
  Focal                 New           Undecided   Unassigned
  Jammy                 New           Undecided   Unassigned

Bug Description

[Impact]

ovnmeta- namespaces are intermittently missing, so the affected VMs cannot be reached.

[Test Case]
TBD
- Not able to reproduce this easily.

[Where problems could occur]
These patches are related to the OVN metadata agent on compute nodes.
VM connectivity could be affected by these patches when OVN is used.
Binding a port to a datapath could be affected.

[Others]

== ORIGINAL DESCRIPTION ==

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:

udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed

The ovnmeta- namespaces for the networks that the VMs were booting from were also missing. Looking into ovn-metadata-agent.log:

2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

Apparently, when the system is under stress (scalability tests), there are edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database. When the PortBindingChassisEvent event is handled and tries to find either the metadata port or the IP information on it (which is updated by ML2/OVN during subnet creation), it cannot be found, and the handler fails silently with the message shown above.

Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.
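
The edge case described above can be illustrated with a small toy model. This is only a sketch, not neutron's actual code: MetadataPort, get_provision_params and the sb_ports dict are hypothetical stand-ins for the Southbound Port_Binding lookup, and the MAC and IP values are made up. It shows how handling the event before the metadata port has been propagated yields "nothing to provision", while the same lookup succeeds once the data arrives.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MetadataPort:
    """Hypothetical stand-in for the metadata port's Southbound row."""
    mac: str = ""
    ips: List[str] = field(default_factory=list)


def get_provision_params(sb_ports: dict, network_id: str) -> Optional[Tuple[str, List[str]]]:
    """Return (mac, ips) for the network's metadata port, or None.

    None models the "no metadata port ... or it has no MAC or IP
    addresses configured" case, where the namespace is not provisioned.
    """
    port = sb_ports.get(network_id)
    if port is None or not port.mac or not port.ips:
        return None
    return (port.mac, port.ips)


net = "9029c393-5c40-4bf2-beec-27413417eafa"
sb_ports = {}  # toy view of the Southbound DB as seen by the agent

# Event handled before the metadata port reached the Southbound DB.
early = get_provision_params(sb_ports, net)

# The metadata port information arrives later (values are illustrative).
sb_ports[net] = MetadataPort(mac="fa:16:3e:00:00:01", ips=["10.0.0.2"])
late = get_provision_params(sb_ports, net)
```

In this model nothing retries the lookup after the early failure, which is how a namespace can stay missing even though the data shows up shortly afterwards.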

Tags: ovn sts
Changed in neutron:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Terry Wilson (otherwiseguy) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

We've done some internal testing, and what we see is that ovsdb-server, when there is a backlog of event notifications to send, batches updates and can merge "insert" and "update" operations that happen close together. This is intended behavior.

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent), which looks for "update" events, doesn't fire when we get a Port_Binding "create" that has the chassis field set.

I'm working on a fix.
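
For readers following along, here is a minimal sketch of the merged-notification problem described above. It is a toy model and not the ovsdbapp or neutron event API: Row, update_only_match, create_or_update_match and fired are hypothetical names, and events are plain (event, row, old) tuples. It compares a matcher that only reacts to "update" with one that also accepts a "create" whose chassis is already set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Row:
    """Hypothetical, simplified Port_Binding row."""
    logical_port: str
    chassis: Optional[str] = None


Event = Tuple[str, Row, Optional[Row]]  # (event type, new row, old row)


def update_only_match(event: str, row: Row, old: Optional[Row]) -> bool:
    """Fire only when an 'update' sets or changes the chassis."""
    return (event == "update" and old is not None
            and row.chassis is not None and old.chassis != row.chassis)


def create_or_update_match(event: str, row: Row, old: Optional[Row]) -> bool:
    """Also fire when a row is created with the chassis already set."""
    if event == "create":
        return row.chassis is not None
    return update_only_match(event, row, old)


def fired(match: Callable[[str, Row, Optional[Row]], bool],
          events: List[Event]) -> int:
    """Count how many notifications in the stream the matcher fires on."""
    return sum(1 for ev, row, old in events if match(ev, row, old))


# Normal case: the insert and the chassis update arrive separately.
unmerged = [
    ("create", Row("port-1"), None),
    ("update", Row("port-1", "chassis-1"), Row("port-1")),
]
# Backlog case: both operations are merged into a single "create".
merged = [("create", Row("port-1", "chassis-1"), None)]
```

With the unmerged stream both matchers fire once. With the merged stream the update-only matcher never fires, which corresponds to the missed binding, while the create-aware matcher still fires once.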

Changed in neutron:
assignee: Lucas Alvares Gomes (lucasagomes) → Terry Wilson (otherwiseguy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/903796

Revision history for this message
yatin (yatinkarel) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

<< What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

The behavior looks similar to what we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2214289 for some LogicalSwitch events

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/904715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/904716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903796
Committed: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
Submitter: "Zuul (22348)"
Branch: master

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904715
Committed: https://opendev.org/openstack/neutron/commit/e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904716
Committed: https://opendev.org/openstack/neutron/commit/b992d639b974f35612d6bb0057f35c452129aed3
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b992d639b974f35612d6bb0057f35c452129aed3
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 24.0.0.0b1

This issue was fixed in the openstack/neutron 24.0.0.0b1 development milestone.

Revision history for this message
Seyeong Kim (seyeongkim) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

A customer has a similar issue, although I can't reproduce it in my local environment. I prepared a debdiff for Yoga.
Our support engineer pointed this out (patch 2), and it makes sense to backport.
As the description says, it happens intermittently under high load. The customer has also hit this a few times and cannot reproduce it on demand.

There are two commits inside the debdiff file

[PATCH 1/2] ovn-metadata: Refactor events
[PATCH 2/2] Handle creation of Port_Binding with chassis set

Patch 1 is needed because of massive conflicts otherwise.
Also, I removed the changes to neutron/agent/ovn/extensions/qos_hwol.py from commit 2.

This could be the code I need to be careful with.

Releases 2023.1 and later already have the above patches.

tags: added: sts
Changed in neutron (Ubuntu):
status: New → Fix Released
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Focal):
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Seyeong Kim (seyeongkim)
Hua Zhang (zhhuabj)
summary: - OVN: ovnmeta namespaces missing during scalability test causing DHCP
- issues
+ [SRU] OVN: ovnmeta namespaces missing during scalability test causing
+ DHCP issues
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: Seyeong Kim (seyeongkim) → nobody
Changed in neutron (Ubuntu Focal):
assignee: Seyeong Kim (seyeongkim) → nobody
Revision history for this message
Hua Zhang (zhhuabj) wrote :

After carefully reviewing many patches, I finally backported the following 3 patches to Yoga.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

I didn't backport the following patch edf48e46a1,

edf48e46a1 Improve agent provision performance for large networks

as doing so will introduce more dependent patches. Since we opted not to backport this patch, we have to address the code conflict within the provision_datapath function in patch 6205158831. The process of resolving code conflicts for this backport can be found here - https://paste.ubuntu.com/p/2V3SXVvsHx/

Due to the absence of a local reproducer, I only built a test package [1] and verified basic network functions. The test results [2] indicate successful functionality.

[1] https://launchpad.net/~zhhuabj/+archive/ubuntu/focal-yoga-test
[2] https://paste.ubuntu.com/p/m9vp3TJgyv/
