[SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

Bug #2017748 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects               Status                    Importance  Assigned to   Milestone
Ubuntu Cloud Archive  New                       Undecided   Unassigned
  Yoga                Fix Committed             Undecided   Hua Zhang
neutron               Status tracked in Ussuri
  Ussuri              Fix Released              High        Terry Wilson
  Victoria            New                       Undecided   Unassigned
  Wallaby             New                       Undecided   Unassigned
  Xena                New                       Undecided   Unassigned
neutron (Ubuntu)      Fix Released              Undecided   Unassigned
  Focal               New                       Undecided   Unassigned
  Jammy               New                       Undecided   Unassigned

Bug Description

[Impact]

ovnmeta- namespaces are intermittently missing, and the VMs on the affected networks then cannot be reached.

[Test Case]
TBD
- Not able to reproduce this easily.

[Where problems could occur]
These patches relate to the OVN metadata agent on compute nodes.
VM connectivity could be affected by this patch when OVN is used.
Binding a port to a datapath could be affected.

[Others]

== ORIGINAL DESCRIPTION ==

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

During a scalability test it was noted that a few VMs were having issues being pinged (2 out of ~5000 VMs in the test conducted). After some investigation it was found that the VMs in question did not receive a DHCP lease:

udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed

And the ovnmeta- namespaces for the networks that the VMs were booting from were missing. Looking into the ovn-metadata-agent.log:

2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495
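
A minimal sketch for listing every network named in this message across a large agent log. The regular expression is built from the wording of the DEBUG line above; the function name is illustrative and not part of neutron.

```python
import re

# Matches the DEBUG message quoted above and captures the network UUID.
NO_METADATA_PORT = re.compile(
    r"There is no metadata port for network "
    r"(?P<network>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def networks_without_metadata_port(log_lines):
    """Return network UUIDs named in 'no metadata port' messages.

    UUIDs are returned once each, in the order they first appear.
    """
    seen = {}
    for line in log_lines:
        match = NO_METADATA_PORT.search(line)
        if match:
            seen.setdefault(match.group("network"), None)
    return list(seen)
```

The function accepts any iterable of lines, so an open file handle for ovn-metadata-agent.log can be passed directly.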

Apparently, when the system is under stress (scalability tests), there are some edge cases where the metadata port information has not yet been propagated by OVN to the Southbound database. When the PortBindingChassisEvent event is handled and tries to find either the metadata port or the IP information on it (which is updated by ML2/OVN during subnet creation), it cannot be found, and the handler fails silently with the error shown above.

Note that running the same tests with less concurrency did not trigger this issue, so it only happens when the system is overloaded.
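
To check a compute host for the symptom, one could compare the namespaces that exist with the networks expected on that host. The sketch below assumes each namespace is named ovnmeta-<network UUID> (the naming has differed between releases, so verify it on the deployment) and that the list of expected network IDs is obtained separately. The helper names are hypothetical.

```python
import subprocess

NS_PREFIX = "ovnmeta-"  # prefix from this report; the suffix format is an assumption


def parse_netns_names(ip_netns_output):
    """Extract namespace names from `ip netns list` output.

    Each non-empty line is either 'name' or 'name (id: N)'.
    """
    names = set()
    for line in ip_netns_output.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def missing_metadata_namespaces(network_ids, ip_netns_output):
    """Return the network IDs that have no matching ovnmeta- namespace."""
    existing = parse_netns_names(ip_netns_output)
    return [net for net in network_ids if NS_PREFIX + net not in existing]


def list_netns():
    """Run `ip netns list` on the local host and return its stdout."""
    result = subprocess.run(
        ["ip", "netns", "list"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a compute host this would be called as missing_metadata_namespaces(expected_ids, list_netns()); keeping the parsing separate from the subprocess call makes it possible to test against saved output.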

Changed in neutron:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/881487
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Terry Wilson (otherwiseguy) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

We've done some internal testing and what we see is that ovsdb-server, when there is a backlog sending out event notifications, batches updates and can merge "insert" and "update" operations that happen close together. This is intended behavior.

What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent), which looks for "update" events, doesn't fire when we get a Port_Binding "create" that has the chassis field set.

I'm working on a fix.
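
A simplified model of the behavior described in this comment, in plain Python rather than the real ovsdbapp event classes. The Notification type and the matching function are hypothetical stand-ins; they only illustrate why a handler that matches "update" notifications never fires once the insert and update have been merged into a single "create".

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    event: str                                # "create" or "update"
    row: dict                                 # column values after the change
    old: dict = field(default_factory=dict)   # previous values of changed columns


def update_only_match(note, chassis):
    """Fire only when an update sets the chassis column to our chassis."""
    return (
        note.event == "update"
        and "chassis" in note.old
        and note.row.get("chassis") == chassis
    )


def unmerged_stream(port, chassis):
    """Normal case: the row is created unbound, then updated with a chassis."""
    return [
        Notification("create", {"logical_port": port, "chassis": None}),
        Notification(
            "update",
            {"logical_port": port, "chassis": chassis},
            {"chassis": None},
        ),
    ]


def merged_stream(port, chassis):
    """Backlog case: a single create arrives with the chassis already set."""
    return [Notification("create", {"logical_port": port, "chassis": chassis})]


def fired(stream, chassis):
    """Return the event types in the stream that the handler reacts to."""
    return [note.event for note in stream if update_only_match(note, chassis)]
```

With the unmerged stream the handler reacts to the "update"; with the merged stream it reacts to nothing, which in this model corresponds to the namespace never being provisioned.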

Changed in neutron:
assignee: Lucas Alvares Gomes (lucasagomes) → Terry Wilson (otherwiseguy)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/903796

yatin (yatinkarel) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

<< What this means is that our PortBindingUpdatedEvent (or PortBindingChassisCreatedEvent) which looks for "update" events don't fire when we get a Port_Binding "create" that has the chassis field set.

The behavior looks similar to what we saw in https://bugzilla.redhat.com/show_bug.cgi?id=2214289 for some LogicalSwitch events

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/904715

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/904716

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903796
Committed: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
Submitter: "Zuul (22348)"
Branch: master

commit a641e8aec09c1e33a15a34b19d92675ed2c85682
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
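
The idea in this commit message can be illustrated with a small stand-alone predicate. This is not the code from the change; the function name and argument shapes are assumptions chosen for illustration, and the real agent works with ovsdbapp row events.

```python
def binds_to_chassis(event, row, old, chassis):
    """Return True when a Port_Binding notification binds the port to `chassis`.

    `row` holds the current column values and `old` holds the previous
    values of the columns that changed (empty for a create).
    """
    if row.get("chassis") != chassis:
        return False
    if event == "create":
        # Merged insert+update: the row arrives already bound.
        return True
    if event == "update":
        # React only when the chassis column actually changed.
        return "chassis" in old and old["chassis"] != chassis
    return False
```

Treating a "create" with the chassis already set the same way as an "update" that sets it means the outcome no longer depends on whether ovsdb-server merged the two notifications.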

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904715
Committed: https://opendev.org/openstack/neutron/commit/e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit e9cf2fd6cca8a3d5c06bcb073cb310cd61208b41
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/904716
Committed: https://opendev.org/openstack/neutron/commit/b992d639b974f35612d6bb0057f35c452129aed3
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b992d639b974f35612d6bb0057f35c452129aed3
Author: Terry Wilson <email address hidden>
Date: Fri Dec 15 21:00:43 2023 +0000

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748

    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c
    (cherry picked from commit a641e8aec09c1e33a15a34b19d92675ed2c85682)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 24.0.0.0b1

This issue was fixed in the openstack/neutron 24.0.0.0b1 development milestone.

Seyeong Kim (seyeongkim) wrote : Re: OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

A customer has a similar issue, although I can't reproduce it in my local environment. I prepared a debdiff for Yoga.
Our support engineer pointed this out (patch 2) and it makes sense to backport.
As the description says, it happens intermittently under high load. The customer has also hit this a few times and cannot reproduce it on demand.

There are two commits inside the debdiff file:

[PATCH 1/2] ovn-metadata: Refactor events
[PATCH 2/2] Handle creation of Port_Binding with chassis set

Patch 1 is needed because of massive conflicts.
Also, I removed the changes to neutron/agent/ovn/extensions/qos_hwol.py from commit 2.

This is the part of the code I need to be careful with.

Releases from 2023.1 onward already have the above patches.

tags: added: sts
Changed in neutron (Ubuntu):
status: New → Fix Released
Seyeong Kim (seyeongkim)
description: updated
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Focal):
assignee: nobody → Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: nobody → Seyeong Kim (seyeongkim)
Hua Zhang (zhhuabj)
summary: - OVN: ovnmeta namespaces missing during scalability test causing DHCP
- issues
+ [SRU] OVN: ovnmeta namespaces missing during scalability test causing
+ DHCP issues
Seyeong Kim (seyeongkim)
Changed in neutron (Ubuntu Jammy):
assignee: Seyeong Kim (seyeongkim) → nobody
Changed in neutron (Ubuntu Focal):
assignee: Seyeong Kim (seyeongkim) → nobody
Hua Zhang (zhhuabj) wrote :

After carefully reviewing many patches, I finally backported the following 3 patches to Yoga.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

I didn't backport the following patch edf48e46a1,

edf48e46a1 Improve agent provision performance for large networks

as doing so will introduce more dependent patches. Since we opted not to backport this patch, we have to address the code conflict within the provision_datapath function in patch 6205158831. The process of resolving code conflicts for this backport can be found here - https://paste.ubuntu.com/p/2V3SXVvsHx/

Due to the absence of a local reproducer, I only built a test package [1] and verified basic network functions. The test results [2] indicate successful functionality.

[1] https://launchpad.net/~zhhuabj/+archive/ubuntu/focal-yoga-test
[2] https://paste.ubuntu.com/p/m9vp3TJgyv/

Hua Zhang (zhhuabj) wrote :
Brian Haley (brian-haley) wrote :

Sorry, just clicked the wrong buttons, trying to get this targeted to the UCA back to Ussuri.

no longer affects: neutron
Hua Zhang (zhhuabj)
no longer affects: cloud-archive/zed
Hua Zhang (zhhuabj) wrote :
Hua Zhang (zhhuabj) wrote :

One of our customers helped test focal_yoga.debdiff on one isolated compute host in their test environment. After installation they created about 100 networks, routers and VMs, which were spawned on this isolated compute host. They haven't seen any issues so far with VM creation (all the VMs were created successfully).

Considering this backport involves code refactoring, I do not intend to backport it to Ussuri.

Hua Zhang (zhhuabj) wrote :

Please pause the current SRU work for now, as I encountered a TypeError (https://paste.ubuntu.com/p/bKh59QJJf8/) when testing the following 3 backported patches.

[PATCH 1/3] 686698284b Update tap ip in metadata agent when metadata port ip updated
[PATCH 2/3] 6205158831 ovn-metadata: Refactor events
[PATCH 3/3] b992d639b9 Handle creation of Port_Binding with chassis set

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.2.0

This issue was fixed in the openstack/neutron 22.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 23.2.0

This issue was fixed in the openstack/neutron 23.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (unmaintained/zed)

Fix proposed to branch: unmaintained/zed
Review: https://review.opendev.org/c/openstack/neutron/+/926666

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926656
Committed: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051
Submitter: "Zuul (22348)"
Branch: unmaintained/yoga

commit 952e960414e7c15d4d4351bf2300ce53a69e4051
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (unmaintained/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/926666
Committed: https://opendev.org/openstack/neutron/commit/7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Submitter: "Zuul (22348)"
Branch: unmaintained/zed

commit 7bfbd4c88ff02000da73b1455cb43fb4f2c72107
Author: Terry Wilson <email address hidden>
Date: Tue Aug 20 10:20:52 2024 -0500

    Handle creation of Port_Binding with chassis set

    When there is a backlog of notifications to be sent, it is possible
    that ovsdb-server will merge insert and update notifications. Due
    to this, we need to handle the situation where we see a Port_Binding
    created with the chassis set.

    Closes-Bug: #2017748
    Change-Id: Idfae87cf6c60e9e18ede91ea20857cea5322738c

tags: added: in-unmaintained-zed