OVN maintenance tasks may be delayed 10 minutes in the podified deployment
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Slawek Kaplonski |
Bug Description
When running the Neutron server on a K8s (or OpenShift) cluster, it may happen that OVN maintenance periodic tasks which are supposed to run immediately are delayed by about 10 minutes. This happens, for example, when Neutron's configuration is changed and K8s restarts the Neutron pods. What happens in such a case is:
1. pods with the neutron-api application are running,
2. the configuration is updated and k8s first starts new pods and, once the new ones are ready, terminates the old pods,
3. during that time, the neutron-server process which runs in the new pod starts the maintenance worker, which immediately tries to run the tasks defined with the "periodics" decorator,
4. this new pod doesn't yet have the lock on the OVN northbound DB, so each of these maintenance tasks stops immediately,
5. a few seconds later the OLD neutron-server pod is terminated by k8s and the new pod (the one described in point 3) gets the lock on the OVN database,
6. now all maintenance tasks are run again by the maintenance worker after the time defined in the "spacing" parameter, which is 600 seconds. That is a pretty long time to wait for e.g. some options in the OVN database to be adjusted to the new Neutron configuration.
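For illustration, here is a minimal sketch (not Neutron's actual code) of the pattern described above, built on the futurist library's periodics API that the maintenance worker uses. With run_immediately=True the task fires once at startup, returns early because the lock is not held, and is then only retried after the 600 second spacing; the has_lock callable here is a hypothetical stand-in for the real OVN NB lock check.

```python
from futurist import periodics


class MaintenanceSketch(object):
    def __init__(self, has_lock):
        # has_lock is a hypothetical callable standing in for the
        # check of the OVN northbound DB lock.
        self.has_lock = has_lock

    @periodics.periodic(spacing=600, run_immediately=True)
    def sync_ovn_config(self):
        if not self.has_lock():
            # No lock yet (the old pod still holds it): bail out.
            # The task will not be retried for another 600 seconds.
            return
        # ... the actual maintenance work would happen here ...
```

Such an object would be driven by something like periodics.PeriodicWorker.create([MaintenanceSketch(...)]) followed by start() on the worker.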
We could reduce this spacing time to e.g. 5 seconds. That would decrease this additional waiting time significantly in the case described in this bug. It would also make all those methods be called much more often in neutron-server processes which don't have the lock granted, but we could introduce an additional parameter for that and e.g. raise NeverAgain() after 100 attempts to run such a periodic task.
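A hedged sketch of what that could look like, again using the futurist periodics API. The 5 second spacing and the 100-attempt limit are just the example values mentioned above, and MAX_ATTEMPTS is a hypothetical new parameter, not an existing option:

```python
from futurist import periodics

MAX_ATTEMPTS = 100  # hypothetical new parameter


class ProposedSketch(object):
    def __init__(self, has_lock):
        self.has_lock = has_lock  # hypothetical lock-state callable
        self._attempts = 0

    @periodics.periodic(spacing=5, run_immediately=True)
    def sync_ovn_config(self):
        self._attempts += 1
        if self._attempts > MAX_ATTEMPTS:
            # Stop rescheduling this task in processes that never
            # managed to get the lock.
            raise periodics.NeverAgain()
        if not self.has_lock():
            return
        # ... maintenance work; one-shot tasks would also raise
        # NeverAgain() here once they have completed successfully ...
```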
This is just food for thought, and perhaps we can also discuss it on the mailing list or at the next PTG, but maybe we should consider a way to decouple the maintenance and periodic tasks from Neutron in podified environments. There is always only one maintenance process active in the cluster: the one that holds the lock. In a podified environment we could have just one pod running the tasks, avoiding locking and relying on the underlying k8s functionality to take care of the pod lifecycle, meaning that there would always be one healthy pod in the cluster executing the periodic/maintenance routines.
That way we would also solve the problem described in this bug, plus perhaps other potential issues that may arise because of the podified nature of the deployment.
What do you think?