OVN maintenance tasks may be delayed 10 minutes in the podified deployment

Bug #2074209 reported by Slawek Kaplonski
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Slawek Kaplonski

Bug Description

When Neutron server runs on a K8s (or OpenShift) cluster, OVN maintenance periodic tasks that are supposed to run immediately may be delayed by about 10 minutes. This can happen, for example, when Neutron's configuration is changed and K8s restarts the neutron pods. What happens in such a case is:

1. Pods with the neutron-api application are running.
2. The configuration is updated; k8s first starts new pods and, once the new ones are ready, terminates the old pods.
3. During that time, the neutron-server process in the new pod starts its maintenance worker, which immediately tries to run the tasks defined with the "periodics.periodic(spacing=600, run_immediately=True)" decorator.
4. The new pod doesn't yet hold the lock on the OVN northbound DB, so each of those maintenance tasks exits immediately.
5. A few seconds later the OLD neutron-server pod is terminated by k8s, and the new pod (the one started in point 3) gets the lock on the OVN database.
6. The maintenance worker now runs all maintenance tasks again only after the time defined in the "spacing" parameter, which is 600 seconds. 600 seconds is a long time to wait for, e.g., options in the OVN database to be adjusted to the new Neutron configuration.
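
The timing in the steps above can be illustrated with a small model. This is only a sketch under simplifying assumptions: runs are assumed to fire at exactly 0, spacing, 2 * spacing, ... seconds after the worker starts, and the function name and the 10-second lock handover are hypothetical values chosen for illustration, not taken from Neutron.

```python
import math


def first_run_with_lock(spacing: float, lock_acquired_at: float) -> float:
    """Return the time (seconds since worker start) of the first task run
    that happens while the worker already holds the lock.

    Assumption: runs fire at 0, spacing, 2 * spacing, ... which is a
    simplification of run_immediately=True combined with a fixed spacing.
    """
    if lock_acquired_at <= 0:
        return 0.0
    # Smallest multiple of `spacing` that is not earlier than the lock time.
    return float(math.ceil(lock_acquired_at / spacing) * spacing)


# The new pod starts at t=0; the old pod releases the lock about 10 s later.
delay = first_run_with_lock(spacing=600, lock_acquired_at=10)
print(f"first effective run at t={delay:.0f}s")
```

In this model the run at t=0 is wasted because the lock is still held by the old pod, so the first useful run only happens at the next multiple of the spacing.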

We could reduce this spacing time to e.g. 5 seconds. That would significantly decrease the additional waiting time in the case described in this bug. It would also cause all those methods to be called much more often in neutron-server processes that are never granted the lock, but we could introduce an additional parameter for that and, e.g., raise NeverAgain() after 100 attempts to run such a periodic task.
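
A minimal sketch of that idea, assuming a hypothetical RunOnceTask wrapper and a local NeverAgain class standing in for futurist.periodics.NeverAgain so that the example runs without extra dependencies. The exact point at which the limit triggers is a choice made for this sketch, not a description of the merged patch.

```python
class NeverAgain(Exception):
    """Stand-in for futurist.periodics.NeverAgain (stops rescheduling)."""


class RunOnceTask:
    """Hypothetical one-shot task that gives up after `max_attempts`
    attempts made without holding the lock."""

    def __init__(self, work, max_attempts=100):
        self._work = work
        self._max_attempts = max_attempts
        self.attempts_without_lock = 0

    def __call__(self, has_lock: bool) -> None:
        if not has_lock:
            self.attempts_without_lock += 1
            if self.attempts_without_lock >= self._max_attempts:
                # This worker never got the lock; stop polling for good.
                raise NeverAgain()
            return  # try again after the next (short) spacing interval
        # Lock is held: do the one-time work, then never run again.
        self._work()
        raise NeverAgain()
```

A scheduler would call the task every 5 seconds and drop it once NeverAgain is raised, either because the work was done or because the attempt limit was reached.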

Tags: ovn
Jakub Libosvar (libosvar) wrote :

This is just food for thought; perhaps we can also discuss this on the mailing list or at the next PTG, but maybe we should consider a way to decouple the maintenance and periodic tasks from Neutron in podified environments. There is always only one maintenance process active in the cluster: the one that holds the lock. In a podified environment we could have just one pod running the tasks, avoiding locking and relying on the underlying k8s functionality to take care of the pod lifecycle, meaning that there would always be one healthy pod in the cluster executing the periodic/maintenance routines.

That way we would also solve the problem in this bug, plus perhaps other potential issues that may arise because of the podified nature.

What do you think?

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Jakub:

With the WSGI implementation, any worker that is not an API thread is spawned in a separate process. For example, in [1] you can see neutron-api, neutron-ovn-maintenance-worker, neutron-periodic-workers and neutron-rpc-server. As you suggested, the neutron-ovn-maintenance-worker runs in a single process and can be configured as a pod with a single replica.

In any case, the issue reported by Slawek will also affect a deployment that uses a single container for the maintenance task: the new container is spawned before the old one is stopped and removed, so the reported problem still occurs. Reducing the spacing of the methods that should be executed only once, at the beginning of the process, could solve it.

Regards.

[1] https://03399ed3bb928f8e37fb-954a4196d912d707c769d8596124df5e.ssl.cf1.rackcdn.com/924317/3/experimental/neutron-tempest-plugin-api-ovn-wsgi/ba8289e/controller/logs/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/925178

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/925179

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/925194

Changed in neutron:
status: Confirmed → In Progress
Jakub Libosvar (libosvar) wrote :

Thanks Rodolfo, I didn't know we already had this functionality in place; that's great news.

My suggestion, though, was to not use the locks at all when running the maintenance task in Kubernetes. The underlying infrastructure should ensure that there is only a single maintenance process and that only one is up and running at a time.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925178
Committed: https://opendev.org/openstack/neutron/commit/35d0535e74682ae1d91a0c8dec8a925e7b8f64a4
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 35d0535e74682ae1d91a0c8dec8a925e7b8f64a4
Author: Terry Wilson <email address hidden>
Date: Tue Sep 26 09:15:51 2023 -0500

    Add has_lock_periodic decorator for OVN Maintenance

    Almost all of the periodic methods in ml2/OVN's maintenance code
    exit early if the IDL lock is not held. This adds a decorator
    that wraps the futurist periodic that exits if the lock isn't held.

    There are still a few methods that exit raising NeverAgain if
    certain features aren't available, and those have been left alone
    for now.

    Conflicts:
        neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py

    Partial-bug: #2074209

    Change-Id: I9771bd4f76a9ec073afeeb80a787832102446cd6
    (cherry picked from commit 1d7e99bc0f10509166b240ad503e649fef4ebab3)
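
The idea in this commit message, a decorator that makes a periodic method exit early when the lock is not held, could look roughly like the sketch below. The names skip_without_lock and Maintenance are hypothetical, and the sketch deliberately leaves out the futurist periodic wrapping so that it runs with the standard library only; it is not the code from the patch.

```python
import functools


def skip_without_lock(func):
    """Hypothetical guard: run `func` only when `self.has_lock` is true."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not self.has_lock:
            return None  # another worker holds the lock; do nothing
        return func(self, *args, **kwargs)

    return wrapper


class Maintenance:
    """Toy stand-in for a maintenance worker (not Neutron's class)."""

    def __init__(self, has_lock):
        self.has_lock = has_lock
        self.calls = 0

    @skip_without_lock
    def check_something(self):
        self.calls += 1
        return "done"
```

Centralizing the check in a decorator removes the repeated "if not self.has_lock: return" preamble from each method, which is the duplication the commit message describes.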

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925179
Committed: https://opendev.org/openstack/neutron/commit/15f60beefbf0859ecbc17b552fabc489f1b4dad9
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 15f60beefbf0859ecbc17b552fabc489f1b4dad9
Author: Terry Wilson <email address hidden>
Date: Tue Sep 26 09:15:51 2023 -0500

    Add has_lock_periodic decorator for OVN Maintenance

    Almost all of the periodic methods in ml2/OVN's maintenance code
    exit early if the IDL lock is not held. This adds a decorator
    that wraps the futurist periodic that exits if the lock isn't held.

    There are still a few methods that exit raising NeverAgain if
    certain features aren't available, and those have been left alone
    for now.

    Conflicts:
        neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py

    Partial-bug: #2074209

    Change-Id: I9771bd4f76a9ec073afeeb80a787832102446cd6
    (cherry picked from commit 1d7e99bc0f10509166b240ad503e649fef4ebab3)
    (cherry picked from commit 35d0535e74682ae1d91a0c8dec8a925e7b8f64a4)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925194
Committed: https://opendev.org/openstack/neutron/commit/04c217bcd0eda07d52a60121b6f86236ba6e26ee
Submitter: "Zuul (22348)"
Branch: master

commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jul 30 14:17:44 2024 +0200

    Lower spacing time of the OVN maintenance tasks which should be run once

    Some of the OVN maintenance tasks are expected to be run just once and
    then they raise periodic.NeverAgain() to not be run anymore. Those tasks
    also require the OVN DB lock to be acquired so that only one of the
    maintenance workers really runs them.
    All those tasks had set 600 seconds as a spacing time so they were run
    every 600 seconds. This works fine usually but that may cause small
    issue in environments where Neutron is run in a POD as a k8s/openshift
    application. In such case, when e.g. configuration of neutron is
    updated, it may happen that first new POD with Neutron is spawned and
    only once it is already running, k8s will stop old POD. Because of that
    maintenance worker running in the new neutron-server POD will not
    acquire lock on the OVN DB (old POD still holds the lock) and will not
    run all those maintenance tasks immediately. After old POD will be
    terminated, one of the new PODs will at some point acquire that lock and
    then will run all those maintenance tasks but this would cause 600
    seconds delay in running them.

    To avoid such long waiting time to run those maintenance tasks, this
    patch lowers its spacing time from 600 to just 5 seconds.
    Additionally maintenance tasks which are supposed to be run only once and
    only by the maintenance worker which has acquired ovn db lock will now be
    stopped (periodic.NeverAgain will be raised) after 100 attempts of
    run.
    This will avoid running them every 5 seconds forever on the workers
    which don't acquire lock on the OVN DB at all.

    Closes-bug: #2074209
    Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2
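
The effect of the two changes described in this commit message (5-second spacing and a limit of 100 attempts) can be checked with a small simulation. This is a sketch under assumptions: the simulate function, its parameters and the 12-second lock handover are hypothetical, and attempts are modelled as firing at exact multiples of the spacing.

```python
def simulate(lock_acquired_at, spacing=5, max_attempts=100, horizon=3600):
    """Return (outcome, time_in_seconds) for a single worker.

    lock_acquired_at: seconds after start when the worker gets the lock,
    or None if it never does.
    """
    attempts = 0
    t = 0
    while t <= horizon:
        if lock_acquired_at is not None and t >= lock_acquired_at:
            return ("ran", t)  # lock is held: the one-time task runs now
        attempts += 1
        if attempts >= max_attempts:
            return ("gave up", t)  # models raising NeverAgain
        t += spacing
    return ("pending", horizon)


print(simulate(lock_acquired_at=12))               # new spacing, lock after 12 s
print(simulate(lock_acquired_at=12, spacing=600))  # old spacing, same handover
print(simulate(lock_acquired_at=None))             # worker that never gets the lock
```

In this model a worker that receives the lock only after the attempt window has closed would already have given up, so the limit trades a bounded amount of polling against that case.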

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/neutron/+/925709

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.2)

Fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/neutron/+/925710

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/925711

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925709
Committed: https://opendev.org/openstack/neutron/commit/78900c12f3c398b9f974531bbf61ce5c8e852770
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit 78900c12f3c398b9f974531bbf61ce5c8e852770
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jul 30 14:17:44 2024 +0200

    Lower spacing time of the OVN maintenance tasks which should be run once

    Some of the OVN maintenance tasks are expected to be run just once and
    then they raise periodic.NeverAgain() to not be run anymore. Those tasks
    also require the OVN DB lock to be acquired so that only one of the
    maintenance workers really runs them.
    All those tasks had set 600 seconds as a spacing time so they were run
    every 600 seconds. This works fine usually but that may cause small
    issue in environments where Neutron is run in a POD as a k8s/openshift
    application. In such case, when e.g. configuration of neutron is
    updated, it may happen that first new POD with Neutron is spawned and
    only once it is already running, k8s will stop old POD. Because of that
    maintenance worker running in the new neutron-server POD will not
    acquire lock on the OVN DB (old POD still holds the lock) and will not
    run all those maintenance tasks immediately. After old POD will be
    terminated, one of the new PODs will at some point acquire that lock and
    then will run all those maintenance tasks but this would cause 600
    seconds delay in running them.

    To avoid such long waiting time to run those maintenance tasks, this
    patch lowers its spacing time from 600 to just 5 seconds.
    Additionally maintenance tasks which are supposed to be run only once and
    only by the maintenance worker which has acquired ovn db lock will now be
    stopped (periodic.NeverAgain will be raised) after 100 attempts of
    run.
    This will avoid running them every 5 seconds forever on the workers
    which don't acquire lock on the OVN DB at all.

    Conflicts:
        neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py

    Closes-bug: #2074209
    Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2
    (cherry picked from commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925711
Committed: https://opendev.org/openstack/neutron/commit/406b1e00c4caa96b7933a6701ded255ac7874438
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 406b1e00c4caa96b7933a6701ded255ac7874438
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jul 30 14:17:44 2024 +0200

    Lower spacing time of the OVN maintenance tasks which should be run once

    Some of the OVN maintenance tasks are expected to be run just once and
    then they raise periodic.NeverAgain() to not be run anymore. Those tasks
    also require the OVN DB lock to be acquired so that only one of the
    maintenance workers really runs them.
    All those tasks had set 600 seconds as a spacing time so they were run
    every 600 seconds. This works fine usually but that may cause small
    issue in environments where Neutron is run in a POD as a k8s/openshift
    application. In such case, when e.g. configuration of neutron is
    updated, it may happen that first new POD with Neutron is spawned and
    only once it is already running, k8s will stop old POD. Because of that
    maintenance worker running in the new neutron-server POD will not
    acquire lock on the OVN DB (old POD still holds the lock) and will not
    run all those maintenance tasks immediately. After old POD will be
    terminated, one of the new PODs will at some point acquire that lock and
    then will run all those maintenance tasks but this would cause 600
    seconds delay in running them.

    To avoid such long waiting time to run those maintenance tasks, this
    patch lowers its spacing time from 600 to just 5 seconds.
    Additionally maintenance tasks which are supposed to be run only once and
    only by the maintenance worker which has acquired ovn db lock will now be
    stopped (periodic.NeverAgain will be raised) after 100 attempts of
    run.
    This will avoid running them every 5 seconds forever on the workers
    which don't acquire lock on the OVN DB at all.

    Conflicts:
        neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py

    Closes-bug: #2074209
    Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2
    (cherry picked from commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee)
    (cherry picked from commit 78900c12f3c398b9f974531bbf61ce5c8e852770)
    (cherry picked from commit 910821805907a1800d0dd875e9af41352c80cf5a)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/2023.2)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/925710
Committed: https://opendev.org/openstack/neutron/commit/910821805907a1800d0dd875e9af41352c80cf5a
Submitter: "Zuul (22348)"
Branch: stable/2023.2

commit 910821805907a1800d0dd875e9af41352c80cf5a
Author: Slawek Kaplonski <email address hidden>
Date: Tue Jul 30 14:17:44 2024 +0200

    Lower spacing time of the OVN maintenance tasks which should be run once

    Some of the OVN maintenance tasks are expected to be run just once and
    then they raise periodic.NeverAgain() to not be run anymore. Those tasks
    also require the OVN DB lock to be acquired so that only one of the
    maintenance workers really runs them.
    All those tasks had set 600 seconds as a spacing time so they were run
    every 600 seconds. This works fine usually but that may cause small
    issue in environments where Neutron is run in a POD as a k8s/openshift
    application. In such case, when e.g. configuration of neutron is
    updated, it may happen that first new POD with Neutron is spawned and
    only once it is already running, k8s will stop old POD. Because of that
    maintenance worker running in the new neutron-server POD will not
    acquire lock on the OVN DB (old POD still holds the lock) and will not
    run all those maintenance tasks immediately. After old POD will be
    terminated, one of the new PODs will at some point acquire that lock and
    then will run all those maintenance tasks but this would cause 600
    seconds delay in running them.

    To avoid such long waiting time to run those maintenance tasks, this
    patch lowers its spacing time from 600 to just 5 seconds.
    Additionally maintenance tasks which are supposed to be run only once and
    only by the maintenance worker which has acquired ovn db lock will now be
    stopped (periodic.NeverAgain will be raised) after 100 attempts of
    run.
    This will avoid running them every 5 seconds forever on the workers
    which don't acquire lock on the OVN DB at all.

    Conflicts:
        neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py
        neutron/tests/unit/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_maintenance.py

    Closes-bug: #2074209
    Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2
    (cherry picked from commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee)
    (cherry picked from commit 78900c12f3c398b9f974531bbf61ce5c8e852770)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.2.1

This issue was fixed in the openstack/neutron 22.2.1 release.
