[ovn] Agent liveness checks are flaky and report false positives

Bug #1860436 reported by Daniel Alvarez
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Daniel Alvarez

Bug Description

The way the networking-ovn mechanism driver performs health checks on agents reports false positives (live agents shown as dead) because of a race condition:

1) neutron-server increments nb_cfg in the NB_Global table from X to X+1.
2) neutron-server almost immediately checks all the Chassis rows to see if they have written X+1. [1]
3) neutron-server processes the updates from each agent (from X to X+1).

*Most* of the time, the condition in step 2 does not hold, so the timestamp is not updated. As a result, after 60 seconds (the default agent timeout), the agent is shown as dead. Sometimes 3) happens before 2), so the timestamp gets updated and all is fine, but this is not the normal case:

1) Bump of nb_cfg
2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916

2) Check of each chassis ext_id against our new bumped nb_cfg:
2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915

3) Processing updates [3] in the ChassisEvent (some are even older!)
2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916
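
The ordering seen in these logs can be reproduced with a toy model. This is only a sketch: the class and method names are hypothetical and do not come from networking-ovn, and the model assumes the strict equality check described in step 2.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy model of the liveness check; every name here is hypothetical."""
    nb_cfg: int                 # value stored in NB_Global
    chassis_cfg: dict           # chassis name -> last nb_cfg seen by the server
    refreshed: set = field(default_factory=set)

    def bump(self):
        # Step 1: increment nb_cfg from X to X+1.
        self.nb_cfg += 1

    def check(self):
        # Step 2: strict comparison, as described in the report.
        for name, seen in self.chassis_cfg.items():
            if seen == self.nb_cfg:
                self.refreshed.add(name)

    def process_update(self, name, value):
        # Step 3: a chassis update finally reaches the server.
        self.chassis_cfg[name] = value


server = ToyServer(nb_cfg=36915,
                   chassis_cfg={"chassis-1": 36915, "chassis-2": 36915})
server.bump()                               # 36915 -> 36916
server.check()                              # runs before any update arrives
server.process_update("chassis-1", 36916)   # too late for this check
server.process_update("chassis-2", 36916)
print(sorted(server.refreshed))             # no chassis had its timestamp refreshed
```

Swapping the `check()` call with the two `process_update()` calls gives the less common ordering in which every timestamp is refreshed.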

IMO, we need to space the bump of nb_cfg [2] and the check [3] in time, as the NB_Global change needs to be propagated to the SB, processed by all agents, and then sent back to neutron-server, which needs to process the JSON and update its internal tables. So even if it's fast, most of the time it is not fast enough.
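
One possible way to read "spacing in time" is to verify the previous bump before issuing the next one, so that each value has a full check interval to propagate. The sketch below assumes that reading; `SpacedPinger` and `ping` are hypothetical names, not networking-ovn code.

```python
class SpacedPinger:
    """Hypothetical: verify the previous bump, then issue the next one."""

    def __init__(self, nb_cfg):
        self.nb_cfg = nb_cfg

    def ping(self, chassis_cfg):
        # Compare against the value bumped on the previous call, which has
        # had a whole interval to travel to the agents and back.
        alive = {name for name, seen in chassis_cfg.items()
                 if seen == self.nb_cfg}
        # Only now bump again; this new value is checked on the next call.
        self.nb_cfg += 1
        return alive


pinger = SpacedPinger(nb_cfg=36916)
alive = pinger.ping({"chassis-1": 36916, "chassis-2": 36915})
print(sorted(alive), pinger.nb_cfg)
```

With this ordering a chassis is only marked alive for a value it had time to acknowledge, at the cost of detecting a dead agent one interval later.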

Another solution is to tolerate a difference of 1 between the global nb_cfg and the value read from a chassis when deciding whether to update its timestamp.
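
A minimal sketch of such a relaxed comparison is shown below. The function name and the `tolerance` parameter are assumptions made for illustration; this is not the code that was merged.

```python
def should_refresh(global_nb_cfg: int, chassis_nb_cfg: int,
                   tolerance: int = 1) -> bool:
    """Return True if the chassis value is close enough to count as alive.

    Hypothetical helper: a chassis that is at most `tolerance` behind the
    global nb_cfg still gets its timestamp refreshed.
    """
    return global_nb_cfg - chassis_nb_cfg <= tolerance


# Values taken from the logs above: the chassis is one step behind.
print(should_refresh(36916, 36915))
# A chassis two steps behind is still treated as stale.
print(should_refresh(36916, 36914))
```

The trade-off is that an agent which stops responding is noticed one bump later than with a strict equality check.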

[0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
[2] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[3] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74

Tags: ovn
tags: added: ovn
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/703612

Changed in neutron:
assignee: nobody → Daniel Alvarez (dalvarezs)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/703612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=18410097f23a8e3d9cf33393b47d8b1a91020e4a
Submitter: Zuul
Branch: master

commit 18410097f23a8e3d9cf33393b47d8b1a91020e4a
Author: Daniel Alvarez <email address hidden>
Date: Tue Jan 21 14:26:22 2020 +0100

    [ovn] Agent liveness - allow time to propagate checks

    Right now neutron-server bumps the nb_cfg parameter in the NB_Global
    table, which needs to be propagated by northd to SB_Global,
    processed by agents, and written back into SB_Global.
    This requires processing by neutron-server, but unfortunately
    the server checks straight away and many times the value read
    is behind the expected value.

    All this results in frequent false positives showing dead agents
    when they are not.

    This patch is relaxing the checks by allowing a difference of 1
    between the read and expected values.

    Change-Id: Id91481b690ad569c5dcfa5bd404f497f591d729d
    Closes-Bug: 1860436
    Signed-off-by: Daniel Alvarez <email address hidden>

Changed in neutron:
status: In Progress → Fix Released
Changed in neutron:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn 7.1.0

This issue was fixed in the openstack/networking-ovn 7.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.
