[OVN] Too frequent agent health-checks cause stress on ovsdb-server

Bug #1861092 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Lucas Alvares Gomes
Milestone: (none listed)

Bug Description

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1795198

It looks like neutron-server is pinging the agents too frequently, judging by what is observed in the logs: nb_cfg is being bumped at an irregular rate.

For example, in this part of the log I could find 11 updates in less than 2 minutes:

2020-01-27 12:23:04.247 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49008, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49007) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:23:05.179 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49009, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49008) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:23:32.216 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49010, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49009) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:23:41.248 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49011, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49010) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:23:42.183 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49012, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49011) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:09.210 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49013, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49012) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:18.252 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49014, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49013) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:19.179 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49015, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49014) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:46.205 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49016, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49015) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:55.254 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49017, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49016) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
2020-01-27 12:24:56.177 43567 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: SbGlobalUpdateEvent(events=('update',), table='SB_Global', conditions=None, old_conditions=None) to row=SB_Global(ipsec=False, ssl=[], nb_cfg=49018, options={'mac_prefix': 'b2:64:0d'}, external_ids={}) old=SB_Global(nb_cfg=49017) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
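
To make the irregular rate easier to see, here is a minimal sketch that computes the gaps between the 11 updates quoted above. The timestamps are copied from the log excerpt; the helper name `gaps_in_seconds` is only illustrative and not part of neutron or ovsdbapp.

```python
from datetime import datetime

# Timestamps of the 11 nb_cfg updates quoted in the log excerpt above.
STAMPS = [
    "12:23:04.247", "12:23:05.179", "12:23:32.216", "12:23:41.248",
    "12:23:42.183", "12:24:09.210", "12:24:18.252", "12:24:19.179",
    "12:24:46.205", "12:24:55.254", "12:24:56.177",
]


def gaps_in_seconds(stamps):
    """Return the time between consecutive timestamps, in seconds."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(STAMPS)
print("gaps (s):", gaps)
print("shortest:", min(gaps), "longest:", max(gaps), "total span:", round(sum(gaps), 3))
```

The gaps range from under 1 second to about 27 seconds, so some checks fire almost back to back.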

This triggers overly frequent writes from *all* metadata agents and ovn-controllers in the cloud, which creates a lot of traffic. At scale, this can be a problem.

Imagine a 500-node deployment with one update every 10 seconds, as in the example above. That translates into 1K write transactions (1 metadata agent + 1 ovn-controller per node) to the SB database every 10 seconds, i.e. 100 transactions per second, each of which triggers a JSON-RPC update to every single client in the cloud.
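
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function name `sb_write_rate` and its parameters are made up for illustration.

```python
def sb_write_rate(nodes, writers_per_node, check_interval_s):
    """Estimate SB DB write load caused by one health check round.

    Each writer (metadata agent or ovn-controller) is assumed to issue
    one write transaction per health check.
    """
    writes_per_check = nodes * writers_per_node
    return writes_per_check, writes_per_check / check_interval_s


# 500 nodes, 1 metadata agent + 1 ovn-controller each, one check per 10 seconds.
writes, per_second = sb_write_rate(nodes=500, writers_per_node=2, check_interval_s=10)
print(writes, "writes per check,", per_second, "writes per second")
```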

Changed in neutron:
status: New → Confirmed
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
tags: added: ovn
Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/705295

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/705480

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/704530
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=647b7f63f9dafedfa9fb6e09e3d92d66fb512f0b
Submitter: Zuul
Branch: master

commit 647b7f63f9dafedfa9fb6e09e3d92d66fb512f0b
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Jan 28 10:46:35 2020 +0000

    [OVN] Add an interval between agents health checks

    This patch adds a minimum interval between agent health checks.

    OVN checks agent liveness by increasing a value in the NB DB and
    waiting for it to be propagated to the SB DB, but this can be
    costly if done many times too quickly. Therefore, a minimum interval
    between checks is being added.

    Closes-Bug: #1861092
    Change-Id: If1f2d97e3a3a17f6744d546b3e8903bde55e83b9
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
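
The idea of a minimum interval between checks can be sketched roughly as follows. This is not the code from the commit above; the class name `HealthCheckThrottle`, the method `should_ping`, and the 30-second value are assumptions chosen for illustration.

```python
import time


class HealthCheckThrottle:
    """Allow a health check only if enough time has passed since the last one."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock          # injectable so the sketch can be tested
        self._last_check = None      # time of the last check that was allowed

    def should_ping(self):
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._min_interval_s:
            return False             # too soon: skip bumping nb_cfg this time
        self._last_check = now
        return True


# Drive the throttle with a fake clock (hypothetical request times, in seconds).
fake_now = [0.0]
throttle = HealthCheckThrottle(min_interval_s=30, clock=lambda: fake_now[0])
results = []
for t in (0, 1, 28, 30, 37, 61):
    fake_now[0] = t
    results.append(throttle.should_ping())
print(results)
```

With a throttle like this, requests that arrive in quick succession collapse into a single nb_cfg bump per interval.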

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/705295
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2cd75e073aa7d949f8842db8d4adbc4d7d2c36ac
Submitter: Zuul
Branch: master

commit 2cd75e073aa7d949f8842db8d4adbc4d7d2c36ac
Author: Terry Wilson <email address hidden>
Date: Fri Jan 31 14:49:48 2020 -0600

    OVN Metadata agent gets OVSDB updates for only its Chassis

    The metadata agent registers the Chassis table with ovsdb-server
    and therefore gets database updates every time *any* Chassis is
    updated--even if the update is just a liveness check that updates
    nb_cfg.

    This patch adds a condition so that metadata agent only gets updates
    for the Chassis that it is running on.

    Change-Id: I452b7de09312ecea621c4b448cc63f037cad9675
    Related-bug: #1861092
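
The effect of such a condition can be pictured with a small stand-alone model. It does not use the real OVS IDL or ovsdbapp APIs; it only imitates a single OVSDB-style condition of the form [column, function, value], and the chassis names and function names are hypothetical.

```python
def matches(row, condition):
    """Evaluate one [column, function, value] condition against a row."""
    column, function, value = condition
    if function == "==":
        return row.get(column) == value
    if function == "!=":
        return row.get(column) != value
    raise ValueError("unsupported function: %s" % function)


def updates_for_client(updates, condition=None):
    """Return the row updates a client would receive.

    With no condition the client gets every update; with a condition it
    only gets the rows that match (a simplified model of a monitor condition).
    """
    if condition is None:
        return list(updates)
    return [row for row in updates if matches(row, condition)]


# One liveness check touches the Chassis row of every node (500 hypothetical nodes).
chassis_updates = [{"name": "compute-%d" % i, "nb_cfg": 49018} for i in range(500)]

unfiltered = updates_for_client(chassis_updates)
filtered = updates_for_client(chassis_updates, ("name", "==", "compute-7"))
print(len(unfiltered), "updates without a condition,", len(filtered), "with one")
```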

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/705480
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d340f6be570a59166bd262322a82a60d58e6aa61
Submitter: Zuul
Branch: master

commit d340f6be570a59166bd262322a82a60d58e6aa61
Author: Terry Wilson <email address hidden>
Date: Mon Feb 3 10:48:58 2020 -0600

    Add functional test for metadata agent monitoring

    This adds the functional test for https://review.opendev.org/#/c/705295/

    Change-Id: Ie2bd745ed3e9c2b0c618c2833bfbecf44c019965
    Related-bug: #1861092

tags: added: neutron-proactive-backport-potential