[OVN] infinite loop in ovsdb_monitor

Bug #1926838 reported by frigo
This bug affects 3 people
Affects    Status    Importance    Assigned to    Milestone
neutron    New       High          Unassigned

Bug Description

I am running the OVN sandbox, a second chassis, and Neutron. I synchronize the Neutron database with the databases of the sandbox, run neutron-server, and possibly run a few ovs-vsctl commands on the chassis to set up OVS ports.

I notice that some commands on the chassis can trigger some sort of infinite loop in Neutron. For example, running

    ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
    ovs-vsctl set open . external-ids:ovn-cms-options=xx
    ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw

on the second chassis triggers transactions "in a loop" on the neutron-server:

    ...
    Successfully bumped revision number for resource f32ac6cc (type: ports) to 571
    Router 079cde19-0b92-48f8-bef2-5e35b939a7a1 is bound to host sandbox
    Running txn n=1 command(idx=0): CheckRevisionNumberCommand
    Running txn n=1 command(idx=1): UpdateLRouterPortCommand
    Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand
    Successfully bumped revision number for resource f32ac6cc (type: router_ports) to 572
    Running txn n=1 command(idx=0): CheckRevisionNumberCommand
    Running txn n=1 command(idx=1): SetLSwitchPortCommand
    Running txn n=1 command(idx=2): PgDelPortCommand
    Successfully bumped revision number for resource f32ac6cc (type: ports) to 572
    Router 079cde19-0b92-48f8-bef2-5e35b939a7a1 is bound to host sandbox
    Running txn n=1 command(idx=0): CheckRevisionNumberCommand
    Running txn n=1 command(idx=1): UpdateLRouterPortCommand
    Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand
    Successfully bumped revision number for resource f32ac6cc (type: router_ports) to 573
    Running txn n=1 command(idx=0): CheckRevisionNumberCommand
    Running txn n=1 command(idx=1): SetLSwitchPortCommand
    Running txn n=1 command(idx=2): PgDelPortCommand
    ...

This is not limited to changes of external-ids:ovn-cms-options; other ovs-vsctl commands can trigger the same issue.

neutron-server CPU consumption jumps to 100% and the revision_number of the ports keeps increasing. Restarting neutron-server fixes the issue temporarily.

I am not sure how to provide a simple reproducer because I did not find any instructions for running Neutron standalone with two OVN chassis. I will investigate what is happening locally.

Version: main branch from OVN (d41a337fe3b608a8f90de8722d148344011f0bd8) and of Neutron (94d36862c207b1e4d984d28874ca2f3bd09c855f)

It's not a blocker as long as it happens only on my laptop.

Tags: ovn

frigo (rigault-francois) wrote :

It looks like PortBindingChassisEvent is feeding itself with events.

When a PortBindingChassisEvent is received, the revision_number is incremented, port bindings are updated (with the only change being the revision_number), which triggers a new PortBindingChassisEvent.
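
The loop described above can be illustrated with a small standalone simulation. This is only a sketch and not Neutron code: the `simulate` and `only_revision_changed` helpers, the `revision_number` key, and the dictionary-based row are all hypothetical stand-ins. It shows how a handler that writes back a row whose only change is the revision number keeps re-triggering itself, and how one possible guard (ignoring updates where nothing but the revision number changed) would let the queue drain. Whether such a guard is the right fix in Neutron is an assumption.

```python
from collections import deque


def only_revision_changed(old, new):
    """True if old and new rows are equal once the revision number is ignored."""
    def strip(row):
        return {k: v for k, v in row.items() if k != "revision_number"}
    return strip(old) == strip(new)


def simulate(ignore_revision_only_updates, max_events=20):
    """Return how many events the handler processes before the queue drains
    or the safety cap max_events is reached."""
    row = {"chassis": "sandbox", "revision_number": 571}
    # One initial event, standing in for a real change made on a chassis.
    queue = deque([(dict(row), dict(row, chassis="chassis-2"))])
    row["chassis"] = "chassis-2"
    handled = 0
    while queue and handled < max_events:
        old, new = queue.popleft()
        if ignore_revision_only_updates and only_revision_changed(old, new):
            continue  # hypothetical guard: drop self-inflicted updates
        handled += 1
        # Handler: bump the revision and write the row back.
        # The write itself is observed as a new update event.
        before = dict(row)
        row["revision_number"] += 1
        queue.append((before, dict(row)))
    return handled


if __name__ == "__main__":
    print("without guard:", simulate(False))
    print("with guard:", simulate(True))
```

Without the guard the simulation only stops because of the `max_events` cap; with the guard only the initial event is handled.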

When looking at the ddlog replay, the update of the revision_number in northd for logical_switch_port and logical_router_port actually deletes and recreates the keys in the northbound db, which causes deletion and recreation of multicast_group and port_binding in the southbound db (so even a change of only the revision_number has some cost).

Changed in neutron:
importance: Undecided → High
David Palacio (davplsm) wrote :

After upgrading to Xena, Neutron's CPU usage has increased considerably.
The logs start to show many "bumped revision number" events.

    2022-06-01 08:52:43.554 2004379 INFO neutron.db.ovn_revision_numbers_db [req-24c02bf9-4397-4d8b-b193-0720a22d23d8 - - - - -] Successfully bumped revision number for resource 8bfd6bcd-5aba-4139-a4ae-37c4e9700bb3 (type: ports) to 307
    ...
    2022-06-01 08:52:51.186 2004379 INFO neutron.db.ovn_revision_numbers_db [req-24c02bf9-4397-4d8b-b193-0720a22d23d8 - - - - -] Successfully bumped revision number for resource 8bfd6bcd-5aba-4139-a4ae-37c4e9700bb3 (type: router_ports) to 312
    2022-06-01 08:52:51.300 2004379 INFO neutron.db.ovn_revision_numbers_db [req-24c02bf9-4397-4d8b-b193-0720a22d23d8 - - - - -] Successfully bumped revision number for resource 8bfd6bcd-5aba-4139-a4ae-37c4e9700bb3 (type: ports) to 312
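
To check whether a deployment is affected, it may help to count how often each resource is bumped. The following is a rough sketch that relies only on the message format visible in the log lines above; the `count_bumps` helper and the log file path are hypothetical, and the regular expression is an assumption based on those lines.

```python
import re
from collections import Counter

# Matches the "bumped revision number" message as it appears in the logs above.
BUMP_RE = re.compile(
    r"Successfully bumped revision number for resource (\S+) "
    r"\(type: (\w+)\) to (\d+)"
)


def count_bumps(lines):
    """Count revision bumps per (resource id, resource type)."""
    counts = Counter()
    for line in lines:
        match = BUMP_RE.search(line)
        if match:
            resource, resource_type, _revision = match.groups()
            counts[(resource, resource_type)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python count_bumps.py /path/to/neutron-server.log
    with open(sys.argv[1]) as log_file:
        for key, count in count_bumps(log_file).most_common(10):
            print(count, *key)
```

A resource whose count keeps growing while nothing in the deployment changes would be a candidate for the loop described in this bug.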

Removing deployed items after upgrading to Xena causes the CPU load to drop.
