OVS IDL does not monitor tables after reconnect

Bug #1988039 reported by ZhouHeng
This bug affects 1 person
Affects    Status       Importance  Assigned to  Milestone
neutron    In Progress  Low         Unassigned
ovsdbapp   In Progress  Undecided   Unassigned

Bug Description

I came across a strange phenomenon. After I restart a node (which runs the nb, sb and mariadb services) and wait for a while, "openstack network agent list" sometimes shows all agents as down.

The logs show that all agents are reported as down only when the request is served by certain processes.

A packet capture shows that sb does not notify such a process of changes to the chassis_private table; it only notifies it of changes to the Database table.

The log of the affected process shows that an exception was printed [1] the last time the process connected to sb.

The relevant code is the following [2]:

def run(self):
    errors = 0
    while self.is_running:
        # If we fail in an Idl call, we could have missed an update
        # from the server, leaving us out of sync with ovsdb-server.
        # It is not safe to continue without restarting the connection.
        # Though it is likely that the error is unrecoverable, keep trying
        # indefinitely just in case.
        try:
            self.idl.wait(self.poller)
            self.poller.fd_wait(self.txns.alert_fileno, poller.POLLIN)
            self.poller.block()
            with self.lock:
                self.idl.run()  # point-1
        except Exception as e:
            # This shouldn't happen, but is possible if there is a bug
            # in python-ovs
            errors += 1
            LOG.exception(e)
            with self.lock:
                self.idl.force_reconnect()  # point-2
                try:
                    idlutils.wait_for_change(self.idl, self.timeout)  # point-3
                except Exception as e:
                    # This could throw the same exception as idl.run()
                    # or Exception("timeout"), either way continue
                    LOG.exception(e)
            sleep = min(2 ** errors, 60)
            LOG.info("Trying to recover, sleeping %s seconds", sleep)

If an exception occurs while a notification is being processed (for example, "Unable to connect to the database"), it is raised from point-1. The thread then forces a reconnect (point-2). At point-3 the IDL sends the server monitor request (send_server_monitor) and handles the resulting change to the Database table. If the database still cannot be reached at that moment, the same exception occurs again [3].
As a result, the step that should follow (send_monitor) is never performed, so the process no longer receives updates for the other tables.
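
The sequence described above can be illustrated with a small toy model. This is only a sketch based on the description in this report, not the real python-ovs or ovsdbapp code: FakeIdl, make_flaky_notify and recover are hypothetical names, and the handshake is reduced to three steps.

```python
class FakeIdl:
    """Toy model of the reconnect handshake (hypothetical, not ovs.db.idl.Idl)."""

    def __init__(self, notify):
        self.notify = notify      # callback invoked for every table update
        self.monitored = set()    # tables the server will send updates for

    def reconnect(self):
        # A fresh session starts without any monitors.
        self.monitored = set()
        # Step 1 (send_server_monitor): watch the server's Database table.
        self.monitored.add("Database")
        # Step 2: handle the initial Database update; this calls notify().
        self.notify("Database")
        # Step 3 (send_monitor): only reached if step 2 did not raise.
        self.monitored.add("Chassis_Private")


def make_flaky_notify(failures):
    """Return a notify callback that raises for the first `failures` calls."""
    remaining = {"n": failures}

    def notify(table):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("Unable to connect to the database")

    return notify


def recover(idl):
    """Mimic the except branch: reconnect, then log and ignore a new failure."""
    try:
        idl.reconnect()
    except Exception as exc:
        print("ignored during recovery:", exc)
    return idl.monitored


if __name__ == "__main__":
    # Database still unreachable during recovery: step 3 is skipped.
    print(sorted(recover(FakeIdl(make_flaky_notify(1)))))
    # Database reachable again: both tables are monitored.
    print(sorted(recover(FakeIdl(make_flaky_notify(0)))))
```

In this model, a single failure inside the notification callback during recovery leaves only the Database table monitored, which matches what the packet capture showed.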

[1] https://opendev.org/openstack/ovsdbapp/src/commit/96cf8d6288587423e65d5149016e07fb51430724/ovsdbapp/backend/ovs_idl/connection.py#L121
[2] https://opendev.org/openstack/ovsdbapp/src/commit/96cf8d6288587423e65d5149016e07fb51430724/ovsdbapp/backend/ovs_idl/connection.py#L95-L123

[3] https://opendev.org/openstack/neutron/src/commit/7dfe41ab8f9ecf6266c7a51c0223ff8f8822c16f/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L719

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ovsdbapp (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/854963

Changed in ovsdbapp:
status: New → In Progress

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ovsdbapp (master)

Change abandoned by "ZhouHeng <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/ovsdbapp/+/854963
Reason: We will no longer handle the thrown exception: https://review.opendev.org/c/openstack/ovsdbapp/+/862524

Changed in neutron:
status: New → In Progress

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :
Changed in neutron:
importance: Undecided → Low