Call the api and do not return for a long time

Bug #2008858 reported by ZhouHeng
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Incomplete
Medium
ZhouHeng

Bug Description

I have an environment for using ovn.

I found that sometimes when calling the api, it can return quickly, and sometimes it never returns.
This phenomenon also occurs when calling the same interface.
There is no error when the interface does not return.

After tracing the log, we found that some of the neutral-server processes reported errors when they were just started
from log, I found:
when we call post_fork_initialization and add_node to db failed.
After analyzing the code, we can see that the error process did not connect to ovn-nb/sb and not set _post_fork_event.
However, the api process does not exit, and can receive api requests. When processing requests requires access to ovn-nb, it will always wait.
It is also difficult to troubleshoot problems. It can throw a wait timeout(=cfg.CONF.ovn.ovsdb_connection_timeout) to help troubleshoot problems.

error log:

2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines [None req-2537f423-6d58-48ef-995f-b6f1e1bac8d9 - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines context)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 170, in execute
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 328, in _query
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 516, in query
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 727, in _read_query_result
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines result.read()
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 1066, in read
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 656, in _read_packet
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 702, in _read_bytes
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines The above exception was the direct cause of the following exception:
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 880, in scalar
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines return self.execute(object, *multiparams, **params).scalar()
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 948, in execute
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines return connection._execute_clauseelement(self, multiparams, params)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines compiled_sql, distilled_params
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines context)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines util.raise_from_cause(newraise, exc_info)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines reraise(type(exception), exception, tb=exc_tb, cause=cause)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 248, in reraise
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines raise value.with_traceback(tb)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines context)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 170, in execute
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 328, in _query
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 516, in query
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 727, in _read_query_result
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines result.read()
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 1066, in read
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 656, in _read_packet
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 702, in _read_bytes
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2023-02-18 03:40:01.845 14 ERROR oslo_db.sqlalchemy.engines

Tags: ovn
Changed in neutron:
status: New → In Progress
ZhouHeng (zhouhenglc)
Changed in neutron:
assignee: nobody → ZhouHeng (zhouhenglc)
Revision history for this message
yatin (yatinkarel) wrote :

@ZhouHeng in which release you seeing this? i recall similar issue and was handled with https://review.opendev.org/q/I5f703d82175d71a222c76df37a82b5ccad890d14, you have that patch included?

Revision history for this message
ZhouHeng (zhouhenglc) wrote (last edit ):
tags: added: ovn
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Am I understand right that the issue is at the process startup phase when neutron starts the workers?
But yes, a log msg with warning for example would be helpful

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Related patch: https://review.opendev.org/c/openstack/neutron/+/874641

Hello ZhouHeng:

I don't see the relationship between the Neutron DB and the OVN connections. If the problem you are describing happens when setting up the hash ring [1], as you described, what you need to fix is the connectivity with the DB. These kind of errors should never happen during the server initialization.

The patch you are proposing is not fixing that and at the same time breaking the OVN API. If you timeout while waiting for "_post_fork_event", you'll probably access to the OVN without the previous steps needed before releasing "_post_fork_event" event.

Please provide more information about the DB errors given in the description and why are they happening.

Regards.

[1]https://opendev.org/openstack/neutron/src/commit/2718edf76ee5835adfe9aa99d6fbb756a256d801/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L336

Changed in neutron:
status: In Progress → Incomplete
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
ZhouHeng (zhouhenglc) wrote :

hi Rodolfo, DB error is simply unable to connect to the database. The database service and the neutral-server are started at the same time. At this time, the database service is not ready.
How do you suggest to repair it? add retry_db_errors for _setup_hash_ring[1].

[1] 1https://opendev.org/openstack/neutron/src/commit/2718edf76ee5835adfe9aa99d6fbb756a256d801/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L302

Revision history for this message
ZhouHeng (zhouhenglc) wrote :

@Rodolfo, Although adding a timeout to wait for ovnnb/sb cannot solve this problem, I still think it is unreasonable to wait for ovnnb without setting a timeout. If waiting occurs in some cases, it will be difficult to troubleshoot the problem.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

From the perspective of a user, I would expect Neutron Server to return a HTTP error (503, Service Service Unavailable?) when its backend (mariadb or ovn) necessary for a request is offline.

@Zhou, When you say that neutron never returns an answer, does it mean that neutron-server silently drops the connection? Or does it return an error code?

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Also from purely user perspective: I don't think neutron-server should avoid replying to api requests until ovn database connectivity is established. The ml2 driver should be ready for the ovn database to be offline at any point in time - at startup or later in the runtime. It may be reasonable for neutron-server to handle offline database on each request, by e.g. replying with an error (4xx or 5xx), not to hold all API requests hostage. Particular API endpoints that depend on ovn database availability could be tagged with a decorator that would require the connectivity, and if not, raise an appropriate error code. The rest of API requests that don't have such dependency should not be affected by the failure.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.