cyborg database sqlalchemy error

Bug #2061130 reported by Théo Galera
This bug affects 1 person
Affects: Cyborg (OpenStack)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello all,

I'm deploying Cyborg through kolla-ansible to manage NVIDIA pGPUs. It works well and is very useful to me. However, Cyborg does not seem to handle the oslo_db/SQLAlchemy database interaction well.
Cyborg runs for a few hours while logging errors like this:

Exception during reset or similar: AssertionError: do not call blocking functions from the mainloop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool fairy._reset(pool)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 988, in _reset
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool pool._dialect.do_rollback(self)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool dbapi_connection.rollback()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 480, in rollback
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._read_ok_packet()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 443, in _read_ok_packet
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool pkt = self._read_packet()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 692, in _read_packet
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool packet_header = self._read_bytes(4)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 732, in _read_bytes
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool data = self._rfile.read(num_bytes)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/usr/lib/python3.10/socket.py", line 705, in readinto
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return self._sock.recv_into(b)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 376, in recv_into
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return self._recv_loop(self.fd.recv_into, 0, buffer, nbytes, flags)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 364, in _recv_loop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._read_trampoline()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 332, in _read_trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._trampoline(
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 211, in _trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return trampoline(fd, read=read, write=write, timeout=timeout,
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 141, in trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool AssertionError: do not call blocking functions from the mainloop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool

After a few hours or days, I get errors like the following and Cyborg stops working:

Error during AgentManager.update_available_resource: oslo_messaging.rpc.client.RemoteError: Remote error: TimeoutError QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
['Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/conductor/manager.py", line 117, in report_data\n old_driver_device_list = DriverDevice.list(context, hostname)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/driver_objects/driver_device.py", line 122, in list\n dev_obj_list = Device.get_list_by_hostname(context, host)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/device.py", line 93, in get_list_by_hostname\n device_obj_list = Device.list(context, dev_filter)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/device.py", line 69, in list\n db_devices = cls.dbapi.device_list_by_filters(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/db/sqlalchemy/api.py", line 453, in device_list_by_filters\n return _paginate_query(context, models.Device, query_prefix,\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/db/sqlalchemy/api.py", line 118, in _paginate_query\n return query.all()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2772, in all\n return self._iter().all()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2907, in _iter\n result = self.session.execute(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1711, in execute\n conn = self._connection_for_bind(bind)\n', ' File 
"/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind\n return self._transaction._connection_for_bind(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind\n conn = bind.connect()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3315, in connect\n return self._connection_cls(self, close_with_result=close_with_result)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__\n else engine.raw_connection()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection\n return self._wrap_pool_connect(self.pool.connect, _connection)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect\n return fn()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect\n return _ConnectionFairy._checkout(self)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout\n fairy = _ConnectionRecord.checkout(pool)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout\n rec = pool._do_get()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get\n raise exc.TimeoutError(\n', 'sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)\n'].
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task task(self, context)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/agent/manager.py", line 82, in update_available_resource
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task self._rt.update_usage(context)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py", line 414, in inner
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return f(*args, **kwargs)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/agent/resource_tracker.py", line 76, in update_usage
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task self.conductor_api.report_data(context, self.host, acc_list)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/conductor/rpcapi.py", line 55, in report_data
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task cctxt.call(context, 'report_data', hostname=hostname,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/client.py", line 189, in call
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task result = self.transport._send(
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 123, in _send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return self._driver.send(target, ctxt, message,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task raise result
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task oslo_messaging.rpc.client.RemoteError: Remote error: TimeoutError QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
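
For readers of this report: in SQLAlchemy's QueuePool, "size" is the number of connections kept open and "overflow" is the number of extra connections allowed on top of that, so the message above means all 1 + 50 = 51 connections were checked out and a further checkout waited 30 seconds before failing. A minimal sketch that pulls those figures out of a log line follows; the helper name `parse_queuepool_timeout` is hypothetical and is not part of SQLAlchemy or Cyborg.

```python
import re

# Matches the wording of the QueuePool TimeoutError message quoted in this report.
_PATTERN = re.compile(
    r"QueuePool limit of size (?P<size>\d+) overflow (?P<overflow>\d+) reached, "
    r"connection timed out, timeout (?P<timeout>[\d.]+)"
)


def parse_queuepool_timeout(line):
    """Return the pool figures found in a log line, or None if it does not match.

    Hypothetical helper for illustration only.
    """
    match = _PATTERN.search(line)
    if match is None:
        return None
    size = int(match.group("size"))
    overflow = int(match.group("overflow"))
    return {
        "pool_size": size,
        "max_overflow": overflow,
        # Upper bound on connections this pool can hold at once.
        "max_connections": size + overflow,
        "timeout": float(match.group("timeout")),
    }


LOG_LINE = (
    "Remote error: TimeoutError QueuePool limit of size 1 overflow 50 reached, "
    "connection timed out, timeout 30.00"
)
info = parse_queuepool_timeout(LOG_LINE)
```

Running this over the conductor log could help show how often the pool is exhausted, assuming the message format stays the same across SQLAlchemy versions.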

To make it work, I changed the [database] config to set max_overflow = 100 and max_pool_size = 5, like this:
[database]
connection = mysql+pymysql://cyborg:password@116.81.196.14:3306/cyborg
connection_recycle_time = 10
max_pool_size = 5
max_overflow = 100
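
As a rough check of what these values allow: each process using this configuration can hold up to max_pool_size + max_overflow connections, which may be worth comparing against the MySQL server's max_connections limit. The sketch below computes that ceiling from the section above. It parses the text with Python's standard `configparser` as an approximation of oslo.config, and the function name `connection_ceiling` is hypothetical.

```python
import configparser

# The [database] section from this report, inlined so the example is self-contained.
CONF_TEXT = """
[database]
connection = mysql+pymysql://cyborg:password@116.81.196.14:3306/cyborg
connection_recycle_time = 10
max_pool_size = 5
max_overflow = 100
"""


def connection_ceiling(conf_text):
    """Return max_pool_size + max_overflow from a [database] section.

    Hypothetical helper; both options must be present in the text.
    """
    # interpolation=None so a '%' in the connection URL is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    db = parser["database"]
    return db.getint("max_pool_size") + db.getint("max_overflow")


print(connection_ceiling(CONF_TEXT))  # prints 105
```

Raising the ceiling from 51 to 105 only delays exhaustion if connections are never returned to the pool, which would be consistent with the first error still appearing.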

It works for the moment, but I still get the first error, plus another one:

Exception during reset or similar: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool fairy._reset(pool)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 988, in _reset
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pool._dialect.do_rollback(self)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool dbapi_connection.rollback()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 480, in rollback
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool self._read_ok_packet()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 443, in _read_ok_packet
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pkt = self._read_packet()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 692, in _read_packet
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool packet_header = self._read_bytes(4)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 748, in _read_bytes
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool raise err.OperationalError(
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool

Thank you and have a great day

My versions are: kolla==15.5.0, kolla-ansible==15.5.0 (Zed)
The OS is Ubuntu Jammy
