Hello all,
I'm deploying Cyborg through kolla-ansible to manage NVIDIA pGPUs. It works great and is very useful to me. However, it seems that Cyborg does not handle its oslo_db/SQLAlchemy database interaction well.
Cyborg runs for a few hours while logging errors like this:
Exception during reset or similar: AssertionError: do not call blocking functions from the mainloop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool fairy._reset(pool)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 988, in _reset
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool pool._dialect.do_rollback(self)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool dbapi_connection.rollback()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 480, in rollback
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._read_ok_packet()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 443, in _read_ok_packet
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool pkt = self._read_packet()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 692, in _read_packet
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool packet_header = self._read_bytes(4)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 732, in _read_bytes
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool data = self._rfile.read(num_bytes)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/usr/lib/python3.10/socket.py", line 705, in readinto
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return self._sock.recv_into(b)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 376, in recv_into
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return self._recv_loop(self.fd.recv_into, 0, buffer, nbytes, flags)
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 364, in _recv_loop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._read_trampoline()
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 332, in _read_trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool self._trampoline(
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/greenio/base.py", line 211, in _trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool return trampoline(fd, read=read, write=write, timeout=timeout,
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/eventlet/hubs/__init__.py", line 141, in trampoline
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool AssertionError: do not call blocking functions from the mainloop
2024-04-12 10:28:56.470 7 ERROR sqlalchemy.pool.impl.QueuePool
Then, after a few hours or days, I get errors like the following and Cyborg stops working:
Error during AgentManager.update_available_resource: oslo_messaging.rpc.client.RemoteError: Remote error: TimeoutError QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
['Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/conductor/manager.py", line 117, in report_data\n old_driver_device_list = DriverDevice.list(context, hostname)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/driver_objects/driver_device.py", line 122, in list\n dev_obj_list = Device.get_list_by_hostname(context, host)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/device.py", line 93, in get_list_by_hostname\n device_obj_list = Device.list(context, dev_filter)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/objects/device.py", line 69, in list\n db_devices = cls.dbapi.device_list_by_filters(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/db/sqlalchemy/api.py", line 453, in device_list_by_filters\n return _paginate_query(context, models.Device, query_prefix,\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/db/sqlalchemy/api.py", line 118, in _paginate_query\n return query.all()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2772, in all\n return self._iter().all()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2907, in _iter\n result = self.session.execute(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1711, in execute\n conn = self._connection_for_bind(bind)\n', ' File 
"/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind\n return self._transaction._connection_for_bind(\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind\n conn = bind.connect()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3315, in connect\n return self._connection_cls(self, close_with_result=close_with_result)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__\n else engine.raw_connection()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection\n return self._wrap_pool_connect(self.pool.connect, _connection)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect\n return fn()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect\n return _ConnectionFairy._checkout(self)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout\n fairy = _ConnectionRecord.checkout(pool)\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 476, in checkout\n rec = pool._do_get()\n', ' File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get\n raise exc.TimeoutError(\n', 'sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)\n'].
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task Traceback (most recent call last):
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task task(self, context)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/agent/manager.py", line 82, in update_available_resource
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task self._rt.update_usage(context)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py", line 414, in inner
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return f(*args, **kwargs)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/agent/resource_tracker.py", line 76, in update_usage
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task self.conductor_api.report_data(context, self.host, acc_list)
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/cyborg/conductor/rpcapi.py", line 55, in report_data
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task cctxt.call(context, 'report_data', hostname=hostname,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/client.py", line 189, in call
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task result = self.transport._send(
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 123, in _send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return self._driver.send(target, ctxt, message,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout,
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task raise result
2024-04-11 09:59:15.976 7 ERROR oslo_service.periodic_task oslo_messaging.rpc.client.RemoteError: Remote error: TimeoutError QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
To get it working again, I changed the [database] config to max_overflow = 100 and max_pool_size = 5, like this:
[database]
connection = mysql+pymysql://cyborg:password@116.81.196.14:3306/cyborg
connection_recycle_time = 10
max_pool_size = 5
max_overflow = 100
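For reference, here is a rough sketch of what these two settings mean for the connection ceiling. It assumes that oslo.db hands max_pool_size and max_overflow to SQLAlchemy's QueuePool as pool_size and max_overflow, so that one engine can hold at most pool_size + max_overflow connections at a time. The helper name connection_ceiling is only for illustration, and the ceiling would apply per engine, so presumably per Cyborg worker process.

```python
import configparser

# The [database] section exactly as posted above.
CONF_TEXT = """
[database]
connection = mysql+pymysql://cyborg:password@116.81.196.14:3306/cyborg
connection_recycle_time = 10
max_pool_size = 5
max_overflow = 100
"""


def connection_ceiling(conf_text, section="database"):
    """Return pool_size + max_overflow for one engine (hypothetical helper).

    Assumption: QueuePool checks out at most pool_size + max_overflow
    connections before a caller has to wait for the pool timeout.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    db = parser[section]
    return db.getint("max_pool_size") + db.getint("max_overflow")


if __name__ == "__main__":
    # Values taken from the error message: "size 1 overflow 50".
    print("before the change:", 1 + 50)
    print("after the change: ", connection_ceiling(CONF_TEXT))
```

If that assumption holds, the change only raises the ceiling from 51 to 105 connections; if connections are never returned to the pool, the same timeout would presumably just appear later.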
With these settings it works for the moment, but I still get the first error, plus another one:
Exception during reset or similar: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool fairy._reset(pool)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 988, in _reset
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pool._dialect.do_rollback(self)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool dbapi_connection.rollback()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 480, in rollback
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool self._read_ok_packet()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 443, in _read_ok_packet
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pkt = self._read_packet()
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 692, in _read_packet
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool packet_header = self._read_bytes(4)
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymysql/connections.py", line 748, in _read_bytes
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool raise err.OperationalError(
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2024-04-12 11:14:27.533 6 ERROR sqlalchemy.pool.impl.QueuePool
Thank you and have a great day
My versions are: kolla==15.5.0, kolla-ansible==15.5.0 (Zed)
The OS is Ubuntu Jammy