[sqlite][metal3]get_nodeinfo_list can cause lock conflicts

Bug #2027405 reported by Julia Kreger
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
High
Julia Kreger

Bug Description

While observing some logs where a job failed, I noticed something suspicious:

2023-07-11 15:46:28.030 1 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.drivers.modules.pxe_base.PXEBaseMixin._check_boot_timeouts' _process_scheduled /usr/lib/python3.9/site-packages/futurist/periodics.py:638
2023-07-11 15:46:28.031 1 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.drivers.modules.pxe_base.PXEBaseMixin._check_boot_timeouts' _process_scheduled /usr/lib/python3.9/site-packages/futurist/periodics.py:638
2023-07-11 15:46:28.031 1 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.drivers.modules.pxe_base.PXEBaseMixin._check_boot_timeouts' _process_scheduled /usr/lib/python3.9/site-packages/futurist/periodics.py:638
2023-07-11 15:46:28.031 1 DEBUG futurist.periodics [-] Submitting periodic callback 'ironic.drivers.modules.pxe_base.PXEBaseMixin._check_boot_timeouts' _process_scheduled /usr/lib/python3.9/site-packages/futurist/periodics.py:638

These first few, not so much, one read for each PXE interface loaded out of the base mixin.

It is a lot, but not horrible. Except... they are a bunch of *READ TRASACTIONS*!

So then we get this:

2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager [-] Error while heartbeating. Error: (sqlite3.OperationalError) database is locked
(Background on this error at: https://sqlalche.me/e/14/e3q8): sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
(Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager Traceback (most recent call last):
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1089, in _commit_impl
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.engine.dialect.do_commit(self.connection)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 686, in do_commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager dbapi_connection.commit()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager sqlite3.OperationalError: database is locked
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager The above exception was the direct cause of the following exception:
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager Traceback (most recent call last):
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/ironic/conductor/base_manager.py", line 445, in _conductor_service_record_keepalive
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.conductor.touch()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/ironic/objects/conductor.py", line 116, in touch
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.dbapi.touch_conductor(self.hostname)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/api.py", line 154, in wrapper
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager ectxt.value = e.inner_exc
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.force_reraise()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager raise self.value
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/api.py", line 142, in wrapper
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager return f(*args, **kwargs)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/ironic/db/sqlalchemy/api.py", line 1359, in touch_conductor
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager count = res.rowcount
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/contextlib.py", line 126, in __exit__
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager next(self.gen)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1094, in _transaction_scope
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager yield resource
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/contextlib.py", line 126, in __exit__
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager next(self.gen)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 696, in _session
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.session.rollback()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.force_reraise()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager raise self.value
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 693, in _session
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self._end_session_transaction(self.session)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 721, in _end_session_transaction
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager session.commit()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 1451, in commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self._transaction.commit(_to_root=self.future)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 836, in commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager trans.commit()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 2459, in commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self._do_commit()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 2649, in _do_commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self._connection_commit_impl()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 2620, in _connection_commit_impl
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.connection._commit_impl()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1091, in _commit_impl
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self._handle_dbapi_exception(e, None, None, None, None)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 2122, in _handle_dbapi_exception
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager util.raise_(newraise, with_traceback=exc_info[2], from_=e)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager raise exception
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/base.py", line 1089, in _commit_impl
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager self.engine.dialect.do_commit(self.connection)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager File "/usr/lib64/python3.9/site-packages/sqlalchemy/engine/default.py", line 686, in do_commit
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager dbapi_connection.commit()
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager (Background on this error at: https://sqlalche.me/e/14/e3q8)
2023-07-11 15:46:28.032 1 ERROR ironic.conductor.base_manager 

And this is where things went sideways!

And then things went back to normal!

2023-07-11 15:46:28.088 1 INFO eventlet.wsgi.server [None req-59a45775-c3df-4cb7-a13c-7cc123d0b941 - - - - - -] 172.18.0.2,127.0.0.1 "GET /v1/drivers HTTP/1.1" status: 200 len: 3930 time: 0.0517182
2023-07-11 15:46:28.103 1 INFO eventlet.wsgi.server [None req-23de21f1-ffd2-4235-8c35-d815d6951941 - - - - - -] 172.18.0.2,127.0.0.1 "GET /v1/drivers HTTP/1.1" status: 200 len: 3930 time: 0.0429909

But this casued me tolook at get_nodeinfo_list, which is called by base_manager's iter_nodes, which is called by *every* periodic, and the default result set being handed back was a row, with a SQLAlchemy Engine row result, which could be interpretted as a tuple.

Except, we yield by default *between* calls to check the state!

So in other words:

1) We would open a new read transaction to get the list of nodes.
2) That ransaction would remain bound to a _paginate_query invocation.
3) The result set we would get back from our query would have any number of rows, each forcing the transaction to stay open.
4) That data gets handed off to iter_nodes, who *then* yielded between each result to the caller. At this point imagine I have two rows.
5) Iter nodes yields a row at a time to the caller.
6) The caller, say a periodic, does the needful.
7) Before going back iter_nodes, the periodic would issue an eventlet.sleep(0) allowing any other threads to pickup control for now.
8) Eventually, we get back to iter_nodes, get yielded the result.

All along, the transaction is still open.

kaboom.

Anyway, relatively simple fix, just some refactoring required.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 22.1.0

This issue was fixed in the openstack/ironic 22.1.0 release.

Changed in ironic:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.