Pacemaker systemd service file retains default open files limits

Bug #1715924 reported by Justin Kilpatrick on 2017-09-08
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Medium
Unassigned

Bug Description

During a high load benchmark on a HA deployment of RDO Newton Ceilometer collector will hit a deadlock killing mysql on the primary controller, the primary will then be moved and the next mysql server will fail with the same error immediately. This goes on continuously effectively killing the cloud.

After exploring the issue further the cause (demonstrated with A/B testing) is that Pacemaker is the parent process for mysqld-safe and other database processes in a HA environment. Pacemaker is by default deployed without any
changes to the LimitNOFILE parameter. Meaning all Pacemaker spawned processes come with a open files limit of 1024.

If the database is pushed for a prolonged period this can result in a deadlock where a blocking operation is waiting on another operation which is blocked on
a lack of open file descriptor. Strangely this seems easier to reproduce in Newton than other versions, probably involving changes to Ceilometer but happens intermittently on all versions.

To reproduce run rally create-list-routers scenario

https://github.com/openstack/rally/blob/master/samples/tasks/scenarios/neutron/create-and-list-routers.json

continuously with a high enough concurrency that the majority of your system ram is used. Eventually you will hit this deadlock.

Logs

Ceilometer collector
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy [-] Failed to record event: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy Traceback (most recent call last):
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/panko/storage/impl_sqlalchemy.py", line 197, in record_events
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy trait_map[dtype])
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 502, in __exit__
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self.rollback()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy compat.reraise(exc_type, exc_value, exc_tb)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 499, in __exit__
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self.commit()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 396, in commit
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy t[1].commit()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1574, in commit
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self._do_commit()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1605, in _do_commit
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self.connection._commit_impl()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 690, in _commit_impl
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self._handle_dbapi_exception(e, None, None, None, None)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy util.raise_from_cause(newraise, exc_info)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy reraise(type(exception), exception, tb=exc_tb)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 688, in _commit_impl
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self.engine.dialect.do_commit(self.connection)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2514, in do_commit
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy dbapi_connection.commit()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 773, in commit
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy self._read_ok_packet()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 752, in _read_ok_packet
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy pkt = self._read_packet()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 991, in _read_packet
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy packet.check_error()
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy err.raise_mysql_exception(self._data)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy raise errorclass(errno, errval)
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy DBDeadlock: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
2017-08-25 08:03:05.200 66136 ERROR panko.storage.impl_sqlalchemy
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector [-] Dispatcher failed to handle the notification, re-queuing it.
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector Traceback (most recent call last):
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/ceilometer/collector.py", line 177, in sample
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector self.dispatcher_manager.map_method(self.method, goods)
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 282, in map_method
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector method_name, *args, **kwds)
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 255, in map
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector self._invoke_one_plugin(response.append, func, e, args, kwds)
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 286, in _invoke_one_plugin
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector response_callback(func(e, *args, **kwds))
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/stevedore/extension.py", line 260, in _call_extension_method
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector return getattr(extension.obj, method_name)(*args, **kwds)
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/panko/dispatcher/database.py", line 68, in record_events
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector self.conn.record_events(event_list)
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector File "/usr/lib/python2.7/site-packages/panko/storage/impl_sqlalchemy.py", line 206, in record_events
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector raise error
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector DBDeadlock: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction')
2017-08-25 08:03:05.206 66136 ERROR ceilometer.collector
2017-08-25 08:05:31.806 66136 ERROR oslo.messaging._drivers.impl_rabbit [-] [b8cf8a91-849c-40e2-9eb1-18f94f5bc96b] AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 53512
2017-08-25 08:05:31.850 66136 ERROR oslo.messaging._drivers.impl_rabbit [-] [ab124b2a-7429-4c0c-9b78-800f5568806e] AMQP server on overcloud-controller-1.internalapi.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds. Client port: 53508
2017-08-25 08:05:32.820 66136 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2017-08-25 08:05:32.820 66136 ERROR root Traceback (most recent call last):
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 250, in wrapper
2017-08-25 08:05:32.820 66136 ERROR root return infunc(*args, **kwargs)
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/base.py", line 302, in _runner
2017-08-25 08:05:32.820 66136 ERROR root batch_size=self.batch_size, batch_timeout=self.batch_timeout)
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/base.py", line 52, in wrapper
2017-08-25 08:05:32.820 66136 ERROR root msg = func(in_self, timeout=timeout_watch.leftover(True))
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 286, in poll
2017-08-25 08:05:32.820 66136 ERROR root self._message_operations_handler.process()
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 89, in process
2017-08-25 08:05:32.820 66136 ERROR root task()
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 253, in acknowledge
2017-08-25 08:05:32.820 66136 ERROR root self._raw_message.ack()
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
2017-08-25 08:05:32.820 66136 ERROR root self.channel.basic_ack(self.delivery_tag)
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1583, in basic_ack
2017-08-25 08:05:32.820 66136 ERROR root self._send_method((60, 80), args)
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
2017-08-25 08:05:32.820 66136 ERROR root self.channel_id, method_sig, args, content,
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
2017-08-25 08:05:32.820 66136 ERROR root write_frame(1, channel, payload)
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
2017-08-25 08:05:32.820 66136 ERROR root frame_type, channel, size, payload, 0xce,
2017-08-25 08:05:32.820 66136 ERROR root File "/usr/lib64/python2.7/socket.py", line 224, in meth
2017-08-25 08:05:32.820 66136 ERROR root return getattr(self._sock,name)(*args)
2017-08-25 08:05:32.820 66136 ERROR root error: [Errno 32] Broken pipe
2017-08-25 08:05:32.820 66136 ERROR root

affects: ceilometer → tripleo
summary: - Ceilometer collector deadlocks during heavy DB load
+ Pacemaker systemd service file retains default open files limits
description: updated
description: updated
Changed in tripleo:
milestone: none → queens-1
importance: Undecided → Medium
status: New → Triaged
description: updated
description: updated
description: updated
description: updated
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Damien Ciabrini (dciabrin) wrote :

I suspect the reported problem is not linked to the number of file descriptor available to the DB, nor to the configured LimitNOFILE for pacemaker.

When pacemaker start the galera resource, the galera resource agent explicitly runs mysqld_safe with an file descriptor override giving it 16384 fd, which is plenty enough to not cause fd starvation IMHO.

Maybe the original deadlock message is due to concurrent access to the DB, which is normally handled by retrying DB operation on the client side?

Anyway, I don't have any easy means to reproduce the issue right now, and I miss logs for further analysis.

If anything, I would suggest to retest against recent tripleo release, and close the bug if this doesn't reproduce.

Setting as incomplete due to both inactivity and the need for more info. Please update if you think this is relevant still.

Changed in tripleo:
status: Triaged → Incomplete
Changed in tripleo:
milestone: stein-3 → train-1
Changed in tripleo:
milestone: train-1 → train-2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers