Comment 0 for bug 1715924

Justin Kilpatrick (jkilpatr) wrote : Ceilometer collector deadlocks during heavy DB load

During a high-load benchmark on an HA deployment of RDO Newton, the Ceilometer collector hits a deadlock that kills MySQL on the primary controller. The primary is then moved, and the next MySQL server immediately fails with the same error. This repeats continuously, effectively killing the cloud.

This issue is observed more rarely in Ocata and Pike, but it is 100% reproducible in Newton by running several hundred Neutron create operations back to back. Before this bug was first observed, the exact same set of operations completed slowly but without issue.

This was first observed after Neutron L3 HA was introduced as the default, although no concrete link between that change and this issue has been found. Current theory: L3 HA puts more load on the database, the database hits its open file limit, hitting that limit causes irresolvable locks to be acquired, and when Pacemaker tries to fail over, the same actions are played back and the problem repeats.
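
One way to check the open file limit part of this theory is to compare the number of file descriptors the database process holds against its limit while the benchmark runs. The following is only a rough sketch, assuming a Linux controller where /proc is readable for the database process; the function names are hypothetical and nothing here comes from Ceilometer or MariaDB itself.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    A limit reported as 'unlimited' is returned as None.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]

            def to_int(value):
                return None if value == "unlimited" else int(value)

            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a running process (Linux only).

    Reading another user's /proc/<pid>/fd usually requires root.
    """
    with open(f"/proc/{pid}/limits") as f:
        soft, _hard = parse_max_open_files(f.read())
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft


# Hypothetical usage: pass the PID of the database server process, e.g.
#   used, soft = fd_usage(12345)
#   print(f"open fds: {used}, soft limit: {soft}")
```

If the open descriptor count climbs to the soft limit shortly before the deadlock appears, that would support the theory; if it stays well below the limit, the cause is probably elsewhere.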

Collector Log pastebin

https://paste.fedoraproject.org/paste/e-gHTVEqB-gnQKMCAqtt1Q

Full log (warning: large)

https://thirdparty.logs.rdoproject.org/jenkins-browbeat-quickstart-ocata-baremetal-mixed-20/overcloud-controller-0/var/log/ceilometer/collector.log.txt.gz

MariaDB log pastebin

https://paste.fedoraproject.org/paste/a077ZTXXma0kGarbNZ~Aug