Ceilometer

Ceilometer GET events throws deadlock or timeout errors

Bug #1506717 reported by Divya K Konoor on 2015-10-16

This bug affects 1 person

Affects      Status         Importance   Assigned to    Milestone
Ceilometer   Fix Released   Undecided    gordon chung   Ceilometer mitaka-1 "m1"

Bug Description

As part of https://review.openstack.org/#/c/208107/, the database isolation level was set to REPEATABLE READ for all databases other than SQLite. Since that change, we see deadlock problems when calling the GET /events API. This does not happen every time, but it is not uncommon, and it usually occurs when concurrent calls are made to Ceilometer.

According to the documentation on REPEATABLE READ (RR), locks taken by a unit of work are held until that unit of work completes. Depending on the database, this can lock an entire table and prevent other calls from going through.

This behavior seems to be associated with RR, and we should consider moving away from it. Other than Ceilometer, I don't see any other service setting an isolation level explicitly. The deadlock or timeout can occur with any underlying database, by virtue of what the isolation level is intended to do.
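
For readers unfamiliar with the change being discussed, the following is a minimal sketch of the kind of logic involved: pick REPEATABLE READ for every backend except SQLite. The function name `engine_options` and the example URLs are hypothetical and are not taken from the Ceilometer source. The only assumption about SQLAlchemy is that `create_engine` accepts an `isolation_level` keyword; the sketch does not call it, so it runs without a database.

```python
def engine_options(url):
    """Return hypothetical engine keyword arguments for a database URL."""
    # Take the part before the first ":" and drop any driver suffix,
    # so "mysql+pymysql://..." is treated as the "mysql" backend.
    backend = url.split(":", 1)[0].split("+", 1)[0]
    options = {}
    if backend != "sqlite":
        # The behaviour described in this report: every backend other
        # than SQLite is pinned to REPEATABLE READ.
        options["isolation_level"] = "REPEATABLE READ"
    return options


# Illustrative URLs only; credentials and hosts are made up.
print(engine_options("sqlite:///events.db"))
print(engine_options("mysql+pymysql://user:secret@dbhost/ceilometer"))
```

The returned dictionary could then be passed as keyword arguments when the engine is created. Removing the `isolation_level` entry leaves each backend on its own default isolation level.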

http://www.medtronicfeatures.com/wcm/help/admin/troubleshooting/wwhelp/wwhimpl/common/html/wwhelp.htm?context=troubleshooting&file=page_5_35.htm

https://msdn.microsoft.com/en-us/library/ms675307%28v=vs.85%29.aspx

http://dba.fyicenter.com/faq/mysql/Lock-Timeout-and-Deadlock.html

"If you are using transactions with REPEATABLE READ isolation level and transaction safe storage engines in your applications, data locks, lock timeouts, and dead lock detection will impact your application in a concurrent multi-user environment like Web sites in several ways"

[DB2/LINUXPPC64] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "2". SQLSTATE=40001 SQLCODE=-911". Detail:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 84, in callfunction
    result = f(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/ceilometer/api/controllers/v2/events.py", line 278, in get_all
    limit)]
  File "/usr/lib/python2.7/site-packages/ceilometer/event/storage/impl_sqlalchemy.py", line 290, in get_events
    models.EventType.desc, models.Event.raw).all():
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all
    return list(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 86, in instances
    util.raise_from_cause(err)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 67, in instances
    fetch = cursor.fetchall()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 968, in fetchall
    self.cursor, self.context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
    util.raise_from_cause(newraise, exc_info)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 962, in fetchall
    l = self.process_rows(self._fetchall_impl())
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 913, in _fetchall_impl
    return self.cursor.fetchall()
  File "/usr/lib64/python2.7/site-packages/ibm_db_dbi.py", line 1460, in fetchall
    return self._fetch_helper()
  File "/usr/lib64/python2.7/site-packages/ibm_db_dbi.py", line 1417, in _fetch_helper
    raise self.messages[len(self.messages) - 1]

DBDeadlock: (ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: Fetch Failure: [IBM][CLI Driver][DB2/LINUXPPC64] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "2". SQLSTATE=40001 SQLCODE=-911

Revision history for this message

gordon chung (chungg) wrote on 2015-10-19:

i think it's safe to drop the isolation level... but this needs to be verified.

OpenStack Infra (hudson-openstack) wrote on 2015-10-21: Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/238038

Changed in ceilometer:
assignee:	nobody → gordon chung (chungg)
status:	New → In Progress

gordon chung (chungg) wrote on 2015-10-21:

http://eavesdrop.openstack.org/irclogs/%23openstack-ceilometer/%23openstack-ceilometer.2015-10-21.log.html#t2015-10-21T12:07:30

OpenStack Infra (hudson-openstack) wrote on 2015-10-23: Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/238038
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=898cd3d036c4358aa16f7b3e2028365dc9829213
Submitter: Jenkins
Branch: master

commit 898cd3d036c4358aa16f7b3e2028365dc9829213
Author: gordon chung <email address hidden>
Date: Wed Oct 21 08:44:58 2015 -0400

    avoid using isolation level

    depending on sql driver, REPEATABLE READ isolation level may lock
    an entire table and cause write timeouts. isolation level was set
    originally to ensure consistent reads between 2 queries required to
    build events. that said, we can avoid table locks by making
    assumption that the 1st query is the correct base and any difference
    given by 2nd query can be discarded.

    Change-Id: Ic53e1addf38a4d5934b4e627c4c974c6db42517e
    Closes-Bug: #1506717
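
The approach in the commit message can be illustrated with a small sketch. It is not the Ceilometer implementation: the `event` and `trait` tables, the `get_events` function, and the `between_queries` hook are all hypothetical, and the standard-library `sqlite3` module stands in for the real backend. The idea shown is that the first query defines the set of events, and any rows from the second query that do not match that set are dropped rather than relying on REPEATABLE READ to keep the two reads consistent.

```python
import sqlite3

# Hypothetical two-table layout: events, plus traits that reference them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, message_id TEXT);
CREATE TABLE trait (event_id INTEGER, key TEXT, value TEXT);
INSERT INTO event VALUES (1, 'msg-1');
INSERT INTO trait VALUES (1, 'state', 'active');
""")


def get_events(conn, between_queries=None):
    """Build events from two queries without a special isolation level."""
    # 1st query: the set of events. This is treated as the correct base.
    events = {
        row[0]: {"message_id": row[1], "traits": {}}
        for row in conn.execute("SELECT id, message_id FROM event ORDER BY id")
    }
    if between_queries is not None:
        # Test hook that simulates a write landing between the two reads.
        between_queries(conn)
    # 2nd query: traits. Rows for events the 1st query did not return
    # are discarded instead of being attached to anything.
    for event_id, key, value in conn.execute(
            "SELECT event_id, key, value FROM trait"):
        event = events.get(event_id)
        if event is None:
            continue
        event["traits"][key] = value
    return list(events.values())


def late_write(conn):
    # A new event and its trait arrive after the 1st query has run.
    conn.execute("INSERT INTO event VALUES (2, 'msg-2')")
    conn.execute("INSERT INTO trait VALUES (2, 'state', 'deleted')")


result = get_events(conn, between_queries=late_write)
print(result)
```

The event written between the two reads is left out of this result and is picked up by the next call, which is the trade-off the commit message accepts in exchange for avoiding table locks.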

Changed in ceilometer:
status:	In Progress → Fix Committed

Divya K Konoor (dikonoor) wrote on 2015-10-26:

Gordon, I believe this fix should be backported to Liberty.

OpenStack Infra (hudson-openstack) wrote on 2015-10-29: Fix proposed to ceilometer (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/240207

Divya K Konoor (dikonoor) wrote on 2015-10-29:

I have cherry-picked and backported the fix to stable/liberty.

OpenStack Infra (hudson-openstack) wrote on 2015-11-02: Fix merged to ceilometer (stable/liberty)

Reviewed: https://review.openstack.org/240207
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ca45db1871c02a9c02f8f258ce128e2e4f7a4fcd
Submitter: Jenkins
Branch: stable/liberty

commit ca45db1871c02a9c02f8f258ce128e2e4f7a4fcd
Author: gordon chung <email address hidden>
Date: Wed Oct 21 08:44:58 2015 -0400

    avoid using isolation level

    Change-Id: Ic53e1addf38a4d5934b4e627c4c974c6db42517e
    Closes-Bug: #1506717
    (cherry picked from commit 898cd3d036c4358aa16f7b3e2028365dc9829213)

tags:

added: in-stable-liberty

Thierry Carrez (ttx) wrote on 2015-12-03: Fix included in openstack/ceilometer 6.0.0.0b1

This issue was fixed in the openstack/ceilometer 6.0.0.0b1 development milestone.

Thierry Carrez (ttx) on 2015-12-03

Changed in ceilometer:
status:	Fix Committed → Fix Released

Doug Hellmann (doug-hellmann) wrote on 2015-12-07: Fix included in openstack/ceilometer 5.0.1

This issue was fixed in the openstack/ceilometer 5.0.1 release.

Liusheng (liusheng) on 2015-12-14

Changed in ceilometer:
milestone:	none → mitaka-1
