Ceilometer GET events throws deadlock or timeout errors

Bug #1506717 reported by Divya K Konoor
This bug affects 1 person
Affects: Ceilometer
Status: Fix Released
Importance: Undecided
Assigned to: gordon chung
Milestone: (none listed)

Bug Description

As part of https://review.openstack.org/#/c/208107/ , the database isolation level was set to REPEATABLE READ for all databases other than SQLite. After this change, we see deadlock problems when calling the GET /events API. This doesn't happen every time, but it is not uncommon, and it usually happens when concurrent calls are made to ceilometer.

As per the documentation of RR (repeatable read), the entire table is locked for every unit of work performed, and this could be blocking other calls from going through.

This behavior seems to be associated with RR, and we should consider moving away from it. Other than ceilometer, I don't see any other services setting an isolation level explicitly. The deadlock or timeout behaviour can occur with any underlying database by virtue of what the isolation level is intended to do.

http://www.medtronicfeatures.com/wcm/help/admin/troubleshooting/wwhelp/wwhimpl/common/html/wwhelp.htm?context=troubleshooting&file=page_5_35.htm

https://msdn.microsoft.com/en-us/library/ms675307%28v=vs.85%29.aspx

http://dba.fyicenter.com/faq/mysql/Lock-Timeout-and-Deadlock.html

"If you are using transactions with REPEATABLE READ isolation level and transaction safe storage engines in your applications, data locks, lock timeouts, and dead lock detection will impact your application in a concurrent multi-user environment like Web sites in several ways"

[DB2/LINUXPPC64] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "2". SQLSTATE=40001 SQLCODE=-911". Detail:
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 84, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/ceilometer/api/controllers/v2/events.py", line 278, in get_all
    limit)]

  File "/usr/lib/python2.7/site-packages/ceilometer/event/storage/impl_sqlalchemy.py", line 290, in get_events
    models.EventType.desc, models.Event.raw).all():

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2399, in all
    return list(self)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 86, in instances
    util.raise_from_cause(err)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 67, in instances
    fetch = cursor.fetchall()

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 968, in fetchall
    self.cursor, self.context)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1337, in _handle_dbapi_exception
    util.raise_from_cause(newraise, exc_info)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 962, in fetchall
    l = self.process_rows(self._fetchall_impl())

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 913, in _fetchall_impl
    return self.cursor.fetchall()

  File "/usr/lib64/python2.7/site-packages/ibm_db_dbi.py", line 1460, in fetchall
    return self._fetch_helper()

  File "/usr/lib64/python2.7/site-packages/ibm_db_dbi.py", line 1417, in _fetch_helper
    raise self.messages[len(self.messages) - 1]

DBDeadlock: (ibm_db_dbi.ProgrammingError) ibm_db_dbi::ProgrammingError: Fetch Failure: [IBM][CLI Driver][DB2/LINUXPPC64] SQL0911N The current transaction has been rolled back because of a deadlock or timeout. Reason code "2". SQLSTATE=40001 SQLCODE=-911

gordon chung (chungg) wrote :

i think it's safe to drop the isolation level... but this needs to be verified.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/238038

Changed in ceilometer:
assignee: nobody → gordon chung (chungg)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/238038
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=898cd3d036c4358aa16f7b3e2028365dc9829213
Submitter: Jenkins
Branch: master

commit 898cd3d036c4358aa16f7b3e2028365dc9829213
Author: gordon chung <email address hidden>
Date: Wed Oct 21 08:44:58 2015 -0400

    avoid using isolation level

    depending on sql driver, REPEATABLE READ isolation level may lock
    an entire table and cause write timeouts. isolation level was set
    originally to ensure consistent reads between 2 queries required to
    build events. that said, we can avoid table locks by making
    assumption that the 1st query is the correct base and any difference
    given by 2nd query can be discarded.

    Change-Id: Ic53e1addf38a4d5934b4e627c4c974c6db42517e
    Closes-Bug: #1506717
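
The approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual ceilometer code: the function name `build_events`, the row shapes, and the sample data are all assumptions made for the example. It treats the result of the 1st query as the authoritative set of events and drops any row from the 2nd query that refers to an event the 1st query did not return, so the two reads no longer need to share a REPEATABLE READ snapshot.

```python
from collections import OrderedDict


def build_events(event_rows, trait_rows):
    """Merge two independently read result sets (hypothetical helper).

    event_rows: (event_id, event_type) tuples from the 1st query.
    trait_rows: (event_id, key, value) tuples from the 2nd query.
    """
    # The 1st query defines which events exist for this request.
    events = OrderedDict(
        (event_id, {"id": event_id, "type": event_type, "traits": {}})
        for event_id, event_type in event_rows
    )
    for event_id, key, value in trait_rows:
        event = events.get(event_id)
        if event is None:
            # Row was written between the two queries; discard it
            # instead of relying on a consistent snapshot.
            continue
        event["traits"][key] = value
    return list(events.values())


# Made-up sample data: event 3 appeared after the 1st query ran.
first_query = [(1, "compute.instance.create"), (2, "image.upload")]
second_query = [
    (1, "state", "active"),
    (2, "size", 1024),
    (3, "state", "error"),
]
events = build_events(first_query, second_query)
```

In this sketch, the trait for event 3 is silently ignored, so the response reflects the state seen by the 1st query. The trade-off is that an event returned by the 1st query may be missing traits written after the 2nd query ran.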

Changed in ceilometer:
status: In Progress → Fix Committed
Divya K Konoor (dikonoor) wrote :

Gordon, I believe this fix should be backported to Liberty.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/240207

Divya K Konoor (dikonoor) wrote :

I have cherry picked and backported the fix to stable/liberty.

OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/liberty)

Reviewed: https://review.openstack.org/240207
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ca45db1871c02a9c02f8f258ce128e2e4f7a4fcd
Submitter: Jenkins
Branch: stable/liberty

commit ca45db1871c02a9c02f8f258ce128e2e4f7a4fcd
Author: gordon chung <email address hidden>
Date: Wed Oct 21 08:44:58 2015 -0400

    avoid using isolation level

    depending on sql driver, REPEATABLE READ isolation level may lock
    an entire table and cause write timeouts. isolation level was set
    originally to ensure consistent reads between 2 queries required to
    build events. that said, we can avoid table locks by making
    assumption that the 1st query is the correct base and any difference
    given by 2nd query can be discarded.

    Change-Id: Ic53e1addf38a4d5934b4e627c4c974c6db42517e
    Closes-Bug: #1506717
    (cherry picked from commit 898cd3d036c4358aa16f7b3e2028365dc9829213)

tags: added: in-stable-liberty
Thierry Carrez (ttx) wrote : Fix included in openstack/ceilometer 6.0.0.0b1

This issue was fixed in the openstack/ceilometer 6.0.0.0b1 development milestone.

Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/ceilometer 5.0.1

This issue was fixed in the openstack/ceilometer 5.0.1 release.

Liusheng (liusheng)
Changed in ceilometer:
milestone: none → mitaka-1