placement aggregate handling has lock trouble with sqlite files

Bug #1789633 reported by Chris Dent
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Chris Dent

Bug Description

Not sure if this is in itself a bug, but instead indicates that there are issues with aggregate handling that may show up in the real world under some conditions.

The placecat tooling at https://github.com/cdent/placecat has a suite of automated tests that run every now and again against a docker container built from nova master. It uses a file-based sqlite database. With recent updates those tests are now failing with 'database locked' errors (traceback below is from functional tests, but the one in the placecat server is the same). travis logs at https://travis-ci.org/cdent/placecat/builds/421991104 but the traceback is not visible there

In a nova checkout, I've changed the functional tests to use a file based database for sqlite and can intermittently replicate the problem when PUTting aggregates for a resource provider. It seems this came in with 2d7ed309ec4 ( https://review.openstack.org/#/c/592654/ ).

-=-=-
       ERROR [nova.api.openstack.placement.fault_wrap] Placement API unexpected error: (sqlite3.OperationalError) database is locked (Background on this error at: http://sqlalche.me/e/e3q8)
    Traceback (most recent call last):
      File "nova/api/openstack/placement/fault_wrap.py", line 40, in __call__
        return self.application(environ, start_response)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/dec.py", line 129, in __call__
        resp = self.call_func(req, *args, **kw)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/dec.py", line 193, in call_func
        return self.func(req, *args, **kwargs)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/microversion_parse/middleware.py", line 80, in __call__
        response = req.get_response(self.application)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/request.py", line 1313, in send
        application, catch_exc_info=False)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/request.py", line 1277, in call_application
        app_iter = application(self.environ, start_response)
      File "nova/api/openstack/placement/handler.py", line 213, in __call__
        return dispatch(environ, start_response, self._map)
      File "nova/api/openstack/placement/handler.py", line 150, in dispatch
        return handler(environ, start_response)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/dec.py", line 129, in __call__
        resp = self.call_func(req, *args, **kw)
      File "nova/api/openstack/placement/wsgi_wrapper.py", line 29, in call_func
        super(PlacementWsgify, self).call_func(req, *args, **kwargs)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/webob/dec.py", line 193, in call_func
        return self.func(req, *args, **kwargs)
      File "nova/api/openstack/placement/util.py", line 191, in decorated_function
        return f(req)
      File "nova/api/openstack/placement/microversion.py", line 166, in decorated_func
        return _find_method(f, version, status_code)(req, *args, **kwargs)
      File "nova/api/openstack/placement/handlers/aggregate.py", line 131, in set_aggregates
        increment_generation=consider_generation)
      File "nova/api/openstack/placement/handlers/aggregate.py", line 72, in _set_aggregates
        aggregate_uuids, increment_generation=increment_generation)
      File "nova/api/openstack/placement/objects/resource_provider.py", line 991, in set_aggregates
        increment_generation=increment_generation)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
        return fn(*args, **kwargs)
      File "nova/api/openstack/placement/objects/resource_provider.py", line 557, in _set_aggregates
        agg_id = _ensure_aggregate(context, agg_uuid)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
        return fn(*args, **kwargs)
      File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1043, in _transaction_scope
        yield resource
      File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 653, in _session
        self.session.rollback()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
        self.force_reraise()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
        six.reraise(self.type_, self.value, self.tb)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 650, in _session
        self._end_session_transaction(self.session)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 678, in _end_session_transaction
        session.commit()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 943, in commit
        self.transaction.commit()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 471, in commit
        t[1].commit()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1643, in commit
        self._do_commit()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1674, in _do_commit
        self.connection._commit_impl()
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 726, in _commit_impl
        self._handle_dbapi_exception(e, None, None, None, None)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
        util.raise_from_cause(newraise, exc_info)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
        reraise(type(exception), exception, tb=exc_tb, cause=cause)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 724, in _commit_impl
        self.engine.dialect.do_commit(self.connection)
      File "/mnt/share/cdentsrc/nova/.tox/py27/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 462, in do_commit
        dbapi_connection.commit()
    OperationalError: (sqlite3.OperationalError) database is locked (Background on this error at: http://sqlalche.me/e/e3q8)
        INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "PUT /resource_providers/ed83df92-38c2-4674-8aaa-774190e9d5f2/aggregates" status: 500 len: 362 microversion: 1.30
-=-=-

Revision history for this message
Chris Dent (cdent) wrote :

Note that I've tried sqlite_synchronous both True and False. Makes no difference.

Revision history for this message
Jay Pipes (jaypipes) wrote :

Are you only able to replicate the locking when concurrently attempting to associate providers with the same aggregate *and* that aggregate has yet to be "created" by the placement service yet?

Revision history for this message
Chris Dent (cdent) wrote :

The flow of the failure I'm seeing is that the lock failure happens on a PUT to set an aggregate that also requires the aggregate to be created. The following fails, and in this case the "d445d32f-a6a7-43d2-8f93-8fa419b1033e" aggregate does not yet exist.

- name: breadbox in aggregate
  PUT: /resource_providers/98b5eda9-695e-44fa-b81f-fbffb4279916/aggregates
  data:
      resource_provider_generation: $RESPONSE['$.resource_provider_generation']
      aggregates:
          - d445d32f-a6a7-43d2-8f93-8fa419b1033e

Revision history for this message
Chris Dent (cdent) wrote :

If the aggregate already exists, no failure.

Revision history for this message
Chris Dent (cdent) wrote :

If I change the context manager on _ensure_aggregate to not be independent, it's okay. Will push up a canary patch to see if that breaks things for mysql, etc.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/597486

Revision history for this message
Chris Dent (cdent) wrote :

The tests pass with the related fix and the http://logs.openstack.org/86/597486/1/check/nova-next/c048380/logs/placement-perf.txt.gz is not showing any issues with creating aggregates so I'm going to change the patch to "real" and assuming it (and a few rechecks) remain happy, I guess we good to go.

Changed in nova:
assignee: nobody → Chris Dent (cdent)
status: New → In Progress
Revision history for this message
Chris Dent (cdent) wrote :

For reference: the proposed patch fixes the original problem found in placecat.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/597486
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=757983a4cfe3107ea6ffd0b416790ae23d91ef2e
Submitter: Zuul
Branch: master

commit 757983a4cfe3107ea6ffd0b416790ae23d91ef2e
Author: Chris Dent <email address hidden>
Date: Wed Aug 29 13:36:16 2018 +0100

    [placement] Make _ensure_aggregate context not independent

    The use of the independent context on _ensure_aggregate appears to
    be unnecessary. It causes file-based uses of SQLite dbs to fail
    (with database locked errors, as reported in the associated bug,
    1789633) and thus may mask issues with other databases. Adding the
    independent context manager was the result of a series of "throw
    stuff at the wall and see what sticks" patches, but it looks now
    that it is not required, and in some situations causes problems.

    Runs through the gate show that the behavior it was fixing (as
    described in bug 1786703) is not happening.

    Change-Id: I1f325d55ec256db34a4c3bbd230dcd8a91bce542
    Related-Bug: #1786703
    Closes-Bug: #1789633

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.