placement: claim allocations fails with IndexError in _ensure_lookup_table_entry

Bug #1731668 reported by Matt Riedemann
Affects                    Status        Importance   Assigned to    Milestone
OpenStack Compute (nova)   Fix Released  High         Jay Pipes
Queens (series)            Fix Released  Low          melanie witt

Bug Description

http://logs.openstack.org/73/513473/2/check/legacy-tempest-dsvm-cells/be1a0fa/logs/screen-placement-api.txt.gz#_Nov_10_01_14_05_230382

Nov 10 01:14:05.230382 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler [None req-1f685650-1d9a-4194-b033-31a419813464 service placement] Uncaught exception: IndexError: list index out of range
Nov 10 01:14:05.230656 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler Traceback (most recent call last):
Nov 10 01:14:05.230866 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/handler.py", line 217, in __call__
Nov 10 01:14:05.231069 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return dispatch(environ, start_response, self._map)
Nov 10 01:14:05.231302 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/handler.py", line 144, in dispatch
Nov 10 01:14:05.231489 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return handler(environ, start_response)
Nov 10 01:14:05.231701 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 131, in __call__
Nov 10 01:14:05.231942 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler resp = self.call_func(req, *args, **self.kwargs)
Nov 10 01:14:05.232157 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/wsgi_wrapper.py", line 29, in call_func
Nov 10 01:14:05.232382 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler super(PlacementWsgify, self).call_func(req, *args, **kwargs)
Nov 10 01:14:05.232611 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 196, in call_func
Nov 10 01:14:05.232841 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return self.func(req, *args, **kwargs)
Nov 10 01:14:05.233056 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/microversion.py", line 269, in decorated_func
Nov 10 01:14:05.233258 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return _find_method(f, version)(req, *args, **kwargs)
Nov 10 01:14:05.233456 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/util.py", line 147, in decorated_function
Nov 10 01:14:05.233689 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return f(req)
Nov 10 01:14:05.233897 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py", line 282, in set_allocations
Nov 10 01:14:05.234101 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return _set_allocations(req, ALLOCATION_SCHEMA_V1_8)
Nov 10 01:14:05.234222 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py", line 248, in _set_allocations
Nov 10 01:14:05.234337 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler allocations.create_all()
Nov 10 01:14:05.234453 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/objects/resource_provider.py", line 1889, in create_all
Nov 10 01:14:05.234567 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler self._set_allocations(self._context, self.objects)
Nov 10 01:14:05.234685 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 150, in wrapper
Nov 10 01:14:05.234793 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler ectxt.value = e.inner_exc
Nov 10 01:14:05.234902 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Nov 10 01:14:05.235015 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler self.force_reraise()
Nov 10 01:14:05.235129 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Nov 10 01:14:05.235242 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler six.reraise(self.type_, self.value, self.tb)
Nov 10 01:14:05.235352 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 138, in wrapper
Nov 10 01:14:05.235460 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return f(*args, **kwargs)
Nov 10 01:14:05.235577 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 984, in wrapper
Nov 10 01:14:05.235701 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return fn(*args, **kwargs)
Nov 10 01:14:05.235821 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/objects/resource_provider.py", line 1817, in _set_allocations
Nov 10 01:14:05.235936 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler alloc.ensure_consumer_project_user(conn)
Nov 10 01:14:05.236049 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/objects/resource_provider.py", line 1469, in ensure_consumer_project_user
Nov 10 01:14:05.236163 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler pid = _ensure_project(conn, self.project_id)
Nov 10 01:14:05.236272 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/objects/resource_provider.py", line 1691, in _ensure_project
Nov 10 01:14:05.236391 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return _ensure_lookup_table_entry(conn, _PROJECT_TBL, external_id)
Nov 10 01:14:05.236506 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler File "/opt/stack/new/nova/nova/objects/resource_provider.py", line 1680, in _ensure_lookup_table_entry
Nov 10 01:14:05.236617 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler return res[0][0]
Nov 10 01:14:05.236727 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler IndexError: list index out of range
Nov 10 01:14:05.236912 ubuntu-xenial-rax-iad-0000805952 <email address hidden>[13847]: ERROR nova.api.openstack.placement.handler

Looks like this has at least been around for a week:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unable%20to%20submit%20allocation%20for%20instance%5C%22%20AND%20message%3A%5C%22The%20server%20has%20either%20erred%20or%20is%20incapable%20of%20performing%20the%20requested%20operation%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=10d
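
For reference, a minimal, self-contained sketch of the failing lookup-or-create pattern, using the stdlib sqlite3 module (table and column names here are illustrative, not nova's actual schema): insert, swallow a duplicate-key error, then read the id back with res[0][0].

```python
import sqlite3

# Illustrative sketch of the _ensure_lookup_table_entry pattern; the
# projects table and its columns are assumptions, not nova's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects (id INTEGER PRIMARY KEY, external_id TEXT UNIQUE)")

def ensure_project(conn, external_id):
    try:
        conn.execute(
            "INSERT INTO projects (external_id) VALUES (?)", (external_id,))
    except sqlite3.IntegrityError:
        pass  # assume a racing request created the row first
    res = conn.execute(
        "SELECT id FROM projects WHERE external_id = ?",
        (external_id,)).fetchall()
    # If the racing insert is not visible to this transaction, res is
    # empty and the next line raises IndexError -- the traceback above.
    return res[0][0]

print(ensure_project(conn, "project-uuid"))  # -> 1
```

Calling it a second time with the same external_id takes the IntegrityError path and still returns 1, which is the behavior the real code expected but did not always get.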

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/519034

Revision history for this message
Matt Riedemann (mriedem) wrote :

It looks like in the test run that failed, the test is creating 2 instances, the first gets its allocations created successfully in placement, and the second fails, but I'm not really sure why.

Jay Pipes (jaypipes)
Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
Revision history for this message
Jay Pipes (jaypipes) wrote :

OK, so I found the root cause of this problem.

Basically, what's going on is that we're checking for DBDuplicateError after running conn.execute(). And "conn" is an SQLAlchemy Connection object. The issue is that DBDuplicateError is not an SQLAlchemy exception. Instead, it is an oslo_db-specific exception that is used to mask differences between DB drivers that variably return an OperationalError or an IntegrityError when a duplicate key is found.

https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/exc_filters.py#L95-L108

In practice, what this means is that all calling locations that use the straight "conn.execute()" calling sequence cannot rely on catching DBDuplicateError (because SQLAlchemy's Connection.execute() method will never raise it). Instead, all of those calling locations need to be updated to use context.session.execute() instead. This will allow DBDuplicateError to be caught properly.
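
As a toy illustration of that layering mismatch (the exception classes below are stand-ins, not the real oslo_db or driver classes): catching the wrapper class on the raw-connection path matches nothing, so the driver's own error escapes.

```python
# Stand-in exception classes for the two layers described above.
class DriverIntegrityError(Exception):
    """What a DB driver raises on a duplicate key."""

class DBDuplicateError(Exception):
    """The oslo_db wrapper, raised only on the Session code path."""

def insert_via_raw_connection():
    # conn.execute() bypasses oslo_db's exception filters, so the
    # driver's own error propagates unwrapped.
    raise DriverIntegrityError("duplicate key")

def buggy_caller():
    try:
        insert_via_raw_connection()
    except DBDuplicateError:
        return "duplicate handled"  # never reached on this path
    return "inserted"

try:
    buggy_caller()
except DriverIntegrityError:
    print("driver error escaped the DBDuplicateError handler")
```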

In addition, all the manual construction of transactions in resource_provider.py that do:

 with conn.begin()

will need to have those manual begin() calls removed. The reason is that oslo_db's Session object (which is what is injected by the enginefacade into the nova.context.RequestContext) automatically handles creation, commit and rollback of transactions based on whether the source entrypoint is a writer or reader context.
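
A rough before/after sketch of that transaction-management change, with the enginefacade behavior faked by a stdlib contextmanager (this is a sketch of the idea, not the enginefacade API):

```python
import contextlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Before: the caller opens the transaction itself, analogous to the
# "with conn.begin():" blocks being removed.
with conn:  # sqlite3's connection context commits or rolls back
    conn.execute("INSERT INTO t VALUES (1)")

# After: a writer context owns commit/rollback on behalf of the
# caller, the way oslo_db's enginefacade does for context.session.
@contextlib.contextmanager
def writer_session(conn):
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise

with writer_session(conn) as session:
    session.execute("INSERT INTO t VALUES (2)")

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # -> 2
```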

Changed in nova:
status: Confirmed → Triaged
Revision history for this message
Jay Pipes (jaypipes) wrote :

Patch with fixes coming shortly.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/523192

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Eric Fried (<email address hidden>) on branch: master
Review: https://review.openstack.org/519034
Reason: Real fix at https://review.openstack.org/523192

Changed in nova:
assignee: Jay Pipes (jaypipes) → Eric Fried (efried)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Eric Fried (efried) → Jay Pipes (jaypipes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/523192
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=31751a7149a3bb4b51087aecdc3b14c061c60d9b
Submitter: Zuul
Branch: master

commit 31751a7149a3bb4b51087aecdc3b14c061c60d9b
Author: Jay Pipes <email address hidden>
Date: Mon Nov 27 13:50:00 2017 -0500

    Use oslo_db Session in resource_provider.py

    In a number of places in resource_provider.py, we're checking for
    DBDuplicateError after running conn.execute(). And "conn" is an
    SQLAlchemy Connection object. The issue is that DBDuplicateError is not
    an SQLAlchemy exception. Instead, it is an oslo_db-specific exception
    that is used to mask differences between DB drivers that variably return
    an OperationalError or an IntegrityError when a duplicate key is found
    [1].

    In practice, what this means is that all calling locations that use the
    straight "conn.execute()" calling sequence cannot rely on catching
    DBDuplicateError (because SQLAlchemy's Connection.execute() method will
    never raise it). Instead, all of those calling locations needed to be
    updated to use context.session.execute() instead. This allows
    DBDuplicateError to be caught properly.

    In addition, all the manual construction of transactions in
    resource_provider.py that did:

     with conn.begin()

    needed to have those manual begin() calls removed. The reason is that
    oslo_db's Session object (which is what is injected by the enginefacade
    into the nova.context.RequestContext) automatically handles creation,
    commit and rollback of transactions based on whether the source
    entrypoint is a writer or reader context.

    Closes-bug: #1731668
    [1]: https://github.com/openstack/oslo.db/blob/stable/pike/oslo_db/sqlalchemy/exc_filters.py#L95-L153

    Change-Id: Ic5860cbef4d47e847bb5f190526b5a8ea3c33298

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b2

This issue was fixed in the openstack/nova 17.0.0.0b2 development milestone.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Under downstream Rally testing with Queens, this bug may not have been fully fixed by this patch:
https://bugzilla.redhat.com/show_bug.cgi?id=1718127

I have triaged the downstream bug as valid, so it will enter our backlog. If we find a more reasonable reproducer than running Rally and hoping it breaks, we will update the bug.

We likely won't start working on this immediately, but hopefully it will be reproducible with a functional test.

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Sean - if the bug is in placement you'd need to track the story, at least initially, in storyboard:

https://storyboard.openstack.org/#!/project/1113

Revision history for this message
melanie witt (melwitt) wrote :

Re-opening this for queens as Sean mentioned [1], the bug was not fully fixed at the time of the original fix.

Setting the importance to Low as this has been latent in queens for a long time.

And this is a nova bug as it only appears in integrated placement -- the bug is in data migration related code that was removed entirely after the queens release.

[1] https://bugs.launchpad.net/nova/+bug/1731668/comments/9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/756636

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/756637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/756636
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d2a491695078742c3e11cd22bbeae0a1e4e1591
Submitter: Zuul
Branch: stable/queens

commit 7d2a491695078742c3e11cd22bbeae0a1e4e1591
Author: melanie witt <email address hidden>
Date: Thu Oct 8 02:33:14 2020 +0000

    [stable-only] Add functional test for bug 1731668

    This reproduces the bug to demonstrate the problem in a real MySQL
    database (the bug cannot be reproduced using sqlite) and verify that
    the fix works properly.

    The bug has to do with parallel transactions and isolation between
    transactions and the sqlite database does not have any isolation
    between operations on the same database connection:

      https://www.sqlite.org/isolation.html

    and we can't use separate in-memory database connections to sqlite for
    each transaction because each separate connection creates its own
    copy of the database. What we want is one copy but isolated
    transactions writing and reading from that one copy.

    Note: this is a stable-only change because the placement code
    containing the bug was temporary data-transitioning code that no longer
    exists in placement after Queens.

    Related-Bug: #1731668

    Change-Id: I93d34c68bf89511dfeb53a4776c478b88463f24f
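
The sqlite constraint this commit message describes is easy to see with the stdlib: each plain ":memory:" connection is its own separate database, so two isolated "transactions" can never share one copy of the data.

```python
import sqlite3

# Two separate in-memory connections are two separate databases, which
# is why the functional test above needs a real MySQL backend to model
# isolated transactions over one shared copy of the data.
a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
a.execute("CREATE TABLE t (x INTEGER)")
a.execute("INSERT INTO t VALUES (1)")

try:
    b.execute("SELECT x FROM t")
    print("b sees a's table")
except sqlite3.OperationalError as exc:
    print(exc)  # no such table: t
```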

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/756637
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d696a636bbd8c07bd8c609abfcb513453757b1fc
Submitter: Zuul
Branch: stable/queens

commit d696a636bbd8c07bd8c609abfcb513453757b1fc
Author: melanie witt <email address hidden>
Date: Thu Oct 8 02:41:02 2020 +0000

    [stable-only] Use a separate transaction for reading after race

    As part of data migration code to ensure a Project/User record,
    integrated placement creates a record if it does not exist. However,
    during a race situation, the logic to read back a record written by
    a racing request fails because a change written in a separate
    transaction will not be reflected in the current transaction when the
    isolation level is REPEATABLE_READ (the default):

      https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html

    This changes the read after catching DBDuplicateEntry to occur in a
    separate transaction so it can pick up the change written by the racing
    parallel request.

    Note: this is a stable-only change because the placement code
    containing the bug was temporary data-transitioning code that no longer
    exists in placement after Queens.

    Closes-Bug: #1731668

    Change-Id: Ic90beaaa9848a4f39f4223f31a55cd2c681959cd
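
A toy, in-memory model of the race and the fix described in this commit (no real database; names are illustrative, not nova's code). A "transaction" here reads from a snapshot taken when it started, mimicking REPEATABLE_READ, while a freshly started transaction sees the latest committed state.

```python
committed = {}  # shared table: external_id -> id

class Txn:
    def __init__(self):
        self.snapshot = dict(committed)  # REPEATABLE_READ-style snapshot

    def select_id(self, external_id):
        return self.snapshot.get(external_id)

def ensure_project(txn, external_id):
    if external_id in committed:
        # Duplicate key: a racing request already inserted the row.
        # The buggy code re-read via the *current* transaction (whose
        # snapshot may predate that insert); the fix reads via a new
        # transaction that can see the committed row.
        return Txn().select_id(external_id)
    committed[external_id] = len(committed) + 1
    txn.snapshot[external_id] = committed[external_id]
    return committed[external_id]

txn_a, txn_b = Txn(), Txn()          # two concurrent requests
ensure_project(txn_a, "p1")          # A inserts and commits
print(txn_b.select_id("p1"))         # None: B's old snapshot misses it
print(ensure_project(txn_b, "p1"))   # 1: a fresh transaction sees it
```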

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
