Placement duplicate aggregate uuid handling during concurrent aggregate create insufficiently robust

Bug #1786703 reported by Chris Dent
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Jay Pipes
Milestone: (none)

Bug Description

NOTE: This may be just a postgresql problem, not sure.

When doing some further experiments with load testing placement, my resource provider create script, which uses asyncio, was able to cause several 500 errors from the placement service of the following form:

```
cdent-a01:~/src/placeload(master) $ docker logs zen_murdock |grep 'req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2'
2018-08-12 16:03:30.698 9 DEBUG nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Starting request: 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" __call__ /usr/lib/python3.6/site-packages/nova/api/openstack/placement/requestlog.py:38
2018-08-12 16:03:30.903 9 ERROR nova.api.openstack.placement.fault_wrap [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] Placement API unexpected error: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (psycopg2.IntegrityError) duplicate key value violates unique constraint "uniq_placement_aggregates0uuid"
2018-08-12 16:03:30.914 9 INFO nova.api.openstack.placement.requestlog [req-d4dcbfed-b050-4a3b-ab0f-d2489a31c3f2 admin admin - - -] 172.17.0.1 "PUT /resource_providers/13b09bc9-164f-4d03-8a61-5e78c05a73ad/aggregates" status: 500 len: 997 microversion: 1.29
```

"DETAIL: Key (uuid)=(14a5c8a3-5a99-4e8f-88be-00d85fcb1c17) already exists."

This is because the code at https://github.com/openstack/nova/blob/a29ace1d48b5473b9e7b5decdf3d5d19f3d262f3/nova/api/openstack/placement/objects/resource_provider.py#L519-L529 is not trapping the right error when the server decides it needs to create a new aggregate at the same time that another request is already creating it.

It's not clear to me whether this is because oslo_db is not transforming the postgresql error properly, or whether the generic error trapped there is the wrong one and we've simply never noticed because we don't normally hit the concurrency situation this hard.
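
For reference, here is a minimal sketch of the check-then-insert pattern that races here and of the handling we would expect, i.e. trapping oslo_db's DBDuplicateEntry and re-reading the row. Table layout and names are simplified for illustration; this is not the actual placement code.

```python
# Minimal sketch, assuming a SQLAlchemy Core table shaped like
# placement_aggregates; illustrative only, not the nova code.
import sqlalchemy as sa
from oslo_db import exception as db_exc

_AGG_TBL = sa.Table(
    'placement_aggregates', sa.MetaData(),
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('uuid', sa.String(36), unique=True),
)


def ensure_aggregate(session, agg_uuid):
    """Return the internal id for agg_uuid, creating the row if needed."""
    sel = sa.select([_AGG_TBL.c.id]).where(_AGG_TBL.c.uuid == agg_uuid)
    row = session.execute(sel).fetchone()
    if row:
        return row[0]
    try:
        # Two concurrent requests can both reach this INSERT; the loser
        # violates the unique constraint on uuid.
        res = session.execute(_AGG_TBL.insert().values(uuid=agg_uuid))
        return res.inserted_primary_key[0]
    except db_exc.DBDuplicateEntry:
        # Another request created it first: re-read instead of 500ing.
        # Per this bug, the exception that actually surfaces (at least on
        # PostgreSQL) is not DBDuplicateEntry, so a branch like this one
        # never runs.
        return session.execute(sel).fetchone()[0]
```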

Tags: db placement
Revision history for this message
Matt Riedemann (mriedem) wrote :

Which version of postgresql are you using? I'm guessing 8.x?

This is the oslo.db code that is meant to translate the IntegrityError:

http://git.openstack.org/cgit/openstack/oslo.db/tree/oslo_db/sqlalchemy/exc_filters.py#n104
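
For context, that translation depends on matching the backend's error string; something along these lines has to recognize the PostgreSQL message before a DBDuplicateEntry can be raised (the regex below is only an illustration, not the exact filter oslo.db ships):

```python
# Illustration only: a pattern in the spirit of oslo.db's exception
# filters, matching the PostgreSQL duplicate-key message from the log.
import re

_PG_DUP = re.compile(
    r'duplicate key value violates unique constraint "(?P<constraint>[^"]+)"')

msg = ('duplicate key value violates unique constraint '
       '"uniq_placement_aggregates0uuid"')
match = _PG_DUP.search(msg)
assert match and match.group('constraint') == 'uniq_placement_aggregates0uuid'
# If no filter matches, or an intermediate layer swallows the original
# IntegrityError, callers never see DBDuplicateEntry.
```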

Revision history for this message
Matt Riedemann (mriedem) wrote :

Sorry I meant 9.x.

Revision history for this message
Chris Dent (cdent) wrote :

10.4, I'm using bionic (in this case).

Revision history for this message
Jay Pipes (jaypipes) wrote :

Seems that whatever is wrapping the transaction rollback is not re-raising the underlying exception properly, so the oslo_db.exception.DBDuplicateEntry isn't being raised.
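
As a self-contained illustration of that hypothesis, using plain SQLAlchemy and in-memory SQLite rather than the nova code: if the layer that catches the failed flush neither rolls back nor re-raises, the next use of the session raises the generic "transaction has been rolled back" error that appears in the 500 above, masking the duplicate-key error.

```python
# Illustration with plain SQLAlchemy + in-memory SQLite; not nova code.
import sqlalchemy as sa
from sqlalchemy import exc as sqla_exc
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Aggregate(Base):
    __tablename__ = 'placement_aggregates'
    id = sa.Column(sa.Integer, primary_key=True)
    uuid = sa.Column(sa.String(36), unique=True)


engine = sa.create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Aggregate(uuid='14a5c8a3-5a99-4e8f-88be-00d85fcb1c17'))
session.commit()

try:
    # A second "request" tries to create the same aggregate.
    session.add(Aggregate(uuid='14a5c8a3-5a99-4e8f-88be-00d85fcb1c17'))
    session.flush()
except sqla_exc.IntegrityError:
    pass  # swallowed without session.rollback() and without re-raising

try:
    session.query(Aggregate).all()
except sqla_exc.InvalidRequestError as err:
    # "This Session's transaction has been rolled back due to a previous
    # exception during flush ..." -- the real duplicate-key error is masked.
    print(err)
```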

Revision history for this message
Chris Dent (cdent) wrote :

When using MySQL, the same high concurrency results in a different problem (pasted from http://logs.openstack.org/67/591367/4/check/nova-next/f8892df/logs/screen-placement-api.txt.gz#_Aug_13_23_11_22_561724):

Aug 13 23:11:22.553969 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: DEBUG nova.api.openstack.placement.requestlog [None req-b524d09f-0bae-4855-9e3b-776abe7c6c2f None None] Starting request: 198.72.124.103 "PUT /placement/resource_providers/381ee627-4cc6-4dee-88a1-4f442de6e553/aggregates" {{(pid=634) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Aug 13 23:11:22.561724 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap [None req-216bb3ea-5d1e-40af-a7f6-d8f27dbfbc28 None None] Placement API unexpected error: (pymysql.err.OperationalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'INSERT INTO resource_provider_aggregates (resource_provider_id, aggregate_id, created_at) SELECT 93, placement_aggregates.id, %(created_at)s AS anon_1 \nFROM placement_aggregates \nWHERE placement_aggregates.uuid IN (%(uuid_1)s, %(uuid_2)s)'] [parameters: {u'uuid_2': u'66d98e7c-3c25-485d-a0dc-1cea651884de', 'created_at': datetime.datetime(2018, 8, 13, 23, 11, 22, 548464), u'uuid_1': u'14a5c8a3-5a99-4e8f-88be-00d85fcb1c17'}] (Background on this error at: http://sqlalche.me/e/e3q8): DBDeadlock: (pymysql.err.OperationalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'INSERT INTO resource_provider_aggregates (resource_provider_id, aggregate_id, created_at) SELECT 93, placement_aggregates.id, %(created_at)s AS anon_1 \nFROM placement_aggregates \nWHERE placement_aggregates.uuid IN (%(uuid_1)s, %(uuid_2)s)'] [parameters: {u'uuid_2': u'66d98e7c-3c25-485d-a0dc-1cea651884de', 'created_at': datetime.datetime(2018, 8, 13, 23, 11, 22, 548464), u'uuid_1': u'14a5c8a3-5a99-4e8f-88be-00d85fcb1c17'}] (Background on this error at: http://sqlalche.me/e/e3q8)
Aug 13 23:11:22.562429 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap Traceback (most recent call last):
Aug 13 23:11:22.562605 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap File "/opt/stack/new/nova/nova/api/openstack/placement/fault_wrap.py", line 40, in __call__
Aug 13 23:11:22.562773 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap return self.application(environ, start_response)
Aug 13 23:11:22.562959 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 129, in __call__
Aug 13 23:11:22.563123 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api.openstack.placement.fault_wrap resp = self.call_func(req, *args, **kw)
Aug 13 23:11:22.563297 ubuntu-xenial-inap-mtl01-0001304255 <email address hidden>[628]: ERROR nova.api....

Revision history for this message
Jay Pipes (jaypipes) wrote :

The second problem, seen with MySQL, is different. We might be able to fix that one by converting the code to execute multiple single-row INSERT statements instead of a single INSERT ... SELECT.
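
For reference, a sketch in SQLAlchemy Core of the multi-row INSERT ... SELECT that shows up in the deadlock trace, alongside the single-row form being suggested. The table objects are assumed stand-ins for resource_provider_aggregates and placement_aggregates, not the real model definitions.

```python
# Sketch only; rpa_tbl / agg_tbl are assumed stand-ins for placement's
# resource_provider_aggregates and placement_aggregates tables.
import sqlalchemy as sa


def insert_select_style(session, rpa_tbl, agg_tbl, rp_id, agg_uuids, now):
    # The statement from the deadlock trace: one INSERT ... SELECT that
    # joins against placement_aggregates to resolve uuids to ids.
    sel = sa.select(
        [sa.literal(rp_id), agg_tbl.c.id, sa.literal(now)],
    ).where(agg_tbl.c.uuid.in_(agg_uuids))
    session.execute(rpa_tbl.insert().from_select(
        ['resource_provider_id', 'aggregate_id', 'created_at'], sel))


def single_row_style(session, rpa_tbl, rp_id, agg_ids, now):
    # The suggestion: resolve the aggregate ids first, then issue one
    # plain INSERT ... VALUES per association.
    for agg_id in agg_ids:
        session.execute(rpa_tbl.insert().values(
            resource_provider_id=rp_id, aggregate_id=agg_id, created_at=now))
```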

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/591609

Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
status: New → In Progress
Changed in nova:
assignee: Jay Pipes (jaypipes) → Chris Dent (cdent)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/592654

Changed in nova:
assignee: Chris Dent (cdent) → Jay Pipes (jaypipes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Jay Pipes (<email address hidden>) on branch: master
Review: https://review.openstack.org/591609
Reason: didn't work, clearly...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/592654
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Submitter: Zuul
Branch: master

commit 2d7ed309ec4e656ce9d6f21f03ea158278f2526d
Author: Jay Pipes <email address hidden>
Date: Thu Aug 16 14:56:47 2018 -0400

    placement: use single-shot INSERT/DELETE agg

    When replacing a provider's set of aggregate associations, we were
    issuing a call to:

     DELETE resource_provider_aggregates WHERE resource_provider_id = $rp

    and then a single call to:

     INSERT INTO resource_provider_aggregates
     SELECT $rp, aggs.id
     FROM provider_aggregates AS aggs
     WHERE aggs.uuid IN ($agg_uuids)

    This patch changes the _set_aggregates() function in a few ways.
    First, we grab the aggregate's internal ID value when creating new
    aggregate records (or grabbing a provider's existing aggregate
    associations). This eliminates the need for any join to
    provider_aggregates in an INSERT/DELETE statement.

    Second, instead of a multi-row INSERT .. SELECT statement, we do
    single-shot INSERT ... VALUES statements, one for each added aggregate.

    Third, we no longer DELETE all aggregate associations for the provider
    in question. Instead, we issue single-shot DELETE statements for only
    the aggregates that are being disassociated.

    Finally, I've added a number of log debug statements so that we can have
    a little more information if this particular patch does not fix the
    deadlock issue described in the associated bug.

    Change-Id: I87e765305017eae1424005f7d6f419f42a2f8370
    Closes-bug: #1786703
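
A rough sketch of the flow the commit above describes, against simplified table objects and assuming the caller has already resolved aggregate uuids to internal ids (the commit's first change); it is not the merged code itself.

```python
# Simplified sketch of the approach described in the commit message.
import sqlalchemy as sa


def set_aggregates(session, rpa_tbl, rp_id, existing_agg_ids, wanted_agg_ids):
    to_add = wanted_agg_ids - existing_agg_ids
    to_remove = existing_agg_ids - wanted_agg_ids

    for agg_id in to_add:
        # Single-shot INSERT ... VALUES per new association.
        session.execute(rpa_tbl.insert().values(
            resource_provider_id=rp_id, aggregate_id=agg_id))

    for agg_id in to_remove:
        # Targeted DELETE per dropped association, instead of deleting all
        # of the provider's rows and re-inserting them.
        session.execute(rpa_tbl.delete().where(sa.and_(
            rpa_tbl.c.resource_provider_id == rp_id,
            rpa_tbl.c.aggregate_id == agg_id)))
```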

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/597486
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=757983a4cfe3107ea6ffd0b416790ae23d91ef2e
Submitter: Zuul
Branch: master

commit 757983a4cfe3107ea6ffd0b416790ae23d91ef2e
Author: Chris Dent <email address hidden>
Date: Wed Aug 29 13:36:16 2018 +0100

    [placement] Make _ensure_aggregate context not independent

    The use of the independent context on _ensure_aggregate appears to
    be unnecessary. It causes file-based uses of SQLite dbs to fail
    (with database locked errors, as reported in the associated bug,
    1789633) and thus may mask issues with other databases. Adding the
    independent context manager was the result of a series of "throw
    stuff at the wall and see what sticks" patches, but it looks now
    that it is not required, and in some situations causes problems.

    Runs through the gate show that the behavior it was fixing (as
    described in bug 1786703) is not happening.

    Change-Id: I1f325d55ec256db34a4c3bbd230dcd8a91bce542
    Related-Bug: #1786703
    Closes-Bug: #1789633
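
In code terms the change amounts to dropping the "independent" transaction modifier from the helper so it joins the caller's transaction. The sketch below uses oslo.db's enginefacade directly with simplified names; it is an assumption about the shape of the change, not the merged diff.

```python
# Rough illustration of the pattern being removed, using oslo.db's
# enginefacade with simplified names; not the actual nova diff.
from oslo_db.sqlalchemy import enginefacade

context_manager = enginefacade.transaction_context()
context_manager.configure(connection='sqlite://')


@enginefacade.transaction_context_provider
class RequestContext(object):
    pass


# Before: the helper always opened its own transaction, independent of any
# transaction the caller already held.  With a file-based SQLite database
# the outer and inner transactions contend for the same file, producing
# "database is locked" errors (bug 1789633).
@context_manager.writer.independent
def ensure_aggregate_independent(ctx, agg_uuid):
    pass  # the INSERT/SELECT against placement_aggregates would go here


# After: the helper simply joins the caller's transaction.
@context_manager.writer
def ensure_aggregate(ctx, agg_uuid):
    pass
```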

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
