Creation of an existing resource takes too much time or fails

Bug #1831647 reported by Jakub Libosvar
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Jakub Libosvar

Bug Description

We have a downstream failure of neutron_tempest_plugin.api.admin.test_shared_network_extension.RBACSharedNetworksTest.test_duplicate_policy_error, probably because the HTTP timeout is different on RHEL-based boxes, where it is set to 120 seconds. The reason it started failing in Stein is the recent bump in the time required to complete the DB retry mechanism: https://review.opendev.org/#/c/583527/5 That patch increases the time required to get a reply that a resource already exists to more than 160 seconds.

That is because the Neutron server retries on every exception about an existing entry raised by the DB layer:
https://opendev.org/openstack/neutron-lib/src/branch/master/neutron_lib/db/api.py#L119
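
For illustration, here is a minimal sketch of that retry behavior (the decorator name, retry count, and delay values are assumptions, not the actual neutron-lib code). The point is that an insert that can never succeed still burns through the whole back-off budget before the client sees a failure:

import time

from oslo_db import exception as db_exc


def retry_on_db_error(func, max_retries=10):
    # Sketch only: the real decorator lives in neutron_lib.db.api and
    # uses oslo_db machinery; the delay values here are made up.
    def wrapper(*args, **kwargs):
        delay = 1
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except db_exc.DBDuplicateEntry:
                if attempt == max_retries:
                    raise
                # An insert that conflicts with a key supplied by the
                # user can never succeed, yet it sleeps here every time.
                time.sleep(delay)
                delay = min(delay * 2, 20)
    return wrapper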

Steps to reproduce:
net_id=$(openstack network create rbac_net | awk '/ id /{ print $4 }')
openstack network rbac create --type network --action access_as_shared --target-project admin $net_id
openstack network rbac create --type network --action access_as_shared --target-project admin $net_id

(yes, it's the same command twice)

I don't understand which race scenario the retry mechanism on resource creation is trying to solve. However, I can think of a race scenario it introduces:

$ openstack network rbac delete 8a00a24e-182a-4e5e-8694-35e66635b581
$ openstack network delete $net_id
$ net_id=$(openstack network create rbac_net | awk '/ id /{ print $4 }')
$ rbac_id=$(openstack network rbac create --type network --action access_as_shared --target-project admin $net_id | awk '/ id /{ print $4 }')
$ openstack network rbac create --type network --action access_as_shared --target-project admin $net_id &
[1] 31383
$ sleep 10
$ openstack network rbac delete $rbac_id
$ fg
openstack network rbac create --type network --action access_as_shared --target-project admin $net_id
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| action | access_as_shared |
| id | 13c5b655-e1b5-4c72-a8af-1ed2f9ddcf89 |
| location | Munch({'cloud': '', 'region_name': 'regionOne', 'zone': None, 'project': Munch({'id': 'cdf84b19b71249ffaffde6627d06da12', 'name': 'admin', 'domain_id': None, 'domain_name': 'Default'})}) |
| name | None |
| object_id | 618108b7-d191-4459-b18a-30b7a65be005 |
| object_type | network |
| project_id | cdf84b19b71249ffaffde6627d06da12 |
| target_project_id | cdf84b19b71249ffaffde6627d06da12 |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The expected result is that the second creation of an existing resource fails and that no RBAC policy exists afterwards. Instead, the second creation succeeded, and the policy that should have been deleted exists again.
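
To make the interleaving concrete, here is a toy simulation of the race (plain Python with assumed names, not neutron code). The retry loop turns a duplicate-create failure into a successful re-create as soon as a concurrent delete removes the row:

import threading
import time

db = {"rbac_policy": True}  # the policy created by the first request
lock = threading.Lock()


def create_policy_with_retry():
    # Stands in for the server-side retry loop around the insert.
    for _ in range(10):
        with lock:
            if not db.get("rbac_policy"):
                db["rbac_policy"] = True  # a retry slips in after the delete
                return "created"
        time.sleep(1)  # duplicate entry: back off and retry
    return "failed: already exists"


worker = threading.Thread(target=lambda: print(create_policy_with_retry()))
worker.start()
time.sleep(3)
with lock:
    db["rbac_policy"] = False  # the concurrent "openstack network rbac delete"
worker.join()
# Prints "created": the policy the user just deleted exists again.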

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I dug into why we retry on the DBDuplicateEntry exception and found https://launchpad.net/neutron/+bug/1594796 with the fix https://review.opendev.org/#/c/332487/

Basically, the reason is that when the server generates entries itself, two processes may concurrently generate the same resource, and the process that catches the duplicate entry then needs to retry.
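
For contrast, here is a sketch of the case the retry was presumably added for (illustrative names only): when the server allocates the key itself, for example from a shared pool, a duplicate is transient and retrying with a freshly generated value can succeed:

import random

allocated = set()


def allocate_segment_id(pool=range(1, 4095)):
    # Illustrative only: a collision here is transient because the next
    # attempt draws a different candidate -- unlike an RBAC create,
    # whose key is fixed by the request.
    for _ in range(10):
        candidate = random.choice(pool)
        if candidate in allocated:  # stands in for DBDuplicateEntry
            continue  # retry with a new value
        allocated.add(candidate)
        return candidate
    raise RuntimeError("pool too contended")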

Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → High
Revision history for this message
Brian Haley (brian-haley) wrote :

Marking as High since this can cause a long delay (120+ seconds) in an API call that is just going to fail.

Some more info after talking to Kuba, since this happens only for some resources:

> I would say it happens with resources where you can specify a primary key in the request.

> With RBAC, you provide an existing UUID of the network, and that is the PK in the database for RBAC too.
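
A hedged sketch of that distinction in SQLAlchemy terms (an assumed schema, not the real neutron model): the uniquely constrained columns come verbatim from the request, so a duplicate request violates the same constraint on every retry:

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NetworkRBACSketch(Base):
    __tablename__ = 'network_rbac_sketch'
    id = sa.Column(sa.String(36), primary_key=True)
    # All three of these come straight from the API request, so a
    # duplicate request hits the constraint below on every retry.
    object_id = sa.Column(sa.String(36), nullable=False)
    target_project = sa.Column(sa.String(255), nullable=False)
    action = sa.Column(sa.String(255), nullable=False)
    __table_args__ = (
        sa.UniqueConstraint('object_id', 'target_project', 'action'),
    )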

Revision history for this message
Jakub Libosvar (libosvar) wrote :

It seems we've had this issue in the past: https://bugs.launchpad.net/neutron/+bug/1551473 There is code that should handle this case; however, it's not working.
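
The pattern that code was supposed to implement looks roughly like this (a sketch; the helper callable and the exception class are hypothetical): translate the DB-layer duplicate into a non-retriable neutron exception before the retry decorator can see it:

from neutron_lib import exceptions as n_exc
from oslo_db import exception as db_exc


class DuplicateRbacPolicySketch(n_exc.Conflict):
    """Hypothetical stand-in for a dedicated RBAC conflict exception."""
    message = "A duplicate RBAC policy already exists."


def create_rbac_policy(context, policy, insert_policy):
    # insert_policy is a hypothetical callable doing the raw DB insert.
    try:
        return insert_policy(context, policy)
    except db_exc.DBDuplicateEntry:
        # Re-raising as a non-retriable neutron exception keeps the
        # retry machinery from replaying a doomed insert.
        raise DuplicateRbacPolicySketch()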

Changed in neutron:
assignee: nobody → Jakub Libosvar (libosvar)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/663749

Changed in neutron:
status: Confirmed → In Progress
tags: added: rocky-backport-potential
tags: added: stein-backport-potential
tags: removed: rocky-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/663749
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26b3e6b1c4622087a2aaa542cb5ac5e477bd47b8
Submitter: Zuul
Branch: master

commit 26b3e6b1c4622087a2aaa542cb5ac5e477bd47b8
Author: Jakub Libosvar <email address hidden>
Date: Thu Jun 6 18:58:20 2019 +0000

    rbac: Catch correct exception for duplicated entry

    An RBAC network policy is uniquely identified by the network ID.
    That means that when attempting to create such a policy, we should
    not retry if the policy already exists in the database.

    Before we switched RBAC to use OVO, we translated the DB-layer
    DBDuplicateEntry on such occasions into a dedicated RBAC exception
    to avoid the DB retry mechanism (see bug/1551473). After
    introducing the OVO layer for RBAC, the exception handler was not
    updated to catch the exception coming from OVO. This patch replaces
    the DB exception with the OVO exception.

    Another patch will go to neutron-tempest-plugin to limit the time
    the API needs to reply with a failure to the user when attempting
    to create an existing policy.

    Closes-Bug: #1831647

    Change-Id: I7c65376f6fd6fc29d510ea532a684917ed95deb1
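
In code, the shape of the fix is roughly the following (a sketch with assumed helper names; only the caught OVO exception type, neutron_lib.objects.exceptions.NeutronDbObjectDuplicateEntry, is taken from the description of the change):

from neutron_lib import exceptions as n_exc
from neutron_lib.objects import exceptions as obj_exc


class DuplicateRbacPolicySketch(n_exc.Conflict):
    """Hypothetical stand-in for a dedicated RBAC conflict exception."""
    message = "A duplicate RBAC policy already exists."


def create_rbac_policy(context, rbac_policy, make_policy):
    # make_policy is a hypothetical callable doing the OVO create.
    try:
        return make_policy(context, rbac_policy)
    except obj_exc.NeutronDbObjectDuplicateEntry:
        # The OVO layer raises this instead of oslo_db's
        # DBDuplicateEntry; it is not treated as retriable, so the API
        # fails fast instead of looping for 160+ seconds.
        raise DuplicateRbacPolicySketch()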

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/664519

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/664519
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e6cab0f414b135165f7dda41c49c9b4b485e2c31
Submitter: Zuul
Branch: stable/stein

commit e6cab0f414b135165f7dda41c49c9b4b485e2c31
Author: Jakub Libosvar <email address hidden>
Date: Thu Jun 6 18:58:20 2019 +0000

    rbac: Catch correct exception for duplicated entry

    An RBAC network policy is uniquely identified by the network ID.
    That means that when attempting to create such a policy, we should
    not retry if the policy already exists in the database.

    Before we switched RBAC to use OVO, we translated the DB-layer
    DBDuplicateEntry on such occasions into a dedicated RBAC exception
    to avoid the DB retry mechanism (see bug/1551473). After
    introducing the OVO layer for RBAC, the exception handler was not
    updated to catch the exception coming from OVO. This patch replaces
    the DB exception with the OVO exception.

    Another patch will go to neutron-tempest-plugin to limit the time
    the API needs to reply with a failure to the user when attempting
    to create an existing policy.

    Closes-Bug: #1831647

    Change-Id: I7c65376f6fd6fc29d510ea532a684917ed95deb1
    (cherry picked from commit 26b3e6b1c4622087a2aaa542cb5ac5e477bd47b8)

tags: added: in-stable-stein
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.2

This issue was fixed in the openstack/neutron 14.0.2 release.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.
