Bulk creation of subports fails with StaleDataError

Bug #1828375 reported by Łukasz Deptuła
This bug affects 4 people
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Slawek Kaplonski

Bug Description

ENV:
Neutron Ocata (10.0.4)
Kolla Ansible
RHEL 7.4
One compute node (Open vSwitch) with 3 virtual machine Kubernetes minions with nested Kuryr.

REPRODUCTION:
Creating multiple containers in a very short time (in effect, creating many subports in a very short time).
Some OpenStack ports (trunk subports) never transition to ACTIVE status.

EXCEPTION:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
  res = self.dispatcher.dispatch(message)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
  return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
  result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/neutron/services/trunk/rpc/server.py", line 110, in update_trunk_status
  trunk.update(status=status)
  File "/usr/lib/python2.7/site-packages/neutron/objects/base.py", line 203, in decorator
  res = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutron/objects/trunk.py", line 127, in update
  super(Trunk, self).update()
  File "/usr/lib/python2.7/site-packages/neutron/objects/base.py", line 618, in update
  self._get_composite_keys()))
  File "/usr/lib/python2.7/site-packages/neutron/objects/db/api.py", line 80, in update_object
  db_obj.save(session=context.session)
  File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/models.py", line 50, in save
  session.flush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2027, in flush
  self._flush(objects)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2145, in _flush
  transaction.rollback(_capture_exception=True)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
  compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2109, in _flush
  flush_context.execute()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
  rec.execute(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
  uow
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj
  mapper, table, update)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 728, in _emit_update_statements
  (table.description, len(records), rows))
StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched.

ADDITIONAL INVESTIGATION:
There are in fact multiple concurrent database UPDATE calls for the trunk ports' revision_number. Trunks in the database:

MariaDB [neutron]> select * from trunks;
+----------------+----------------------------------+--------------------------------------+--------------+--------------------------------------+----------+------------------+
| admin_state_up | project_id | id | name | port_id | status | standard_attr_id |
+----------------+----------------------------------+--------------------------------------+--------------+--------------------------------------+----------+------------------+
| 1 | 9e8f068a23914209a839be451fe7533a | ce3c9271-70e9-4e4a-bdaf-bcb1a8d3901f | owc4-trunk-1 | e4392b5f-ecc7-4b94-bfe3-a2a8bc001057 | ACTIVE | 56 |
| 1 | 9e8f068a23914209a839be451fe7533a | d9242ed7-4274-4537-962e-5e40b6540cc5 | owc4-trunk-2 | fe0c1041-4ccd-4fd7-b2b9-0c78c6b8c5fa | ACTIVE | 58 |
| 1 | 9e8f068a23914209a839be451fe7533a | f50f1f1e-539b-4915-b410-d4b349c70d4b | owc4-trunk-0 | e4ff627d-37ca-447e-85cc-23347a9ba871 | ACTIVE | 49 |
+----------------+----------------------------------+--------------------------------------+--------------+--------------------------------------+----------+------------------+
3 rows in set (0.00 sec)

standardattributes for trunks:

MariaDB [neutron]> select * from standardattributes where resource_type='trunks';
+----+---------------+---------------------+---------------------+-------------+-----------------+
| id | resource_type | created_at | updated_at | description | revision_number |
+----+---------------+---------------------+---------------------+-------------+-----------------+
| 49 | trunks | 2019-04-16 08:14:10 | 2019-04-25 08:37:32 | | 710 |
| 56 | trunks | 2019-04-16 08:14:45 | 2019-04-25 08:37:32 | | 739 |
| 58 | trunks | 2019-04-16 08:15:09 | 2019-04-25 08:37:56 | | 654 |
+----+---------------+---------------------+---------------------+-------------+-----------------+
3 rows in set (0.00 sec)


Database query dump:
        726462 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:36' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        728730 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:37' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        726445 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:37' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        727770 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:38' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        727468 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:38' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        727550 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:39' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        727048 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:40' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        728694 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:40' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        728697 Query UPDATE standardattributes SET revision_number=830, updated_at='2019-04-25 11:00:40' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 829
        727765 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:41' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727766 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:43' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727470 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:43' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        728726 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:43' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727550 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:44' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        726430 Query UPDATE standardattributes SET revision_number=832, updated_at='2019-04-25 11:00:44' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 831
        728693 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:44' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        728688 Query UPDATE standardattributes SET revision_number=832, updated_at='2019-04-25 11:00:44' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 831
        728692 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:44' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727770 Query UPDATE standardattributes SET revision_number=832, updated_at='2019-04-25 11:00:45' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 831
        727051 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:46' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727469 Query UPDATE standardattributes SET revision_number=832, updated_at='2019-04-25 11:00:47' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 831
        728732 Query UPDATE standardattributes SET revision_number=831, updated_at='2019-04-25 11:00:47' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 830
        727767 Query UPDATE standardattributes SET revision_number=833, updated_at='2019-04-25 11:00:49' WHERE standardattributes.id = 49 AND standardattributes.revision_number = 832

I also tried to queue port update requests in the agent/server code by adding locks around the following methods:
1. def update_subport_bindings(self, context, subports): in file neutron/services/trunk/rpc/server.py
2. def update_trunk_status(self, context, trunk_id, status): in file neutron/services/trunk/rpc/server.py
3. def add_subports(self, context, trunk_id, subports): in file neutron/services/trunk/plugin.py
4. def handle_subports(self, context, resource_type, subports, event_type): in file neutron/services/trunk/drivers/openvswitch/agent/driver.py

Those modifications made subport creation work in a more stable way.
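
For illustration, a lock of the kind described above might look roughly like the sketch below (a hypothetical example using oslo_concurrency; the method body and object lookup are assumptions, not the reporter's actual modification and not the fix merged upstream; it is meant to slot into Neutron's trunk RPC server module):

# Hypothetical sketch only: serialize the trunk status update so that one
# worker thread at a time performs the compare-and-swap on
# standardattributes.revision_number.
from oslo_concurrency import lockutils

from neutron.objects import trunk as trunk_objects


class TrunkSkeleton(object):

    @lockutils.synchronized('trunk-status-update')
    def update_trunk_status(self, context, trunk_id, status):
        # Note: without external=True this lock only serializes calls
        # within a single neutron-server process, not across workers.
        trunk = trunk_objects.Trunk.get_object(context, id=trunk_id)
        if trunk:
            trunk.update(status=status)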

Interestingly, restarting the agent brings all ports up.

Revision history for this message
Łukasz Deptuła (l.deptula) wrote :

Additionally, I found that these bugs may be related (or possibly duplicates):
* https://bugs.launchpad.net/neutron/+bug/1805132
* https://bugs.launchpad.net/neutron/+bug/1738337
* https://bugs.launchpad.net/neutron/+bug/1716321

and I reached conclusions similar to Ryan Tidwell's, as he pointed out here https://bugs.launchpad.net/neutron/+bug/1805132/comments/3 :
"It does make me wonder whether there is an underlying issue related to the handling updates to standard attributes that affects operations on all resources like networks, subnets, and ports."

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Could you confirm that the same issue happens on master as well, or can you provide reproduction steps without containers?
I suppose something like this:
- create port as parent port
- create trunk --parent-port <parent_port_id>
- put trunk/add_subport

I tried to reproduce on master with curl, with 12 subports (perhaps that's not enough?), without success.

Revision history for this message
Łukasz Deptuła (l.deptula) wrote :

Unfortunately I am not able to use master...
Reproduction without containers:
1. port create --network NetworkA --name portA
2. server create --nic port-id=portA
3. create trunk --parent-port=portA --name TrunkA
4. 40x port create --network NetworkB (subports)
5. 40x network trunk set --subport eachSubPort TrunkA
6. Wait for all subports to become ACTIVE.

One thing I noticed is that Kuryr puts very heavy load on the API (a lot of read requests to verify whether ports are up). I'm not sure how to formulate the 6th step, but my observation is that the more stress there is on the controller (DB, API), the more ports fail to come up. Out of 40 ports created, in my case around 4-5 stay in the DOWN state.

Changed in neutron:
importance: Undecided → Low
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Just a note: could you check this page: https://wiki.openstack.org/wiki/Neutron_Trunk_API_Performance_and_Scaling
Perhaps you can find something useful there.

Changed in neutron:
importance: Low → Medium
importance: Medium → Low
Revision history for this message
Bence Romsics (bence-romsics) wrote :

I managed to reproduce the problem on master (commit dfc2586fb1) with a script like this:

#! /bin/bash

source ~/src/os/openstack/devstack/openrc admin admin

set -x

openstack server list -f value -c ID -c Name | awk '/ xvm/ { print $1 }' | xargs -r openstack server delete --wait
openstack network trunk list -f value -c ID -c Name | awk '/ xtrunk/ { print $1 }' | xargs -r openstack network trunk delete
openstack port list -f value -c ID -c Name | awk '/ xport/ { print $1 }' | xargs -r openstack port delete
openstack network list -f value -c ID -c Name | awk '/ xnet/ { print $1 }' | xargs -r openstack network delete

max=40

for i in $( seq 0 "$max" )
do
    openstack network create "xnet$i"
    openstack subnet create "xsubnet$i" --network "xnet$i" --subnet-range 10.0.4.0/24
    openstack port create "xport$i" --network "xnet$i"
done

openstack network trunk create xtrunk0 --parent-port xport0
openstack server create xvm0 --flavor cirros256 --image cirros-0.4.0-x86_64-disk --nic port-id=xport0 --wait

for i in $( seq 1 "$max" )
do
    openstack network trunk set xtrunk0 --subport port="xport$i",segmentation-type=vlan,segmentation-id="$i"
done

Out of two runs, the first left 3 ports in the DOWN state and the second left 5. StaleDataErrors appear in the neutron-server logs.

Changed in neutron:
status: New → Confirmed
Revision history for this message
Bence Romsics (bence-romsics) wrote :

I'm starting to understand what's going on in this error. Here's a summary:

Each subport's binding triggers updates to the trunk's status (first setting it to BUILD and then moving it to some success or error state depending on the outcome of the subport binding).

Therefore we have concurrent updates to the trunk itself (not only to its status but also to its standardattributes).

MySQL's default isolation level (for InnoDB tables) allows two concurrent transactions like the one below to interfere with each other:

BEGIN
SELECT ...
UPDATE ... (based on criteria read in the previous select)
COMMIT

It is possible that one of the two UPDATEs will fail with StaleDataError because it can see the results of the other ongoing transaction.

The standardattributes table uses SQLAlchemy's versioning feature:

https://docs.sqlalchemy.org/en/13/orm/versioning.html
https://opendev.org/openstack/neutron/src/commit/22638b82b602a29f56704b89e06f67aa18f6d3be/neutron/db/standard_attr.py#L70

Therefore every single UPDATE we issue actually looks something like this:
UPDATE standardattributes SET revision_number=111, ... WHERE revision_number=110 AND ...

This greatly increases the chance that concurrent transactions will actually interfere with each other.
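
To make the versioning mechanics concrete, here is a minimal standalone sketch (an editorial illustration assuming plain SQLAlchemy 1.4+ and in-memory SQLite, nothing Neutron-specific; the table is a stripped-down stand-in for standardattributes) showing how version_id_col turns an ordinary UPDATE into a compare-and-swap that raises StaleDataError once the stored revision has moved on:

from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()


class StandardAttribute(Base):
    __tablename__ = 'standardattributes'
    id = Column(Integer, primary_key=True)
    description = Column(String(255))
    revision_number = Column(Integer, nullable=False)
    # Every ORM UPDATE becomes:
    #   UPDATE standardattributes SET revision_number=:new, ...
    #   WHERE id = :id AND revision_number = :loaded
    __mapper_args__ = {'version_id_col': revision_number}


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(StandardAttribute(id=49))
session.commit()

attr = session.get(StandardAttribute, 49)   # revision_number is 1 after INSERT

# Simulate a concurrent writer bumping the revision behind the ORM's back.
session.execute(text(
    "UPDATE standardattributes SET revision_number = 5 WHERE id = 49"))

attr.description = 'updated by the losing writer'
try:
    session.commit()  # WHERE revision_number = 1 no longer matches any row
except StaleDataError as err:
    print(err)        # "... expected to update 1 row(s); 0 were matched."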

I have a patch that proves the problem can be solved by retrying updates:

diff --git a/neutron/services/trunk/rpc/server.py b/neutron/services/trunk/rpc/server.py
index 97de43c..d03d865 100644
--- a/neutron/services/trunk/rpc/server.py
+++ b/neutron/services/trunk/rpc/server.py
@@ -22,6 +22,7 @@ from neutron_lib.services.trunk import constants as trunk_consts
 from oslo_log import helpers as log_helpers
 from oslo_log import log as logging
 import oslo_messaging
+from sqlalchemy.orm import exc

 from neutron.api.rpc.callbacks import events
 from neutron.api.rpc.callbacks.producer import registry
@@ -115,11 +116,16 @@ class TrunkSkeleton(object):
         trunk_port = self.core_plugin.get_port(context, trunk_port_id)
         trunk_host = trunk_port.get(portbindings.HOST_ID)

-        # NOTE(status_police) Set the trunk in BUILD state before processing
-        # subport bindings. The trunk will stay in BUILD state until an
-        # attempt has been made to bind all subports passed here and the
-        # agent acknowledges the operation was successful.
-        trunk.update(status=trunk_consts.TRUNK_BUILD_STATUS)
+        for try_cnt in range(10):
+            try:
+                # NOTE(status_police) Set the trunk in BUILD state before processing
+                # subport bindings. The trunk will stay in BUILD state until an
+                # attempt has been made to bind all subports passed here and the
+                # agent acknowledges the operation was successful.
+                trunk.update(status=trunk_consts.TRUNK_BUILD_STATUS)
+                break
+            except exc.StaleDataError:
+                continue

         for port_id in port_ids:
             try:

However, I did not upload this to Gerrit because I hope to find a less ugly solution, for example a fix that's generic enough to solve all standardattributes-related StaleDataErrors.
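
As a rough illustration of one such generic direction (purely an editorial sketch assuming a plain retry loop is acceptable, not a proposed Neutron patch), a decorator could retry any callable that loses the revision-number race:

# Sketch of a generic retry helper for StaleDataError; the decorated
# function is expected to re-read current state on every call.
import functools
import time

from sqlalchemy.orm import exc as orm_exc


def retry_on_stale_data(max_attempts=10, delay=0.1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except orm_exc.StaleDataError:
                    if attempt == max_attempts:
                        raise
                    # Another writer bumped revision_number first; back off
                    # briefly and try again.
                    time.sleep(delay)
        return wrapper
    return decorator


# Hypothetical usage:
# @retry_on_stale_data()
# def _set_trunk_status(trunk, status):
#     trunk.update(status=status)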

Changed in neutron:
status: Confirmed → Triaged
Revision history for this message
Bence Romsics (bence-romsics) wrote :

The MySQL part of this problem can be reproduced with the following little standalone program by running two instances of it concurrently (after a little setup documented at its beginning):

http://paste.openstack.org/show/752035/
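
In case the paste above becomes unreachable, here is a hedged sketch of a program of that kind (the table, credentials, and connection URL are assumptions, not the contents of the paste); run two copies concurrently against the same MySQL/InnoDB database:

# Setup (once), e.g. in the mysql client:
#   CREATE DATABASE staletest;
#   CREATE TABLE staletest.t (id INT PRIMARY KEY, revision_number INT NOT NULL);
#   INSERT INTO staletest.t VALUES (1, 0);
# Then start two instances of this script at the same time.
import sqlalchemy as sa

engine = sa.create_engine('mysql+pymysql://root:secret@127.0.0.1/staletest')

lost = 0
for _ in range(1000):
    with engine.begin() as conn:            # BEGIN ... COMMIT
        rev = conn.execute(
            sa.text("SELECT revision_number FROM t WHERE id = 1")).scalar()
        result = conn.execute(
            sa.text("UPDATE t SET revision_number = :new "
                    "WHERE id = 1 AND revision_number = :old"),
            {"new": rev + 1, "old": rev})
        if result.rowcount == 0:
            # The other instance committed a newer revision between our
            # SELECT and UPDATE; the ORM reports this as StaleDataError.
            lost += 1
print('updates that would have raised StaleDataError:', lost)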

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/662236

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/662236
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=618e24e241241c7323702a815542f11e91fd32a1
Submitter: Zuul
Branch: master

commit 618e24e241241c7323702a815542f11e91fd32a1
Author: Bence Romsics <email address hidden>
Date: Wed May 22 16:42:34 2019 +0200

    Retry trunk status updates failing with StaleDataError

    This is an approximate partial fix to #1828375.

    update_trunk_status and update_subport_bindings rpc messages are
    processed concurrently and possibly out of order on the server side.
    Therefore they may race with each other.

    The status update race combined with
    1) the versioning feature of sqlalchemy used in the standardattributes
       table and
    2) the less than serializable isolation level of some DB backends (like
       MySQL InnoDB)
    does raise StaleDataErrors and by that leaves some trunk subports in
    DOWN status.

    This change retries the trunk status update (to BUILD) blindly when
    StaleDataError was caught. In my local testbed this practically
    fixes #1828375.

    However theoretically the retry may cover up other real errors (when the
    cause of the StaleDataError was a different status not just a different
    revision count).

    To the best of my understanding a proper fix would entail guaranteeing
    the in order processing of the above rpc messages - which likely won't
    ever happen.

    I'm not sure at all if this change is worth merging - let me know what
    you think.

    Change-Id: Ie581809f24f9547b55a87423dac7db933862d66a
    Partial-Bug: #1828375

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/673696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/673744

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/673745

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/673696
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d090fb9a3cbad6c4c3f5c524907a94cb5f95b205
Submitter: Zuul
Branch: stable/stein

commit d090fb9a3cbad6c4c3f5c524907a94cb5f95b205
Author: Bence Romsics <email address hidden>
Date: Wed May 22 16:42:34 2019 +0200

    Retry trunk status updates failing with StaleDataError

    This is an approximate partial fix to #1828375.

    update_trunk_status and update_subport_bindings rpc messages are
    processed concurrently and possibly out of order on the server side.
    Therefore they may race with each other.

    The status update race combined with
    1) the versioning feature of sqlalchemy used in the standardattributes
       table and
    2) the less than serializable isolation level of some DB backends (like
       MySQL InnoDB)
    does raise StaleDataErrors and by that leaves some trunk subports in
    DOWN status.

    This change retries the trunk status update (to BUILD) blindly when
    StaleDataError was caught. In my local testbed this practically
    fixes #1828375.

    However theoretically the retry may cover up other real errors (when the
    cause of the StaleDataError was a different status not just a different
    revision count).

    To the best of my understanding a proper fix would entail guaranteeing
    the in order processing of the above rpc messages - which likely won't
    ever happen.

    I'm not sure at all if this change is worth merging - let me know what
    you think.

    Conflicts:
        neutron/services/trunk/rpc/server.py
        neutron/tests/unit/services/trunk/rpc/test_server.py

    Change-Id: Ie581809f24f9547b55a87423dac7db933862d66a
    Partial-Bug: #1828375
    (cherry picked from commit 618e24e241241c7323702a815542f11e91fd32a1)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/673745
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=10191fd81706f302f2114c8a196d5e0a0cd1b5db
Submitter: Zuul
Branch: stable/queens

commit 10191fd81706f302f2114c8a196d5e0a0cd1b5db
Author: Bence Romsics <email address hidden>
Date: Wed May 22 16:42:34 2019 +0200

    Retry trunk status updates failing with StaleDataError

    This is an approximate partial fix to #1828375.

    update_trunk_status and update_subport_bindings rpc messages are
    processed concurrently and possibly out of order on the server side.
    Therefore they may race with each other.

    The status update race combined with
    1) the versioning feature of sqlalchemy used in the standardattributes
       table and
    2) the less than serializable isolation level of some DB backends (like
       MySQL InnoDB)
    does raise StaleDataErrors and by that leaves some trunk subports in
    DOWN status.

    This change retries the trunk status update (to BUILD) blindly when
    StaleDataError was caught. In my local testbed this practically
    fixes #1828375.

    However theoretically the retry may cover up other real errors (when the
    cause of the StaleDataError was a different status not just a different
    revision count).

    To the best of my understanding a proper fix would entail guaranteeing
    the in order processing of the above rpc messages - which likely won't
    ever happen.

    I'm not sure at all if this change is worth merging - let me know what
    you think.

    Conflicts:
        neutron/services/trunk/rpc/server.py
        neutron/tests/unit/services/trunk/rpc/test_server.py

    Change-Id: Ie581809f24f9547b55a87423dac7db933862d66a
    Partial-Bug: #1828375
    (cherry picked from commit 618e24e241241c7323702a815542f11e91fd32a1)
    (cherry picked from commit d090fb9a3cbad6c4c3f5c524907a94cb5f95b205)

tags: added: in-stable-queens
tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/673744
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=80389bb4ee8a9cd52401187b278e339c69d9f295
Submitter: Zuul
Branch: stable/rocky

commit 80389bb4ee8a9cd52401187b278e339c69d9f295
Author: Bence Romsics <email address hidden>
Date: Wed May 22 16:42:34 2019 +0200

    Retry trunk status updates failing with StaleDataError

    This is an approximate partial fix to #1828375.

    update_trunk_status and update_subport_bindings rpc messages are
    processed concurrently and possibly out of order on the server side.
    Therefore they may race with each other.

    The status update race combined with
    1) the versioning feature of sqlalchemy used in the standardattributes
       table and
    2) the less than serializable isolation level of some DB backends (like
       MySQL InnoDB)
    does raise StaleDataErrors and by that leaves some trunk subports in
    DOWN status.

    This change retries the trunk status update (to BUILD) blindly when
    StaleDataError was caught. In my local testbed this practically
    fixes #1828375.

    However theoretically the retry may cover up other real errors (when the
    cause of the StaleDataError was a different status not just a different
    revision count).

    To the best of my understanding a proper fix would entail guaranteeing
    the in order processing of the above rpc messages - which likely won't
    ever happen.

    I'm not sure at all if this change is worth merging - let me know what
    you think.

    Conflicts:
        neutron/services/trunk/rpc/server.py
        neutron/tests/unit/services/trunk/rpc/test_server.py

    Change-Id: Ie581809f24f9547b55a87423dac7db933862d66a
    Partial-Bug: #1828375
    (cherry picked from commit 618e24e241241c7323702a815542f11e91fd32a1)
    (cherry picked from commit d090fb9a3cbad6c4c3f5c524907a94cb5f95b205)

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

We were able to apply the fix and it seems to help. Thanks folks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/679053

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/679053
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d1f8888843c53c1c4b95c2fd22b2b6ee97feb238
Submitter: Zuul
Branch: master

commit d1f8888843c53c1c4b95c2fd22b2b6ee97feb238
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 28 13:06:33 2019 +0000

    Increase number of retries in _process_trunk_subport_bindings

    In patch [1] as partial fix for bug 1828375 retries mechanism
    was proposed.
    We noticed that sometimes in heavily loaded environments 3 retries
    defined in [1] can be not enough.
    So this patch switches to use neutron_lib.db.api.MAX_RETRIES constant
    as number of retries when processing trunk subport bindings.
    This MAX_RETRIES constant is set to 20 and in our cases it "fixed"
    problem.

    [1] https://review.opendev.org/#/c/662236/

    Change-Id: I016ef3d7ccbb89b68d4a3d509162b3046a9c2f98
    Related-Bug: #1828375

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/679197

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/679198

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/679200

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/679197
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c09e8295806a5f6932b7a2a3795aac008d6e3480
Submitter: Zuul
Branch: stable/stein

commit c09e8295806a5f6932b7a2a3795aac008d6e3480
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 28 13:06:33 2019 +0000

    Increase number of retries in _process_trunk_subport_bindings

    In patch [1] as partial fix for bug 1828375 retries mechanism
    was proposed.
    We noticed that sometimes in heavily loaded environments 3 retries
    defined in [1] can be not enough.
    So this patch switches to use neutron_lib.db.api.MAX_RETRIES constant
    as number of retries when processing trunk subport bindings.
    This MAX_RETRIES constant is set to 20 and in our cases it "fixed"
    problem.

    [1] https://review.opendev.org/#/c/662236/

    Change-Id: I016ef3d7ccbb89b68d4a3d509162b3046a9c2f98
    Related-Bug: #1828375
    (cherry picked from commit d1f8888843c53c1c4b95c2fd22b2b6ee97feb238)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/679198
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1199207e7da977e1cd3c12f8a4242aafe6330d68
Submitter: Zuul
Branch: stable/rocky

commit 1199207e7da977e1cd3c12f8a4242aafe6330d68
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 28 13:06:33 2019 +0000

    Increase number of retries in _process_trunk_subport_bindings

    In patch [1] as partial fix for bug 1828375 retries mechanism
    was proposed.
    We noticed that sometimes in heavily loaded environments 3 retries
    defined in [1] can be not enough.
    So this patch switches to use neutron_lib.db.api.MAX_RETRIES constant
    as number of retries when processing trunk subport bindings.
    This MAX_RETRIES constant is set to 20 and in our cases it "fixed"
    problem.

    [1] https://review.opendev.org/#/c/662236/

    Change-Id: I016ef3d7ccbb89b68d4a3d509162b3046a9c2f98
    Related-Bug: #1828375
    (cherry picked from commit d1f8888843c53c1c4b95c2fd22b2b6ee97feb238)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/679200
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8167fb944ecc9d75b731c166b7fd4cf1bb97608c
Submitter: Zuul
Branch: stable/queens

commit 8167fb944ecc9d75b731c166b7fd4cf1bb97608c
Author: Slawek Kaplonski <email address hidden>
Date: Wed Aug 28 13:06:33 2019 +0000

    Increase number of retries in _process_trunk_subport_bindings

    In patch [1] as partial fix for bug 1828375 retries mechanism
    was proposed.
    We noticed that sometimes in heavily loaded environments 3 retries
    defined in [1] can be not enough.
    So this patch switches to use neutron_lib.db.api.MAX_RETRIES constant
    as number of retries when processing trunk subport bindings.
    This MAX_RETRIES constant is set to 20 and in our cases it "fixed"
    problem.

    [1] https://review.opendev.org/#/c/662236/

    Conflicts:
        neutron/tests/unit/services/trunk/rpc/test_server.py

    Change-Id: I016ef3d7ccbb89b68d4a3d509162b3046a9c2f98
    Related-Bug: #1828375
    (cherry picked from commit d1f8888843c53c1c4b95c2fd22b2b6ee97feb238)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/698698

Changed in neutron:
assignee: Bence Romsics (bence-romsics) → Slawek Kaplonski (slaweq)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/698698
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ade35a233edb5c9489cc3a68ae00672fb328f63d
Submitter: Zuul
Branch: master

commit ade35a233edb5c9489cc3a68ae00672fb328f63d
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 12 12:28:22 2019 +0100

    Add retries to update trunk port

    In [1] retry of trunk update was added to avoid StaleDataError
    exceptions to fail to set trunk port or subports to ACTIVE state.
    But it was only partial fix for the issue described in related bug
    and from [2] we know that it still can happen on high load systems
    from time to time.
    So I was checking this issue and reported bug again and I found out
    that retry was added only in _process_trunk_subport_bindings()
    method. But StaleDataError can be raised also in other cases where
    the same trunk is updated, e.g. in update_trunk_status() method.

    So this commit adds same retry mechanism to all trunk.update() actions
    in services.trunk.rpc.server module.

    [1] https://review.opendev.org/#/c/662236/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1733197

    Change-Id: I10e3619d5f3600ea97ed695321bb691dece3181f
    Partial-Bug: #1828375

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/702363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/702364

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/702365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/702366

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/702363
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8ec0c2d8658acd6fb0aaf80751914514bdb42334
Submitter: Zuul
Branch: stable/train

commit 8ec0c2d8658acd6fb0aaf80751914514bdb42334
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 12 12:28:22 2019 +0100

    Add retries to update trunk port

    In [1] retry of trunk update was added to avoid StaleDataError
    exceptions to fail to set trunk port or subports to ACTIVE state.
    But it was only partial fix for the issue described in related bug
    and from [2] we know that it still can happen on high load systems
    from time to time.
    So I was checking this issue and reported bug again and I found out
    that retry was added only in _process_trunk_subport_bindings()
    method. But StaleDataError can be raised also in other cases where
    the same trunk is updated, e.g. in update_trunk_status() method.

    So this commit adds same retry mechanism to all trunk.update() actions
    in services.trunk.rpc.server module.

    [1] https://review.opendev.org/#/c/662236/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1733197

    Change-Id: I10e3619d5f3600ea97ed695321bb691dece3181f
    Partial-Bug: #1828375
    (cherry picked from commit ade35a233edb5c9489cc3a68ae00672fb328f63d)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/702364
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a8b59624c87b6b74f28e98a1cff1d6a40edc65f
Submitter: Zuul
Branch: stable/stein

commit 7a8b59624c87b6b74f28e98a1cff1d6a40edc65f
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 12 12:28:22 2019 +0100

    Add retries to update trunk port

    In [1] retry of trunk update was added to avoid StaleDataError
    exceptions to fail to set trunk port or subports to ACTIVE state.
    But it was only partial fix for the issue described in related bug
    and from [2] we know that it still can happen on high load systems
    from time to time.
    So I was checking this issue and reported bug again and I found out
    that retry was added only in _process_trunk_subport_bindings()
    method. But StaleDataError can be raised also in other cases where
    the same trunk is updated, e.g. in update_trunk_status() method.

    So this commit adds same retry mechanism to all trunk.update() actions
    in services.trunk.rpc.server module.

    [1] https://review.opendev.org/#/c/662236/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1733197

    Conflicts:
        neutron/services/trunk/rpc/server.py

    Change-Id: I10e3619d5f3600ea97ed695321bb691dece3181f
    Partial-Bug: #1828375
    (cherry picked from commit ade35a233edb5c9489cc3a68ae00672fb328f63d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/702366
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a1e79e90b09e5d39d65c4f8da49fcf3af7e2c63
Submitter: Zuul
Branch: stable/queens

commit 7a1e79e90b09e5d39d65c4f8da49fcf3af7e2c63
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 12 12:28:22 2019 +0100

    Add retries to update trunk port

    In [1] retry of trunk update was added to avoid StaleDataError
    exceptions to fail to set trunk port or subports to ACTIVE state.
    But it was only partial fix for the issue described in related bug
    and from [2] we know that it still can happen on high load systems
    from time to time.
    So I was checking this issue and reported bug again and I found out
    that retry was added only in _process_trunk_subport_bindings()
    method. But StaleDataError can be raised also in other cases where
    the same trunk is updated, e.g. in update_trunk_status() method.

    So this commit adds same retry mechanism to all trunk.update() actions
    in services.trunk.rpc.server module.

    [1] https://review.opendev.org/#/c/662236/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1733197

    Conflicts:
        neutron/services/trunk/rpc/server.py

    Change-Id: I10e3619d5f3600ea97ed695321bb691dece3181f
    Partial-Bug: #1828375
    (cherry picked from commit ade35a233edb5c9489cc3a68ae00672fb328f63d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/702365
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b51f231c9d912eb2082deb378a2671c10ae43564
Submitter: Zuul
Branch: stable/rocky

commit b51f231c9d912eb2082deb378a2671c10ae43564
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 12 12:28:22 2019 +0100

    Add retries to update trunk port

    In [1] retry of trunk update was added to avoid StaleDataError
    exceptions to fail to set trunk port or subports to ACTIVE state.
    But it was only partial fix for the issue described in related bug
    and from [2] we know that it still can happen on high load systems
    from time to time.
    So I was checking this issue and reported bug again and I found out
    that retry was added only in _process_trunk_subport_bindings()
    method. But StaleDataError can be raised also in other cases where
    the same trunk is updated, e.g. in update_trunk_status() method.

    So this commit adds same retry mechanism to all trunk.update() actions
    in services.trunk.rpc.server module.

    [1] https://review.opendev.org/#/c/662236/
    [2] https://bugzilla.redhat.com/show_bug.cgi?id=1733197

    Conflicts:
        neutron/services/trunk/rpc/server.py

    Change-Id: I10e3619d5f3600ea97ed695321bb691dece3181f
    Partial-Bug: #1828375
    (cherry picked from commit ade35a233edb5c9489cc3a68ae00672fb328f63d)

Revision history for this message
Brian Haley (brian-haley) wrote :

I'll call this fixed with these changes merged; please re-open if necessary.

Changed in neutron:
status: In Progress → Fix Released