race happens in api_nb

Bug #1529812 reported by Li Ma on 2015-12-29
This bug affects 1 person
Affects: DragonFlow
Importance: High
Assigned to: Li Ma

Bug Description

Code: https://github.com/openstack/dragonflow/blob/master/dragonflow/db/api_nb.py#L206-L220

If other operations are performed concurrently, the data associated with the key is unpredictable. We need general transaction operations provided by the db_api layer to handle synchronization,
similar to the transactions provided by SQLAlchemy.

It is also worth noting that almost all of the operations provided by api_nb have this potential race, because they follow the same pattern:

(1) read the key
(2) do some calculation, like adding some json data into previous value
(3) overwrite the key

If a concurrent write happens during step (2), it is overwritten at step (3) and the concurrently written data is definitely lost.
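The lost update can be reproduced in miniature with an in-memory key/value store and two threads. This is only an illustrative sketch: the store, key names, and thread orchestration are stand-ins, not Dragonflow code.

```python
import json
import threading

# Toy in-memory stand-in for the key/value store (illustrative only).
store = {"lswitch1": json.dumps({"subnets": []})}

# Events force the problematic interleaving deterministically.
a_read = threading.Event()
b_done = threading.Event()

def writer_a():
    value = json.loads(store["lswitch1"])   # (1) read the key
    a_read.set()
    b_done.wait()                           # B writes while A "calculates"
    value["subnets"].append({"id": "subnet-a"})   # (2) modify stale copy
    store["lswitch1"] = json.dumps(value)   # (3) overwrite: clobbers B's write

def writer_b():
    a_read.wait()
    value = json.loads(store["lswitch1"])
    value["subnets"].append({"id": "subnet-b"})
    store["lswitch1"] = json.dumps(value)
    b_done.set()

ta = threading.Thread(target=writer_a)
tb = threading.Thread(target=writer_b)
ta.start(); tb.start(); ta.join(); tb.join()

ids = [s["id"] for s in json.loads(store["lswitch1"])["subnets"]]
print(ids)  # ['subnet-a'] -- subnet-b is lost
```

The events pin down the exact interleaving that only happens occasionally in practice: B's write lands between A's read and A's write, so A's overwrite discards it.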

Li Ma (nick-ma-z) on 2015-12-29
Changed in dragonflow:
assignee: nobody → Li Ma (nick-ma-z)
Li Ma (nick-ma-z) on 2015-12-29
summary: - race happens in add_subnet of api_nb
+ race happens in api_nb
Li Ma (nick-ma-z) on 2015-12-29
description: updated
Li Ma (nick-ma-z) wrote :

I tested this situation with the etcd driver and the problem described above really happened. If a concurrent write to the same key occurs while an operation is in progress, the value of that key is corrupted.

(1) Solution 1
I suggest defining a set of transaction interfaces at the db-api layer and enforcing them across all db-backend drivers. Then rewrite nb-api to run every operation inside a transaction.

Pros: it is straightforward and does not introduce any other mechanism.
Cons: it requires modifying the db-api layer and the nb-api layer, along with all the db drivers.
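As a rough illustration of Solution 1, here is a sketch of what such a db-api transaction interface might look like. The names (DbApi, transaction, DictDriver) are assumptions for illustration, not the actual Dragonflow API.

```python
import abc
import contextlib

class DbApi(abc.ABC):
    """Hypothetical db-api base class with a transaction hook."""

    @abc.abstractmethod
    def get_key(self, table, key):
        ...

    @abc.abstractmethod
    def set_key(self, table, key, value):
        ...

    @contextlib.contextmanager
    def transaction(self):
        # Each backend driver would override this to provide real
        # atomicity (e.g. an etcd transaction or a SQL transaction).
        yield  # default: no isolation

class DictDriver(DbApi):
    """Toy in-memory driver for illustration."""

    def __init__(self):
        self._data = {}

    def get_key(self, table, key):
        return self._data.get((table, key))

    def set_key(self, table, key, value):
        self._data[(table, key)] = value

driver = DictDriver()
with driver.transaction():
    driver.set_key("lswitch", "sw1", "{}")
print(driver.get_key("lswitch", "sw1"))  # {}
```

The context-manager shape is what lets nb-api wrap its read-modify-write cycles without knowing which backend is underneath.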

(2) Solution 2
Introduce a DLM (distributed lock manager) for the nb-api layer. A distributed lock solves exactly this kind of synchronization problem. The only issue is that a DLM also needs a distributed coordination backend, like etcd or ZooKeeper, so we would need to share the same backend for these two distinct purposes.

Pros: it moves the responsibility for synchronization from db-api to an external system, so there is no need to modify db-api or the db drivers.
Cons: it introduces an external system to implement synchronization.
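As a rough in-process illustration of Solution 2, the sketch below wraps the whole read-modify-write cycle in a lock. A threading.Lock stands in for a real distributed lock, which in practice would be backed by etcd or ZooKeeper; the store and names are illustrative.

```python
import json
import threading

store = {"lswitch1": json.dumps({"subnets": []})}
# In-process stand-in for a distributed lock; a real DLM would be
# backed by a coordination service such as etcd or ZooKeeper.
lswitch_lock = threading.Lock()

def add_subnet(subnet_id):
    # The whole read-modify-write cycle runs under the lock, so
    # concurrent writers cannot interleave and no update is lost.
    with lswitch_lock:
        value = json.loads(store["lswitch1"])
        value["subnets"].append({"id": subnet_id})
        store["lswitch1"] = json.dumps(value)

threads = [threading.Thread(target=add_subnet, args=("subnet-%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(json.loads(store["lswitch1"])["subnets"]))  # 5
```

All five concurrent writes survive because the lock serializes the full cycle, not just the individual reads and writes.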

Li Ma (nick-ma-z) wrote :

(3) Solution 3
Introduce a compare-and-swap atomic operation in db-api and implement it in all the db drivers.
In nb-api:

retries = 0
while retries < MAXIMUM_RETRIES:
    old_value = driver.get_key(key)
    new_value = ...  # apply the operation to old_value

    try:
        driver.compare_and_set_key(key, new_value, old_value)
        break  # success, stop retrying
    except ValueVersionError:
        # the value changed underneath us; re-read and retry
        retries += 1
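To make the idea concrete, here is a toy in-memory driver implementing the proposed compare_and_set_key together with the retry loop. All names are hypothetical, following the sketch above; a real driver would delegate the comparison to the backend (e.g. etcd's atomic compare-and-swap).

```python
import threading

class ValueVersionError(Exception):
    """Raised when the stored value no longer matches the expected one."""

class CasDriver:
    """Toy in-memory driver exposing a compare-and-swap primitive."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()  # models server-side atomicity

    def get_key(self, key):
        return self._data.get(key)

    def compare_and_set_key(self, key, new_value, old_value):
        with self._mutex:
            if self._data.get(key) != old_value:
                raise ValueVersionError(key)
            self._data[key] = new_value

MAXIMUM_RETRIES = 10

def atomic_append(driver, key, item):
    for _ in range(MAXIMUM_RETRIES):
        old_value = driver.get_key(key)       # (1) read
        new_value = (old_value or "") + item  # (2) compute
        try:
            driver.compare_and_set_key(key, new_value, old_value)  # (3) CAS
            return True
        except ValueVersionError:
            continue  # someone else wrote in between; re-read and retry
    return False

driver = CasDriver()
threads = [threading.Thread(target=atomic_append, args=(driver, "k", c))
           for c in "abcde"]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(driver.get_key("k")))  # ['a', 'b', 'c', 'd', 'e']
```

Every failed CAS implies some other writer succeeded, so with five writers each thread can fail at most four times before its own CAS goes through; no write is lost.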

Li Ma (nick-ma-z) on 2015-12-31
Changed in dragonflow:
importance: Undecided → Medium
Gal Sagie (gal-sagie) wrote :

Hi Li Ma,

I agree with you that there is a potential race, but I think we first need to understand the exact scenario in which this happens, because if it does, it seems problematic from the Neutron standpoint as well, or it may be wrong modeling of our DB.

Compare-and-swap is probably the better solution (Solution 3), but can you please share the exact scenario in which you saw this happen?

I am also raising the priority of this to High, because if there is a race, it is problematic.

Changed in dragonflow:
importance: Medium → High
Li Ma (nick-ma-z) wrote :

1. I modified the api_nb update_subnet function to simulate a delay:

def update_subnet(self, id, lswitch_name, **columns):
    lswitch_json = self.driver.get_key('lswitch', lswitch_name)

    import time
    time.sleep(1)  # simulates a delay for testing

    lswitch = jsonutils.loads(lswitch_json)
    subnet = None
    for s in lswitch.get('subnets', []):
        if s['id'] == id:
            subnet = s

    for col, val in columns.items():
        subnet[col] = val

    lswitch_json = jsonutils.dumps(lswitch)
    self.driver.set_key('lswitch', lswitch_name, lswitch_json)

2. run a multi-worker neutron-server, the default configuration in devstack

3. neutron subnet-update --name 111 {SUBNET_ID}
     neutron subnet-create {IP_CIDR} {NETWORK_ID}

4. The created subnet is recorded in the Neutron DB, but lost in etcd.

I'll attach a screenshot for demonstration.

Li Ma (nick-ma-z) wrote :

The scenario happens randomly with bulk operations on subnets, especially in a multi-process & multi-worker environment, but it cannot be reproduced every time. So I used a trick in the code to demonstrate it.

Li Ma (nick-ma-z) wrote :

multi-process --> multi-nodes

Yuli (stremovsky) wrote :

This also seems related to DB consistency.

Reviewed: https://review.openstack.org/282290
Committed: https://git.openstack.org/cgit/openstack/dragonflow/commit/?id=3b92dc8eac33ad96b6cd40ad80427c9951444d99
Submitter: Jenkins
Branch: master

commit 3b92dc8eac33ad96b6cd40ad80427c9951444d99
Author: Li Ma <email address hidden>
Date: Thu Feb 18 10:30:42 2016 +0800

    Implement DB consistency

    It is a distributed lock based on SQL. Each lock is
    associated with a Neutron project(tenant) and a given
    API session. It ensures that a lock is acquired and
    released in the same API context.

    The detailed description is in the spec review.
    https://review.openstack.org/268005

    Closes-Bug: #1529812
    Closes-Bug: #1529326
    Closes-Bug: #1497676
    Related-Bug: #1527234
    Implements: blueprint bp/keep-db-consistency

    Change-Id: Iff916481282f2d60df66c0e916f3045f9944531e

Changed in dragonflow:
status: New → Fix Released