race happens in api_nb

Bug #1529812 reported by Li Ma on 2015-12-29
This bug affects 1 person
Affects: DragonFlow
Importance: High
Assigned to: Li Ma

Bug Description

Code: https://github.com/openstack/dragonflow/blob/master/dragonflow/db/api_nb.py#L206-L220

If other operations are performed concurrently, the data associated with the key is unpredictable. We need general transaction operations provided by the db_api layer to handle synchronization,
similar to the transactions provided by SQLAlchemy.

It is also worth noting that almost all of the operations provided by api_nb have this potential race, because they follow the same pattern:

(1) read the key
(2) do some calculation, like adding some json data into previous value
(3) overwrite the key

If a concurrent write happens during step (2), it is overwritten at step (3) and the concurrently written data is definitely lost.
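The lost update can be reproduced in miniature with an in-memory key/value store and two threads. This is only an illustrative sketch: the store, key names, and thread orchestration are stand-ins, not Dragonflow code.

```python
import json
import threading

# Toy in-memory stand-in for the key/value store (illustrative only).
store = {"lswitch1": json.dumps({"subnets": []})}

# Events force the problematic interleaving deterministically.
a_read = threading.Event()
b_done = threading.Event()

def writer_a():
    value = json.loads(store["lswitch1"])   # (1) read the key
    a_read.set()
    b_done.wait()                           # B writes while A "calculates"
    value["subnets"].append({"id": "subnet-a"})   # (2) modify stale copy
    store["lswitch1"] = json.dumps(value)   # (3) overwrite: clobbers B's write

def writer_b():
    a_read.wait()
    value = json.loads(store["lswitch1"])
    value["subnets"].append({"id": "subnet-b"})
    store["lswitch1"] = json.dumps(value)
    b_done.set()

ta = threading.Thread(target=writer_a)
tb = threading.Thread(target=writer_b)
ta.start(); tb.start(); ta.join(); tb.join()

ids = [s["id"] for s in json.loads(store["lswitch1"])["subnets"]]
print(ids)  # ['subnet-a'] -- subnet-b is lost
```

The events pin down the exact interleaving that only happens occasionally in practice: B's write lands between A's read and A's write, so A's overwrite discards it.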

Li Ma (nick-ma-z) on 2015-12-29
Changed in dragonflow:
assignee: nobody → Li Ma (nick-ma-z)
Li Ma (nick-ma-z) on 2015-12-29
summary: - race happens in add_subnet of api_nb
+ race happens in api_nb
Li Ma (nick-ma-z) on 2015-12-29
description: updated
Li Ma (nick-ma-z) wrote :

I tested this situation with the etcd driver and the problem described above really happened. If a concurrent write to the same key occurs while an operation is in progress, the value of that key is corrupted.

(1) Solution 1
I suggest defining a set of transaction interfaces at the db-api layer and enforcing them across all db-backend drivers. Then rewrite nb-api to run every operation inside a transaction.

Pros: it is straightforward and does not introduce any other mechanism.
Cons: it requires modifying the db-api layer and the nb-api layer, along with all the db drivers.
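As a rough illustration of Solution 1, here is a sketch of what such a db-api transaction interface might look like. The names (DbApi, transaction, DictDriver) are assumptions for illustration, not the actual Dragonflow API.

```python
import abc
import contextlib

class DbApi(abc.ABC):
    """Hypothetical db-api base class with a transaction hook."""

    @abc.abstractmethod
    def get_key(self, table, key):
        ...

    @abc.abstractmethod
    def set_key(self, table, key, value):
        ...

    @contextlib.contextmanager
    def transaction(self):
        # Each backend driver would override this to provide real
        # atomicity (e.g. an etcd transaction or a SQL transaction).
        yield  # default: no isolation

class DictDriver(DbApi):
    """Toy in-memory driver for illustration."""

    def __init__(self):
        self._data = {}

    def get_key(self, table, key):
        return self._data.get((table, key))

    def set_key(self, table, key, value):
        self._data[(table, key)] = value

driver = DictDriver()
with driver.transaction():
    driver.set_key("lswitch", "sw1", "{}")
print(driver.get_key("lswitch", "sw1"))  # {}
```

The context-manager shape is what lets nb-api wrap its read-modify-write cycles without knowing which backend is underneath.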

(2) Solution 2
Introduce a DLM (distributed lock manager) for the nb-api layer. A distributed lock solves exactly this kind of synchronization problem. The only issue is that a DLM also needs a distributed coordination backend, like etcd or ZooKeeper, so we would need to share the same backend for these two distinct purposes.

Pros: it moves the responsibility for synchronization from db-api to an external system, so there is no need to modify db-api or the db drivers.
Cons: it introduces an external system to implement synchronization.
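As a rough in-process illustration of Solution 2, the sketch below wraps the whole read-modify-write cycle in a lock. A threading.Lock stands in for a real distributed lock, which in practice would be backed by etcd or ZooKeeper; the store and names are illustrative.

```python
import json
import threading

store = {"lswitch1": json.dumps({"subnets": []})}
# In-process stand-in for a distributed lock; a real DLM would be
# backed by a coordination service such as etcd or ZooKeeper.
lswitch_lock = threading.Lock()

def add_subnet(subnet_id):
    # The whole read-modify-write cycle runs under the lock, so
    # concurrent writers cannot interleave and no update is lost.
    with lswitch_lock:
        value = json.loads(store["lswitch1"])
        value["subnets"].append({"id": subnet_id})
        store["lswitch1"] = json.dumps(value)

threads = [threading.Thread(target=add_subnet, args=("subnet-%d" % i,))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(json.loads(store["lswitch1"])["subnets"]))  # 5
```

All five concurrent writes survive because the lock serializes the full cycle, not just the individual reads and writes.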

Li Ma (nick-ma-z) wrote :

(3) Solution 3
Introduce a compare-and-swap atomic operation in db-api and implement it in all the db drivers.
In nb-api:

retries = 0
while retries < MAXIMUM_RETRIES:
    old_value = driver.get_key(key)
    new_value = ...  # apply the operation to old_value

    try:
        driver.compare_and_set_key(key, new_value, old_value)
        break  # success, stop retrying
    except ValueVersionError:
        # the value changed underneath us; re-read and retry
        retries += 1
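To make the idea concrete, here is a toy in-memory driver implementing the proposed compare_and_set_key together with the retry loop. All names are hypothetical, following the sketch above; a real driver would delegate the comparison to the backend (e.g. etcd's atomic compare-and-swap).

```python
import threading

class ValueVersionError(Exception):
    """Raised when the stored value no longer matches the expected one."""

class CasDriver:
    """Toy in-memory driver exposing a compare-and-swap primitive."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()  # models server-side atomicity

    def get_key(self, key):
        return self._data.get(key)

    def compare_and_set_key(self, key, new_value, old_value):
        with self._mutex:
            if self._data.get(key) != old_value:
                raise ValueVersionError(key)
            self._data[key] = new_value

MAXIMUM_RETRIES = 10

def atomic_append(driver, key, item):
    for _ in range(MAXIMUM_RETRIES):
        old_value = driver.get_key(key)       # (1) read
        new_value = (old_value or "") + item  # (2) compute
        try:
            driver.compare_and_set_key(key, new_value, old_value)  # (3) CAS
            return True
        except ValueVersionError:
            continue  # someone else wrote in between; re-read and retry
    return False

driver = CasDriver()
threads = [threading.Thread(target=atomic_append, args=(driver, "k", c))
           for c in "abcde"]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(driver.get_key("k")))  # ['a', 'b', 'c', 'd', 'e']
```

Every failed CAS implies some other writer succeeded, so with five writers each thread can fail at most four times before its own CAS goes through; no write is lost.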

Li Ma (nick-ma-z) on 2015-12-31
Changed in dragonflow:
importance: Undecided → Medium
Gal Sagie (gal-sagie) wrote :

Hi Li Ma,

I agree with you that there is a potential race, but I think we first need to understand the exact scenario in which this happens, because if it does, it seems problematic from the Neutron standpoint as well, or it may be wrong modeling of our DB.

Compare-and-swap is probably the better solution (Solution 3), but can you please share the exact scenario in which you saw this happen?

I am also raising the priority of this to High, because if there is a race, it is problematic.

Changed in dragonflow:
importance: Medium → High
Li Ma (nick-ma-z) wrote :

1. I modified the api_nb update_subnet function to simulate a delay:

def update_subnet(self, id, lswitch_name, **columns):
    lswitch_json = self.driver.get_key('lswitch', lswitch_name)

    import time
    time.sleep(1)  # simulates a delay for testing

    lswitch = jsonutils.loads(lswitch_json)
    subnet = None
    for s in lswitch.get('subnets', []):
        if s['id'] == id:
            subnet = s

    for col, val in columns.items():
        subnet[col] = val

    lswitch_json = jsonutils.dumps(lswitch)
    self.driver.set_key('lswitch', lswitch_name, lswitch_json)

2. run a multi-worker neutron-server, the default configuration in devstack

3. neutron subnet-update --name 111 {SUBNET_ID}
     neutron subnet-create {IP_CIDR} {NETWORK_ID}

4. The created subnet is recorded in the Neutron DB, but lost in etcd.

I'll attach a screenshot for demonstration.

Li Ma (nick-ma-z) wrote :

The scenario happens randomly with bulk operations on subnets, especially in a multi-process & multi-worker environment, but it cannot be reproduced every time. So I used a trick in the code to demonstrate it.

Li Ma (nick-ma-z) wrote :

multi-process --> multi-nodes

Yuli (stremovsky) wrote :

This also seems related to DB consistency.

Reviewed: https://review.openstack.org/282290
Committed: https://git.openstack.org/cgit/openstack/dragonflow/commit/?id=3b92dc8eac33ad96b6cd40ad80427c9951444d99
Submitter: Jenkins
Branch: master

commit 3b92dc8eac33ad96b6cd40ad80427c9951444d99
Author: Li Ma <email address hidden>
Date: Thu Feb 18 10:30:42 2016 +0800

    Implement DB consistency

    It is a distributed lock based on SQL. Each lock is
    associated with a Neutron project(tenant) and a given
    API session. It ensures that a lock is acquired and
    released in the same API context.

    The detailed description is in the spec review.
    https://review.openstack.org/268005

    Closes-Bug: #1529812
    Closes-Bug: #1529326
    Closes-Bug: #1497676
    Related-Bug: #1527234
    Implements: blueprint bp/keep-db-consistency

    Change-Id: Iff916481282f2d60df66c0e916f3045f9944531e

Changed in dragonflow:
status: New → Fix Released