Concurrent bootable volume creation from default glance image (i.e. cirros) with default volume type (i.e. lvmdriver-1) failed

Bug #1705442 reported by Vivek Soni
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Cinder
Incomplete
Undecided
Vivek Soni

Bug Description

Step to reproduce :-
---------------------
1) Comment out the flag 'image_volume_cache_enabled = True' in lvmdriver-1 backend section in cinder.conf
2) Restart the cinder services
3) Try creating bootable volume with default volume-type (lvmdriver-1) and default glance image (cirros)

stack@cld6b14:~/devstack$ glance image-list
+--------------------------------------+--------------------------+
| ID | Name |
+--------------------------------------+--------------------------+
| f5a63abe-e73a-461e-8779-d93a773d18f9 | cirros-0.3.5-x86_64-disk |
+--------------------------------------+--------------------------+
stack@cld6b14:~/devstack$ for i in `seq 10`; do cinder create --image-id f5a63abe-e73a-461e-8779-d93a773d18f9 --name volume$i --volume-type lvmdriver-1 1 & done

Error Logs :-
---------------------
|__Flow 'volume_create_manager': ToozError: Not Found
2017-07-20 02:59:47.429 TRACE cinder.volume.manager Traceback (most recent call last):
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  result = task.execute(**arguments)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/opt/stack/cinder/cinder/volume/flows/manager/create_volume.py", line 951, in execute
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  **volume_spec)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/opt/stack/cinder/cinder/volume/flows/manager/create_volume.py", line 898, in _create_from_image
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  image_service)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "<decorator-gen-236>", line 2, in _create_from_image_cache_or_download
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/opt/stack/cinder/cinder/coordination.py", line 169, in _synchronized
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  with lock(blocking):
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/tooz/locking.py", line 31, in __enter__
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  return self.lock.__enter__(*self.args, **self.kwargs)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/tooz/locking.py", line 52, in __enter__
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  acquired = self.acquire(*args, **kwargs)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/tooz/drivers/etcd3gw.py", line 50, in wrapper
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  cause=e)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/tooz/utils.py", line 225, in raise_with_cause
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  File "/usr/lib/python2.7/dist-packages/six.py", line 718, in raise_from
2017-07-20 02:59:47.429 TRACE cinder.volume.manager  raise value
2017-07-20 02:59:47.429 TRACE cinder.volume.manager ToozError: Not Found

Revision history for this message
Vivek Soni (viveksoni) wrote :

On further analysis, it fails to acquire the lock in decorator function coordination.synchronized(). This results in error "ToozError: Not Found"

Revision history for this message
Vivek Soni (viveksoni) wrote :

out of 10 volumes 6 were succesfull and 4 was in error state.

stack@cld6b14:~/devstack$ cinder list
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+
| ID | Status | Name | Size | Volume Type | Bootable | Attached to |
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+
| 2f26e4a2-d59f-4bad-8014-4ce14d91281c | available | volume9 | 1 | lvmdriver-1 | true | |
| 4ca041fb-49c7-4ed3-8855-193ad6886b90 | available | volume3 | 1 | lvmdriver-1 | true | |
| 57a88435-11f2-475d-a27e-fcff50f12831 | available | volume2 | 1 | lvmdriver-1 | true | |
| 92435fae-df8d-4ad5-a270-bbf5d1f67eed | error | volume6 | 1 | lvmdriver-1 | false | |
| 99d76aff-17a9-405e-a850-66e3c44deb8d | available | volume5 | 1 | lvmdriver-1 | true | |
| ab480954-7faa-47c9-9d1d-8e26fdf40e1f | available | volume7 | 1 | lvmdriver-1 | true | |
| b53b8cbb-bdaa-4a06-b8ed-c98dba8e9599 | error | volume8 | 1 | lvmdriver-1 | false | |
| bea42336-ea60-4eaf-b064-11a950378436 | error | volume10 | 1 | lvmdriver-1 | false | |
| d88743fb-04b2-404f-840b-d844473f285a | available | volume4 | 1 | lvmdriver-1 | true | |
| e5504986-35fa-4b25-99a7-86c2abee9dbe | available | volume1 | 1 | lvmdriver-1 | true | |
+--------------------------------------+-----------+----------+------+-------------+----------+-------------+
stack@cld6b14:~/devstack$

Revision history for this message
Vivek Soni (viveksoni) wrote :

Analysis :-
------------------------
1)Since we are using the same ‘image’ while creating all the volumes, cinder code uses same ‘image-id’ to create the lock_name, which in turn, used for creating the same key in etcd.
2)There is the limitation of ‘etcd3gw’ client (which communicates with etcd), that in one transaction, multiple request cannot modify the same key several times.

To aid this point please refer below code :-
------------------------------
def transaction(self, txn):
        """Txn processes multiple requests in a single transaction.

        A txn request increments the revision of the key-value store and
        generates events with the same revision for every completed request.
        It is not allowed to modify the same key several times within one txn.

        :param txn:
        :return:
        """
        return self.post(self.get_url("/kv/txn"), data=json.dumps(txn)) ## This gives Etcd3Exception: Not Found

Solution :-
------------------
To actually fix this limitation of ‘etcd3gw’ client, I suggest to pass unique name for creating lock_name, which inturn creates unique key. This suggestion is tested and found to be working fine.
For details please see log below :-

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/487694

Changed in cinder:
assignee: nobody → Vivek Soni (viveksoni)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/487694
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=35ae6fb90380a8e9db936827a4414cb249ef399b
Submitter: Jenkins
Branch: master

commit 35ae6fb90380a8e9db936827a4414cb249ef399b
Author: Vivek Soni <email address hidden>
Date: Wed Jul 26 22:09:17 2017 -0700

    Handle concurrent volume creation request

    Cinder passes same 'image_id', from which the
    lock_name is created. This 'lock_name' is used
    by 'tooz.locking' to acquire the lock.

    In this flow, based on 'lock_name' it creates the
    'key' and stored in 'etcd'. etcd3gw client does
    not allow to modify the same key several times
    within one transactionn. so to aviod this, pass
    unique 'lock_name' which in turn creates unique
    key.

    Change-Id: Ib78ba6aa1fcc488aa2ab7db25a37021f206519dd
    Closes-Bug: #1705442

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cinder (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/488158

Revision history for this message
Eric Harney (eharney) wrote :

This needs more investigation to determine whether there is a bug in tooz/etcd or what is going on here.

Changed in cinder:
status: Fix Released → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 11.0.0.0b3

This issue was fixed in the openstack/cinder 11.0.0.0b3 development milestone.

Revision history for this message
Gorka Eguileor (gorka) wrote :

I agree with Eric, if this is a limitation in etcd3gw it needs to be fixed there or a workaround provided in TooZ, if none of those are possible we'll have to implement the workaround ourselves, but certainly rendering the lock useless by providing a unique ID is not the solution.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to cinder (master)

Reviewed: https://review.openstack.org/488158
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=9b2a9407f7c1b4688b9f696ea94677516bca5eba
Submitter: Jenkins
Branch: master

commit 9b2a9407f7c1b4688b9f696ea94677516bca5eba
Author: Eric Harney <email address hidden>
Date: Thu Jul 27 15:17:38 2017 -0400

    Revert "Handle concurrent volume creation request"

    This change converted a lock which locked on a
    per-image basis to lock on a per-image+volume basis.

    This defeats the point of having this lock, which was
    to not perform the same image download operation more than
    once, consuming unneeded network and storage resources.

    This is not an especially strenuous or odd locking
    pattern. If this causes tooz+etcd to fail, there is
    likely a bug in tooz or etcd. There are many places in
    Cinder that use locks like this one.

    Related-Bug: #1705442

    This reverts commit 35ae6fb90380a8e9db936827a4414cb249ef399b.

    Change-Id: Ib00fcc61658f6b1e59df66d387f3101ea198d2e7

Revision history for this message
Vivek Soni (viveksoni) wrote :

I agree with Eric. I was about to revert my patch with the actual code changes required.
Thanks for reverting.

Actually with 'image_volume_cache_enabled = False', we do not want to acquire lock since on every request we need new download and not from cache. Locking mechanism here is used to create common shared resource.

Revision history for this message
Chris M (pots) wrote :

The Lenovo Cinder driver fails the same way when @utils.synchronized() is replaced with @coordination.synchronized(), blocking https://review.openstack.org/#/c/479069/

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.