Parallel image download fails with cinder backend because attachment state manager uses thread locks but glance uses worker processes

Bug #1965679 reported by Balazs Gibizer
This bug affects 4 people
Affects: glance_store
Status: In Progress
Importance: Undecided
Assigned to: Mitya Eremeev

Bug Description

The locking scheme in glance_store/common/attachment_state_manager.py assumes that glance only uses parallel threads on a single host to attach cinder volumes. But glance uses worker processes. Therefore the locking[1] in the state manager does not prevent two worker processes from handling the same cinder volume on the same host in parallel. This leads to scenarios where one worker detaches a volume that the other worker is still actively using, resulting in an I/O error.
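
To illustrate the root cause outside of glance, here is a minimal standalone sketch (my own illustration, not glance_store code): each worker process ends up with its own copy of an in-memory threading.Lock, so the lock does not serialize the attach/use/detach critical sections across processes. The volume id and function name below are made up for the example.

import multiprocessing
import os
import threading
import time

lock = threading.Lock()  # stands in for the state manager's in-memory lock

def attach_use_detach(volume_id):
    # Only serializes threads inside *this* process; every worker process
    # has its own independent copy of the lock.
    with lock:
        print("pid %s attached %s" % (os.getpid(), volume_id))
        time.sleep(1)  # simulate reading image data from the volume
        print("pid %s detached %s" % (os.getpid(), volume_id))

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=attach_use_detach,
                                       args=("vol-1",)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Both "attached" lines are printed before either "detached" line, i.e. the critical sections overlap across the two processes, which is exactly what happens to the two glance-api workers.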

Environment:
* single node devstack from master
* cinder configured as glance backend

The default devstack config creates two glance worker processes:
stack@aio:/opt/stack$ ps -eLf | grep glance
stack 286892 1 286892 0 1 13:29 ? 00:00:00 glance-apiuWSGI master
stack 286914 286892 286914 0 4 13:29 ? 00:00:02 glance-apiuWSGI worker 1
stack 286914 286892 286986 0 4 13:29 ? 00:00:00 glance-apiuWSGI worker 1
stack 286914 286892 287000 0 4 13:29 ? 00:00:00 glance-apiuWSGI worker 1
stack 286914 286892 287440 0 4 13:34 ? 00:00:00 glance-apiuWSGI worker 1
stack 286915 286892 286915 0 5 13:29 ? 00:00:02 glance-apiuWSGI worker 2
stack 286915 286892 286984 0 5 13:29 ? 00:00:00 glance-apiuWSGI worker 2
stack 286915 286892 286998 0 5 13:29 ? 00:00:00 glance-apiuWSGI worker 2
stack 286915 286892 287199 0 5 13:29 ? 00:00:00 glance-apiuWSGI worker 2
stack 286915 286892 287441 0 5 13:34 ? 00:00:00 glance-apiuWSGI worker 2

* I've added some delay to the cinder driver in glance_store to slow down the image download to ease the reproduction of the race condition

* I've applied https://review.opendev.org/c/openstack/cinder/+/805072 to cinder to further ease the reproduction.

* Started two parallel image downloads with a script:
rm /opt/stack/data/glance/cache/78267329-394c-4092-9650-c8195362193e

openstack image save 78267329-394c-4092-9650-c8195362193e --file /dev/null &
openstack image save 78267329-394c-4092-9650-c8195362193e --file /dev/null &
wait

* It produced the following error in glance:
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder [None req-d455b283-4a56-4d2d-a0a6-13f31de54bb2 admin admin] Exception while accessing to cinder volume 07fb537e-b7f9-4a82-9803-f4151f9af14f.: OSError: [Errno 5] Input/output error
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder Traceback (most recent call last):
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder File "/usr/local/lib/python3.8/dist-packages/glance_store/_drivers/cinder.py", line 801, in _open_cinder_volume
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder yield f
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder File "/usr/local/lib/python3.8/dist-packages/glance_store/_drivers/cinder.py", line 845, in _cinder_volume_data_iterator
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder chunk = fp.read(size)
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder OSError: [Errno 5] Input/output error
Mar 20 13:29:48 aio <email address hidden>[286914]: ERROR glance_store._drivers.cinder

See the full log in https://paste.opendev.org/show/beGGuWCP0CFUPX7Jd4R1/ (I've added extra logs to show the PIDs of the worker processes when the state manager is accessed)

[1] https://github.com/openstack/glance_store/blob/c0c4969a4a6ce77c61f84fcb37b3efdac71d6418/glance_store/common/attachment_state_manager.py#L62

Revision history for this message
Mitya Eremeev (mitos) wrote :

Hi,

* I've added some delay to the cinder driver in glance_store to slow down the image download to ease the reproduction of the race condition

Could you specify where you added the delays?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

diff --git a/glance_store/_drivers/cinder.py b/glance_store/_drivers/cinder.py
index 8a81543..53bde98 100644
--- a/glance_store/_drivers/cinder.py
+++ b/glance_store/_drivers/cinder.py
@@ -844,6 +844,7 @@ class Store(glance_store.driver.Store):

                 chunk = fp.read(size)
                 if chunk:
+                    time.sleep(0.3)
                     yield chunk
                     max_size -= len(chunk)
                     if max_size <= 0:

This basically simulates a slow image download by slowing down how quickly the bits of the image are returned.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

By the way, I believe something very similar can be reproduced without the additional sleep and without https://review.opendev.org/c/openstack/cinder/+/805072.

I have two glance worker processes:

/etc/glance/glance-uwsgi.ini
[uwsgi]
processes = 2

$ ps aux | egrep 'glance-apiuWSGI worker' | grep -v grep
rubasov 285117 1.9 1.6 586480 135856 ? Sl 11:47 0:15 glance-apiuWSGI worker 1
rubasov 285118 2.0 1.6 586456 135400 ? Sl 11:47 0:16 glance-apiuWSGI worker 2

A 1000 MiB glance image backed by a multiattach cinder volume:

$ openstack volume show "$( openstack volume list --all --user glance --name image-$( openstack image show image1 -f value -c id ) -f value -c ID )" -f value -c type
lvmdriver-1-multiattach

$ openstack volume type show lvmdriver-1-multiattach -f value -c properties
{'multiattach': '<is> True'}

Then I test with two concurrent 'openstack image save' commands. Usually one succeeds, but the other fails with:

Unable to download image: InvalidResponse

And during the failure I see various errors in the g-api logs; a few examples from 15 runs:

01 fail: No targets found
02 fail: No targets found
03 fail: No targets found
04 success
05 fail: No targets found
06 fail: No targets found + Input/output error
07 fail: No targets found
08 fail: No targets found + Input/output error
09 fail: No targets found
10 fail: No targets found
11 fail: No targets found + Checksum verification failed
12 fail: No targets found
13 fail: No targets found + Input/output error
14 fail: No targets found
15 fail: No targets found

júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi [None req-29a1bfc5-3c80-4d50-94a5-85f2a04a203e admin admin] Exception encountered during portal discovery: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi Traceback (most recent call last):
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi File "/usr/local/lib/python3.10/dist-packages/os_brick/initiator/connectors/iscsi.py", line 206, in _get_ips_iqns_luns
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi ips_iqns_luns = method(connection_properties)
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi File "/usr/local/lib/python3.10/dist-packages/os_brick/initiator/connectors/iscsi.py", line 365, in _get_discoverydb_portals
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi out = self._run_iscsiadm_bare(['-m', 'discoverydb',
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi File "/usr/local/lib/python3.10/dist-packages/os_brick/initiator/connectors/iscsi.py", line 1160, in _run_iscsiadm_bare
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.iscsi (out, err) = self._execute('iscsiadm',
júl 18 11:18:08 devstack0 <email address hidden>[1324]: ERROR os_brick.initiator.connectors.i...

Revision history for this message
Mitya Eremeev (mitos) wrote :

Bence,

You're right, just a bigger image is sufficient to reproduce it.
I reproduced it too.
Now I'm working on a patch.
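
One possible direction, as a minimal sketch under my own assumptions (not the actual patch): take an inter-process file lock via oslo.concurrency around attach/use/detach, keyed by volume and host, so that all glance-api worker processes on the host serialize on it. The helpers _attach_volume and _detach_volume below are placeholders for illustration only.

import contextlib

from oslo_concurrency import lockutils

@contextlib.contextmanager
def open_volume_exclusively(client, volume_id, host):
    # external=True makes lockutils take a file lock under lock_path, which
    # is shared by every worker process on the host, unlike the in-memory
    # lock in attachment_state_manager.py.
    with lockutils.lock("glance-cinder-attach-%s-%s" % (volume_id, host),
                        external=True):
        attachment = _attach_volume(client, volume_id, host)  # placeholder
        try:
            yield attachment
        finally:
            _detach_volume(client, attachment)  # placeholder

Note that an exclusive per-volume lock like this would also serialize parallel downloads of the same image on one host, while the attachment state manager is meant to let them share a single attachment, so the real fix probably has to make the attachment reference counting itself visible across processes.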

Mitya Eremeev (mitos)
Changed in glance-store:
assignee: nobody → Mitya Eremeev (mitos)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (master)