writes to block device during copy_image_to_volume

Bug #1352875 reported by Guy Rozendorn
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Cinder
Invalid
Undecided
Unassigned

Bug Description

When cinder writes a qemu-based image from Glance to a block device during copy_image_to_volume, the writes are not committed to disk before it is removed from the SCSI sub-system and then unmapped.

first, we see that the device is found:
{noformat}
Found Fibre Channel volume /dev/sdk (after 1 rescans) connect_volume /usr/lib/python2.6/site-packages/cinder/brick/initiator/connector.py:651
{noformat}

then, we see that it reads the first 512 bytes, we don't know why:
{noformat}
2014-08-05 08:49:36.534 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf sg_scan /dev/sdk execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
2014-08-05 08:49:36.598 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf dd if=/dev/disk/by-path/pci-0000:07:00.0-fc-0x5742b0f000000815-lun-2 of=/dev/null count=1 execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
{noformat}

then, it writes the image on the block device:
{noformat}
2014-08-05 08:49:36.790 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C qemu-img info /var/lib/cinder/conversion/tmpIbAM5q execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
2014-08-05 08:49:40.049 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C qemu-img info /var/lib/cinder/conversion/tmpIbAM5q execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
2014-08-05 08:49:44.053 22122 DEBUG cinder.image.image_utils ... 72897399-6e6e-4238-baf1-1fa4c6de8b71 was qcow2, converting to raw fetch_to_volume_format /usr/lib/python2.6/site-packages/cinder/image/image_utils.py:217
2014-08-05 08:49:44.054 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf qemu-img convert -O raw /var/lib/cinder/conversion/tmpIbAM5q /dev/disk/by-path/pci-0000:07:00.0-fc-0x5742b0f000000815-lun-2 execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
{noformat}

then, it reads the block device to verify;
{noformat}
2014-08-05 08:49:45.121 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf env LC_ALL=C qemu-img info /dev/disk/by-path/pci-0000:07:00.0-fc-0x5742b0f000000815-lun-2 execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
{noformat}

and then it removes the block device and unmaps the volume:
{noformat}
2014-08-05 08:49:45.303 22122 DEBUG cinder.openstack.common.lockutils ... Got semaphore "connect_volume" for method "disconnect_volume"... inner /usr/lib/python2.6/site-packages/cinder/openstack/common/lockutils.py:191
2014-08-05 08:49:45.303 22122 DEBUG cinder.brick.initiator.linuxscsi ... Remove SCSI device(/dev/sdk) with /sys/block/sdk/device/delete remove_scsi_device /usr/lib/python2.6/site-packages/cinder/brick/initiator/linuxscsi.py:58
2014-08-05 08:49:45.304 22122 DEBUG cinder.openstack.common.processutils ... Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf tee -a /sys/block/sdk/device/delete execute /usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py:142
2014-08-05 08:49:45.810 22122 DEBUG cinder.volume.driver ... volume ad876e38-857d-4149-abeb-cbddc2bc7d63: removing export _detach_volume /usr/lib/python2.6/site-packages/cinder/volume/driver.py:456
{noformat}

then, reporting success:
{noformat}
2014-08-05 08:49:45.811 22122 DEBUG cinder.volume.flows.manager.create_volume ... Downloaded image 72897399-6e6e-4238-baf1-1fa4c6de8b71 ((None, None)) to volume ad876e38-857d-4149-abeb-cbddc2bc7d63 successfully. _copy_image_to_volume /usr/lib/python2.6/site-packages/cinder/volume/flows/manager/create_volume.py:477
2014-08-05 08:49:45.812 22122 DEBUG cinder.volume.flows.manager.create_volume ... Marking volume ad876e38-857d-4149-abeb-cbddc2bc7d63 as bootable. _enable_bootable_flag /usr/lib/python2.6/site-packages/cinder/volume/flows/manager/create_volume.py:417
2014-08-05 08:49:45.869 22122 DEBUG cinder.volume.flows.manager.create_volume ... Copying metadata from image 72897399-6e6e-4238-baf1-1fa4c6de8b71 to ad876e38-857d-4149-abeb-cbddc2bc7d63. _handle_bootable_volume_glance_meta /usr/lib/python2.6/site-packages/cinder/volume/flows/manager/create_volume.py:380
2014-08-05 08:49:45.870 22122 DEBUG cinder.volume.flows.manager.create_volume ... Creating volume glance metadata for volume ad876e38-857d-4149-abeb-cbddc2bc7d63 backed by image 72897399-6e6e-4238-baf1-1fa4c6de8b71 with: {'container_format': u'bare', 'min_ram': 0, 'disk_format': u'qcow2', 'image_name': u'650', 'image_id': u'72897399-6e6e-4238-baf1-1fa4c6de8b71', 'checksum': u'ff3cfedc52decb10a66d61ce457addf3', 'min_disk': 0, 'size': 829095936}. _capture_volume_image_metadata /usr/lib/python2.6/site-packages/cinder/volume/flows/manager/create_volume.py:514
2014-08-05 08:49:46.242 22122 INFO cinder.volume.flows.manager.create_volume ... Volume volume-ad876e38-857d-4149-abeb-cbddc2bc7d63 (ad876e38-857d-4149-abeb-cbddc2bc7d63): created successfully
{noformat}

and that's it.
the problem is that the writes that qemu-img does are not synced/flushed to disk before remove the device.
this piece of code comes from the base-class for volume drivers within Cinder, which means it affects all the volume drivers, not just ours.
as a workaround, we're adding a call to /bin/sync before detaching a volume in the next release of our volume driver.

we managed to write a test that re-creates it and verifies that our workaround fixes this issue.

Revision history for this message
Alon Marx (alonma) wrote :

We are also aware of this issue, and it is affecting our driver as well. This is an important issue that should be receive attention quickly.

Revision history for this message
John Griffith (john-griffith) wrote :

I'm unable to reproduce this against multiple drivers. Also, the existing code blocks on the convert and should not be returning/disconnecting until the device acks the writes. If the device you're using is enabling some sort of caching or doing something there that could explain the issue.

I'd be curious if you could provide some more details regarding your backend device, topology and image size?

Thanks,
John

Revision history for this message
Geoff Willett (geoff-willett) wrote :

Barclays has been testing Cinder Image Volumes with Various back-end devices with mixed results. We are using an image size of ~800Mb which expands to 1024Mb on the Image Volume (though this size is reported as smaller on some array types e.g. SVC). We have been creating volumes one at a time, and also multiple concurrent creates.

These are the array type, attachments, and results (NOTE this should not be taken as an exhaustive test, nor is it any statement/comment on any particuar vendor):

IBM XIV 10GbE iSCSI approx 10-20% volumes are created with an image size of less than 1024MB
IBM SVC 10GbE iSCSI 1% of volumes are created with an image size of less than 792MB
Infinidat 8gbps FC

Revision history for this message
Geoff Willett (geoff-willett) wrote :

Infinidat 8gbps FC 95% of volumes are created with a smaller image size
SolidFire 10GbE iSCSI 0% of volumes are truncated

We have been testing a workaround suggested by Infinidat which very significantly reduces the incidence of truncated image volumes, but does not seem to have a 100% sucess rate.

It is very interesting to note the differences between array types even though they are all using the same code and utility (QEMU) to write the image to disk.

Revision history for this message
Geoff Willett (geoff-willett) wrote :
Download full text (8.7 KiB)

This is a listing of Image Volumes created on two different XIV's. Al volumes should be 1024Mb:

XIV 7861041>>vol_list pool=OPENSTACK -t "name,size_MiB,used_capacity_MiB"
Name Size (MiB) Used Capacity (MiB)
volume-ac192798-958b-45a7-98dd-8b093782a7fb 16411 1024
volume-29972d06-6ca4-49d6-9d1e-c71cae27bafb 16411 1024
volume-1f886ef4-e935-419b-aecc-58f1140e3fc0 16411 1024
volume-facb2bbe-bfa5-4c50-a484-8689b5892fcd 16411 1024
volume-e51d8a69-eae8-4573-b600-b1d04e37416f 16411 1024
volume-e4a9c065-61e0-46e9-a5fb-01293f310d21 16411 1024
volume-2da1c586-8608-483c-98a0-a949507fb402 16411 1024
volume-eb8e5ec8-5532-406a-8bfc-e90118e4851d 16411 1022
volume-7f4736c0-6cd2-488f-aa32-81bcb8055d42 16411 1024
volume-2614f229-3e27-4ef4-a788-8a858314c35f 16411 1024
volume-40f662d1-d1d8-4592-8fa4-613feb9c89ad 16411 1023
volume-3410e89b-1994-4b10-8866-78a6aa37c0bf 16411 1024
volume-18ec5eca-77af-4d77-8cf7-ffab3c358e6f 16411 1024
volume-13c64356-2120-4812-a084-eca468f7618e 16411 1024
volume-8d93f151-f501-4d87-ae0f-561d3ac75dd5 16411 1022
volume-e456858a-fa36-4efd-816e-13d1e90ff021 16411 1024
volume-a138b450-b916-4575-8013-968029b093fe 16411 1024
volume-457dfc14-9151-4714-bd9a-49a0bee345eb 16411 1024
volume-a7863284-748a-4eeb-9783-e21cb4c8b93c 16411 1024
volume-6ae925f6-e5f7-4c0e-808e-539be60e958d 16411 1024
volume-4e0952f2-4dcc-4de2-a429-95b9410b4f7a 16411 1024
volume-6dfb99a2-b4ed-4f32-a14c-83cd9d25825b 16411 1024
volume-29eb1ea3-7595-43c5-bbf9-5ad9bf101b3a 16411 1024
volume-b5a0726a-9164-4f86-91d8-3fad70b0c6e7 16411 1023
volume-dae47306-1236-4cb2-bf02-17e4f96a517a 16411 1024
volume-daf2a368-ab8a-43e2-b4c3-9439a7111309 16411 1023
volume-bc423767-4fcc-4078-b625-b5dbfa326e1f 16411 1023
volume-59d35566-5f73-4447-8e6a-d155cf0b6ad4 16411 1024
volume-e30b2864-4e17-4de0-8408-38d33f2c860c 16411 1022
volume-7826a076-002d-4de0-988e-2a7a6f2bff67 16411 1023
volume-234b3952-ac89-4ebb-b3ea-11e16e676ecc 16411 1024
volume-601da10f-f46c-4a37-a66a-64667ccb215f 16411 1020
volume-a2eef550-31b6-481d-95ae-cf92a61885b3 16411 1023
volume-80a4285d-f2fd-4431-a0f5-ea905818c111 16411 1023
volume-8a90f418-b3c5-41e1-b6c8-991e15d2d47b 16411 1022
vo...

Read more...

Revision history for this message
Guy Rozendorn (guy-rozendorn) wrote :

We're experiencing this with the 32MB cirrus inane, as well as larger images like 1GB.

The backend devices are FC and iSCSI attached block devices; this is easily reproduced with and without multipath.

The current code does with for the convert to complete, but the writes are not synced to the device and the buffers are not flushed before detaching the volume.
and I assume (didn't strace to verify) that the qemu binary does not open the block device with O_DIRECT

Revision history for this message
Guy Rozendorn (guy-rozendorn) wrote :

from the man page of qemu-img and http://en.wikibooks.org/wiki/QEMU/Devices/Storage, always passing '-t none' to qemu-img in cinder.image.image_utils:convert_image might resolve this issue

Revision history for this message
Guy Rozendorn (guy-rozendorn) wrote :

or just adding this the cinder configuration.

Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

Can you strace the cinder-volume service and figure out exactly what is going wrong with the write?

Revision history for this message
John Griffith (john-griffith) wrote :

What driver please (ie backend device)... does not reproduce with LVM using Tgt or LIO helpers or other iSCSI based backends (Solidfire). Also if you could grab the trace info Duncan mentioned that would be great.

Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

Ok, so I've just done a trace myself, and there is a missing flush in some cases. This is going to be tricky to solve - LVM, iscsi and other do a sync before detach internally, we might need to require that other transports do the same

Changed in cinder:
status: New → Confirmed
Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

Even adding a call to /bin/sync is not enough, because sync returns before it has finished - it only initiates a flush. "sync; sync" happens to work on most kernels but that is not a defined behaviour

Revision history for this message
Alon Marx (alonma) wrote :

Does -t writhrough solve the problem

Revision history for this message
Guy Rozendorn (guy-rozendorn) wrote :

@duncan yes calling /bin/sync is obviously not a proper solution to the problem, it was just to prove the point.

@alon writethrough is the default; and we didn't test with -t none.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote : Bug Cleanup

Closing stale bug. If this is still an issue please reopen.

Changed in cinder:
status: Confirmed → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.