Volume cloning is failing with an I/O error when the backing snapshot is wiped

Bug #1191812 reported by Giulio Fidente
This bug affects 6 people
Affects   Status        Importance  Assigned to    Milestone
Cinder    Fix Released  High        John Griffith
Grizzly   Fix Released  High        Joe Breu

Bug Description

Cloning a volume using the --source-volid argument fails with the following error in volume.log:

"""
Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf dd if=/dev/zero of=/dev/mapper/cinder--volumes-clone--snap--727c84cc--69ee--4b7d--bdb0--832b5086f1bb count=1024 bs=1M conv=fdatasync
Exit code: 1
Stdout: ''
Stderr: "/bin/dd: fdatasync failed for `/dev/mapper/cinder--volumes-clone--snap--727c84cc--69ee--4b7d--bdb0--832b5086f1bb': Input/output error\n1024+0 records in\n1024+0 records out\n1073741824 bytes (1.1 GB) copied, 19.7612 s, 54.3 MB/s\n"
"""

NOTE: the actual dd from the origin snapshot into the destination volume is working; it is the cleanup of the origin snapshot which fails
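The doubled hyphens in the failing device path above are device-mapper's name escaping, not part of the volume name. A minimal sketch of how the /dev/mapper path in the log is built (the helper name `dm_path` is mine, not Cinder's):

```python
def dm_path(vg, lv):
    # device-mapper doubles every hyphen inside the VG and LV names,
    # then joins the two escaped names with a single hyphen
    esc = lambda name: name.replace('-', '--')
    return "/dev/mapper/%s-%s" % (esc(vg), esc(lv))

print(dm_path("cinder-volumes",
              "clone-snap-727c84cc-69ee-4b7d-bdb0-832b5086f1bb"))
# -> /dev/mapper/cinder--volumes-clone--snap--727c84cc--69ee--4b7d--bdb0--832b5086f1bb
```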

Eric Harney (eharney)
Changed in cinder:
assignee: nobody → Eric Harney (eharney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33347

Changed in cinder:
status: New → In Progress
Revision history for this message
Eric Harney (eharney) wrote : Re: LVM volume cloning is failing with I/O error

The problem here is that the LVM snapshot used for the clone fills up when it is zeroed at the end of the clone process. The system needs to either have LVM snapshot autoextend configured (many don't by default), or Cinder needs to ask for a little bit more capacity for the snap.
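For reference, the autoextend behaviour mentioned here is controlled by two settings in the activation section of /etc/lvm/lvm.conf (the values below are illustrative, not recommendations):

```
activation {
    # when a snapshot's COW usage crosses this percentage...
    snapshot_autoextend_threshold = 70
    # ...grow the snapshot LV by this percentage of its current size
    snapshot_autoextend_percent = 20
}
```

The default threshold of 100 disables autoextension, which is why many systems hit this.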

tags: added: grizzly-backport-potential
summary: - volume cloning is failing with I/O error
+ LVM volume cloning is failing with I/O error
Revision history for this message
Eric Harney (eharney) wrote :

I abandoned https://review.openstack.org/#/c/33347/ because a) it's kind of hacky and b) I don't think it really worked, since it could leave a few percent of the snap un-wiped.

Another idea tossed around on IRC was along these lines: enforce wiping of 98% (or so) of the snapshot, then allow wiping of the remaining percentage to fail with the expected error from dd. Also hacky, but workable for now.

A more robust idea is to allow LVM to do the wipe for us upon snapshot deletion, which would require updating the LVM tools. This is probably a good idea going forward but we may still need something like the above suggestion to support current LVM stacks.

Changed in cinder:
assignee: Eric Harney (eharney) → Rongze Zhu (zrzhit)
summary: - LVM volume cloning is failing with I/O error
+ volume cloning is failing with an I/O error happening when the backing
+ snapshot is wiped
Changed in cinder:
importance: Undecided → High
milestone: none → havana-3
Eric Harney (eharney)
Changed in cinder:
milestone: havana-3 → havana-rc1
Revision history for this message
Joe Breu (breu) wrote :

We ran into this today as well and started to dig into the reason behind the failure. What we discovered is that the snapshot's usable space is actually less than the total space of the original LV.

With the current implementation, the snapshot is created with exactly the same size as the original LV. When dd is run on the snapshot, we cannot write the full size of the original LV to the snapshot LV, because the COW carries overhead: a disk header plus the bits for each exception block that is remapped onto the snapshot. When the COW no longer contains enough free space for new blocks to be written, the kernel invalidates the snapshot and disk I/O errors are generated.

In our general tests with a 1GB volume and a snapshot, we were able to write up to 99.6% of the data to the snapshot (1020 of the 1024 1M blocks) before the system reported I/O errors. This held true for 2GB and 4GB volumes in our tests, where we were able to write 2040 and 4080 blocks respectively.
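The figures quoted above are consistent across sizes; a quick arithmetic check:

```python
# Sanity-checking the reported measurements: the unwritable tail is the
# same fraction at every volume size, consistent with proportional COW
# bookkeeping overhead.
for total, usable in [(1024, 1020), (2048, 2040), (4096, 4080)]:
    print("%d MB volume: %d of %d blocks writable (%.1f%%)"
          % (total, usable, total, 100.0 * usable / total))
# each size comes out to 99.6% writable
```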

Since the changed blocks are actually written to the COW and not the snapshot volume, we were able to modify the grizzly code to use the snapshot LV COW path instead of the snapshot LV for the secure deletion and thereby wipe the data from the snapshot without error and subsequently delete the LV.

The code change is minimal to make this work and was reliable in our testing. Code to follow.
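A sketch of the path change this approach describes: target the snapshot's COW device for the secure wipe rather than the snapshot LV itself. The helper name and structure here are mine, not the actual Cinder patch:

```python
def cow_path(vg, snap_lv):
    # the COW backing a thick LVM snapshot appears under /dev/mapper as
    # <escaped vg>-<escaped lv>-cow (hyphens in names are doubled)
    esc = lambda name: name.replace('-', '--')
    return "/dev/mapper/%s-%s-cow" % (esc(vg), esc(snap_lv))

# the dd zeroing then runs against this path instead of the snapshot LV:
print(cow_path("cinder-volumes",
               "clone-snap-727c84cc-69ee-4b7d-bdb0-832b5086f1bb"))
# -> /dev/mapper/cinder--volumes-clone--snap--727c84cc--69ee--4b7d--bdb0--832b5086f1bb-cow
```

Since the changed blocks live in the COW, wiping it clears the snapshot's data without overrunning the snapshot LV's writable space.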

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/45114

Changed in cinder:
assignee: Rongze Zhu (zrzhit) → Joseph W. Breu (breu)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/45117

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Joseph-

Commented on the master review but thought I'd weigh in here, also..

Hit the same bug today and found your proposal under review. Testing it against grizzly, cinder successfully wipes the -cow devices but later fails to lvremove, http://paste.ubuntu.com/6085414/.

I've stepped through the process manually, included is debug output of lvremove on a snapshot that had its corresponding -cow device dd'd: http://paste.ubuntu.com/6085596/

Changed in cinder:
assignee: Joseph W. Breu (breu) → John Griffith (john-griffith)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/47060

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/45114
Committed: http://github.com/openstack/cinder/commit/f702fe7e30e4021895dac8e7ab243e5192f8182d
Submitter: Jenkins
Branch: master

commit f702fe7e30e4021895dac8e7ab243e5192f8182d
Author: rackerjoe <email address hidden>
Date: Wed Sep 4 15:31:42 2013 -0500

    Fix secure delete for thick LVM snapshots

    This change modifies the behaviour of the secure delete for thick
    LVM snapshots to wipe the underlying COW of the snapshot LV
    instead of the snapshot LV itself.

    This change is necessary because the snapshot LV does not contain
    exactly the same number of writable blocks as the original LV. The
    COW includes header information per COW block that identifies the
    device as a COW device as well as the source and destination blocks
    for the changed item. The amount of metadata contained in the COW is
    variable based on I/O performed on the snapshot.

    This does not change the behavior of secure deletes on thin LVs
    or secure deletes on the thick LV snapshot origin.

    Closes-Bug: #1191812
    Change-Id: I20e02b6c20d5ac539b5b5469e665fc986180f2e9

Changed in cinder:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/grizzly)

Reviewed: https://review.openstack.org/45117
Committed: http://github.com/openstack/cinder/commit/0a86ef8de27a3e8d12b16ce5c20be18fa21b0e87
Submitter: Jenkins
Branch: stable/grizzly

commit 0a86ef8de27a3e8d12b16ce5c20be18fa21b0e87
Author: rackerjoe <email address hidden>
Date: Wed Sep 4 15:21:55 2013 -0500

    Fix secure delete for thick LVM snapshots

    This change modifies the behaviour of the secure delete for thick
    LVM snapshots to wipe the underlying COW of the snapshot LV
    instead of the snapshot LV itself.

    This change is necessary because the snapshot LV does not contain
    exactly the same number of writable blocks as the original LV. The
    COW includes header information per COW block that identifies the
    device as a COW device as well as the source and destination blocks
    for the changed item. The amount of metadata contained in the COW is
    variable based on I/O performed on the snapshot.

    This does not change the behavior of secure deletes on thin LVs
    or secure deletes on the thick LV snapshot origin.

    Closes-Bug: #1191812
    Change-Id: I308b554429ac981d709c19b78cdc00389a3f598a

tags: added: in-stable-grizzly
Thierry Carrez (ttx)
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: havana-rc1 → 2013.2
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential in-stable-grizzly