Precise kernel locks up while dd to /dev/mapper files > 1Gb (was: Unable to delete volume)

Bug #1023755 reported by Vincent Hou on 2012-07-12
This bug affects 7 people
Affects Status Importance Assigned to Milestone
Cinder
Critical
Unassigned
OpenStack Compute (nova)
Critical
John Griffith
linux (Ubuntu)
Undecided
Unassigned

Bug Description

I use the latest devstack and cinder code.
Steps to do:
1. Create a volume vol1.
2. Create a volume snapshot from the volume vol1.
3. Delete the volume snapshot. With command cinder snapshot-delete.

The snapshot could not be deleted. The status of the snapshot is stuck at "deleting". The log shows no error. I hope someone can take a look.

Vincent Hou (houshengbo) on 2012-07-12
description: updated
Mike Perez (thingee) on 2012-07-14
Changed in cinder:
assignee: nobody → Mike Perez (thingee)
Mike Perez (thingee) wrote :

I was not able to reproduce this behavior with the latest code, unfortunately. Here's what I did:

$ cinder create 1
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+-------------+
| ID | Status | Display Name | Size | Volume Type | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+-------------+
| 0ca4a6ba-010d-416c-abf0-af9a03e5e576 | available | None | 1 | None | |
+--------------------------------------+-----------+--------------+------+-------------+-------------+
$ cinder snapshot-create 0ca4a6ba-010d-416c-abf0-af9a03e5e576
$ cinder snapshot-list
+--------------------------------------+--------------------------------------+-----------+--------------+------+
| ID | Volume ID | Status | Display Name | Size |
+--------------------------------------+--------------------------------------+-----------+--------------+------+
| 85a0079e-253c-4b14-809e-699da6874fd8 | 0ca4a6ba-010d-416c-abf0-af9a03e5e576 | available | None | 1 |
+--------------------------------------+--------------------------------------+-----------+--------------+------+
$ cinder snapshot-delete 85a0079e-253c-4b14-809e-699da6874fd8
$ cinder snapshot-list
$

Can you please tell me exactly how you're deleting the snapshot? I know you said there was no information in the logs, but can you double-check the cinder api logs? Thanks!

Vincent Hou (houshengbo) wrote :

Hi Mike. What I was doing is exactly the same as what you did. I tested on multiple machines and found something strange.
When the volume size is left at the default, with no change to the localrc file, the snapshot can be deleted successfully; when the backing file size is set to a different value, like VOLUME_BACKING_FILE_SIZE=25G in localrc, the issue appears and the status of the snapshot gets stuck at "deleting".

======================Log of c-api========================
When command cinder snapshot-delete is called, the following log prompts:
2012-07-16 13:51:26 INFO cinder.api.openstack.wsgi [req-036bc027-cee0-4ff3-a2b0-cf72d2f989c4 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] GET http://9.119.148.111:8776/v1/71dda5584d8145d6a0b379f61b4f3e36/snapshots/90850e6a-d5c3-465e-971c-2897b1587222
2012-07-16 13:51:26 DEBUG cinder.api.openstack.wsgi [req-036bc027-cee0-4ff3-a2b0-cf72d2f989c4 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] Unrecognized Content-Type provided in request from (pid=13765) get_body /opt/stack/cinder/cinder/api/openstack/wsgi.py:696
2012-07-16 13:51:26 INFO cinder.api.openstack.wsgi [req-036bc027-cee0-4ff3-a2b0-cf72d2f989c4 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] http://9.119.148.111:8776/v1/71dda5584d8145d6a0b379f61b4f3e36/snapshots/90850e6a-d5c3-465e-971c-2897b1587222 returned with HTTP 200
2012-07-16 13:51:26 INFO cinder.api.openstack.wsgi [req-ecca3603-0c64-4cbc-8bf0-1ada7bcce170 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] DELETE http://9.119.148.111:8776/v1/71dda5584d8145d6a0b379f61b4f3e36/snapshots/90850e6a-d5c3-465e-971c-2897b1587222
2012-07-16 13:51:26 DEBUG cinder.api.openstack.wsgi [req-ecca3603-0c64-4cbc-8bf0-1ada7bcce170 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] Unrecognized Content-Type provided in request from (pid=13765) get_body /opt/stack/cinder/cinder/api/openstack/wsgi.py:696
2012-07-16 13:51:26 AUDIT cinder.api.openstack.volume.snapshots [req-ecca3603-0c64-4cbc-8bf0-1ada7bcce170 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] Delete snapshot with id: 90850e6a-d5c3-465e-971c-2897b1587222
2012-07-16 13:51:26 DEBUG cinder.rpc.amqp [req-ecca3603-0c64-4cbc-8bf0-1ada7bcce170 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] Making asynchronous cast on cinder-volume.vincent-ThinkCentre-M52... from (pid=13765) cast /opt/stack/cinder/cinder/rpc/amqp.py:363
2012-07-16 13:51:26 INFO cinder.api.openstack.wsgi [req-ecca3603-0c64-4cbc-8bf0-1ada7bcce170 818048fb7d084e118c2d7e03bdb497ea 71dda5584d8145d6a0b379f61b4f3e36] http://9.119.148.111:8776/v1/71dda5584d8145d6a0b379f61b4f3e36/snapshots/90850e6a-d5c3-465e-971c-2897b1587222 returned with HTTP 202

Vincent Hou (houshengbo) wrote :

BTW, please create a volume larger than 1G.

Vincent Hou (houshengbo) wrote :

If possible, please do not update an existing checkout to do the test. I found that updating the existing code does not reproduce the bug. Do a brand-new installation of the cinder code from master trunk.

Mike Perez (thingee) wrote :

Unfortunately after starting up 8 different instances (yes really), I was not able to reproduce this behavior. Here's an example:

$ cinder create 5
$ cinder list
+--------------------------------------+-----------+--------------+------+-------------+-------------+
| ID | Status | Display Name | Size | Volume Type | Attached to |
+--------------------------------------+-----------+--------------+------+-------------+-------------+
| 8d5ecba7-98fa-4c56-b3d8-ef6f8920a5a6 | available | None | 5 | None | |
+--------------------------------------+-----------+--------------+------+-------------+-------------+
$ cinder snapshot-create 8d5ecba7-98fa-4c56-b3d8-ef6f8920a5a6
$ cinder snapshot-list
+--------------------------------------+--------------------------------------+-----------+--------------+------+
| ID | Volume ID | Status | Display Name | Size |
+--------------------------------------+--------------------------------------+-----------+--------------+------+
| 3f7f89ad-7d6d-462c-a145-009b73065595 | 8d5ecba7-98fa-4c56-b3d8-ef6f8920a5a6 | available | None | 5 |
+--------------------------------------+--------------------------------------+-----------+--------------+------+
$ cinder snapshot-delete 3f7f89ad-7d6d-462c-a145-009b73065595
$ cinder snapshot-list
+--------------------------------------+--------------------------------------+----------+--------------+------+
| ID | Volume ID | Status | Display Name | Size |
+--------------------------------------+--------------------------------------+----------+--------------+------+
| 3f7f89ad-7d6d-462c-a145-009b73065595 | 8d5ecba7-98fa-4c56-b3d8-ef6f8920a5a6 | deleting | None | 5 |
+--------------------------------------+--------------------------------------+----------+--------------+------+
$ cinder snapshot-list

$

Vincent Hou (houshengbo) wrote :

Hi Mike. I have been checking this bug these days too.

To reproduce it, each time I install Ubuntu 12.04 on a PHYSICAL machine. Then I run the devstack script.

When I tried
                        cinder snapshot-delete <snapshot_id>,
the status of this snapshot is stuck at "deleting".

Each time I run cinder snapshot-list, it takes a VERY long time to respond, and the status of the snapshot is always "deleting". At the same time, the machine gets sort of frozen: it becomes rather slow when I do anything else, even moving the cursor or clicking on a folder.

This issue feels strange to me.

Vincent Hou (houshengbo) wrote :

I have tested on both a physical machine and a VM. The VM resides on my physical machine. My physical machine has 4G of memory.

On the VM, the snapshot is finally gone, but it takes about 7-8 minutes to delete it. That is LONG, but well, it finally succeeds.

On the physical machine, the status is still "deleting" after 30 minutes. Even if it succeeds eventually, that is already beyond anyone's expectation.

Well, I don't think this is normal. Have a try when it is convenient for you.

Vincent Hou (houshengbo) wrote :

Some new discovery:
Ubuntu 11.10 is a better OS than Ubuntu 12.04 as an OpenStack host system.
I tried it on Ubuntu 11.10. The volume snapshot can be deleted in a very short time. For me, this bug has become an issue specific to Ubuntu 12.04. Anyway, thanks Mike for your concern.

Vincent Hou (houshengbo) wrote :

I put all my logs here for reference.


Hi fellows. There are people meeting the same issue and adding their comments from the horizon project. Please take a look: https://bugs.launchpad.net/horizon/+bug/1028726

Mike Perez (thingee) wrote :

Hey Vincent, for a 5G snapshot on a physical machine with 12.04 I'm seeing the following:

      2/1 0.000 0.000 13.865 13.865 manager.py:219(delete_snapshot)
        2 0.000 0.000 0.751 0.375 driver.py:83(_try_execute)
        1 0.000 0.000 13.090 13.090 driver.py:113(_copy_volume)

It seems like 13s is being spent on the dd that zeroes out the volume.
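For reference, a profile like the one above can be gathered with Python's cProfile module. A minimal, self-contained sketch (the function bodies here are placeholders standing in for cinder's delete path, not the actual cinder code):

```python
import cProfile
import io
import pstats

def _copy_volume():
    # Placeholder for the zeroing step; cinder shells out to
    # "dd if=/dev/zero of=<device> ..." here.
    sum(range(1000))

def delete_snapshot():
    # Placeholder for the manager's snapshot-delete path.
    _copy_volume()

profiler = cProfile.Profile()
profiler.enable()
delete_snapshot()
profiler.disable()

# Print cumulative-time statistics, the same columns shown above.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
print(stream.getvalue())
```

In the real run the profiler was attached to the running cinder-volume service; the output format is the same.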

Mike Perez (thingee) wrote :

Btw, the above output comes from using cProfile. Here's the full output.

John Griffith (john-griffith) wrote :

I've posted some comments in the Horizon version of this as well, but I'm unable to reproduce this issue even using the lvdisplay method outlined by others.

I'm going to close this out tomorrow unless I hear some new updates from someone. If it comes back up again and we find a way to reliably reproduce it we can open it back up.

Changed in cinder:
status: New → Triaged
John Griffith (john-griffith) wrote :

Doing some more testing and a bit of research on this and unfortunately I found some answers to this.

It seems that there are a number of issues with performing dd to /dev/mapper files >= 1G. The result appears to basically be a kernel hang, also noted in some of the information I came across is horrid performance.

This aligns with the issues described including the terrible performance/time it can take for the delete when things don't hang. I'll have to think a bit about the best solution here.

John Griffith (john-griffith) wrote :

A little more thought and detail on this. The issue only seems to be with LVM2 snapshots (ie we don't see this with regular LVM volumes obviously).

I've been trying to convert a snapshot to a standard linear volume using lvmodify; however, based on how LVM snapshots work, it doesn't seem that this is an option. One thing that came to mind, though: is it necessary to perform the dd on a snapshot delete at all? AFAIK the snapshot volume only contains differing blocks anyway, in which case we're just talking inode tables; when the original volume is deleted and the dd is performed on it, all of the data is essentially wiped anyway.

Zeroing out the LVM snapshot doesn't seem necessary, so I propose it be removed. I'll submit a patch depending on the response to this post.

Changed in cinder:
importance: Undecided → Critical
milestone: none → folsom-rc1
assignee: Mike Perez (thingee) → John Griffith (john-griffith)

Fix proposed to branch: master
Review: https://review.openstack.org/12457

Changed in cinder:
status: Triaged → In Progress
Changed in nova:
status: New → Invalid

Fix proposed to branch: master
Review: https://review.openstack.org/12659

Changed in nova:
assignee: nobody → John Griffith (john-griffith)
status: Invalid → In Progress

This fix finally makes it work... It was a horrible experience to wait 30 min for my Ubuntu to restart each time I reproduced the error ...

Here is a similar problem in RedHat:
http://www.redhat.com/archives/linux-lvm/2012-June/msg00022.html

The common thing with Ubuntu is that when the "dd if=/dev/zero..." hangs on snapshot removal and a system restart is invoked, the logs show an error unmounting the /dev/mapper/dm-XX devices used for the snapshots, and this error prevents the system from fully restarting. It's not quite obvious whether 'dd' causes the device-mapper hang, but it is probably worth looking at this Red Hat thread.

Tihomir Trifonov (ttrifonov) wrote :

Just an idea on volume creation: since this fix moves the volume zeroing from delete to create, what about adding a new state for volume creation, e.g. "Initializing" or "Building"? Otherwise, for larger volumes, the user might think that the operation hung on "Creating".

Reviewed: https://review.openstack.org/12457
Committed: http://github.com/openstack/cinder/commit/1b3322d45fe2c5ed72cc7f8674e5e319928065ad
Submitter: Jenkins
Branch: master

commit 1b3322d45fe2c5ed72cc7f8674e5e319928065ad
Author: John Griffith <email address hidden>
Date: Wed Sep 5 19:27:48 2012 +0000

    Don't zero out snapshot volume on snapshot_delete

     When trying to zero out an LVM snapshot on precise the kernel sometimes
     hangs when performing the dd. Also the dd process itself can take
     an extremely long time even when it does successfully complete.
     This can be up to 30 minutes for a 1 Gig volume/snapshot.

     I believe this is a kernel specific issue with LVM snapshots. The
     zeroing process is unreliable and can cause kernel hangs, so let's remove it.

     In order to protect against data leakage we'll implement the zeroing process
     on volume creation. This doesn't seem to have a significant impact and doesn't
     suffer from the same issues that zeroing out an LVM snapshot does.

     No reason to continue zeroing on delete; the zero on creation should
     probably be sufficient.

     Doesn't seem to cause any timing issues, but we need to keep this in mind.

     Fixes bug 1023755

    Change-Id: I56209d8e5973cffa997b4ec3e51c3361838386de

Changed in cinder:
status: In Progress → Fix Committed

I agree with Tihomir. This bug used to drive me insane. I also recommend adding more states for volume creation.

John Griffith (john-griffith) wrote :

The current "creating" state seems like it should be sufficient, IMO. With this change, the volume status will be "creating" until the dd is complete, at which time it will go to "available".
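The zero-on-create approach from the commit can be sketched in plain Python. This is an illustrative model only (it writes zeros to an ordinary file; the real cinder driver runs the equivalent of "dd if=/dev/zero of=<device> bs=1M" via rootwrap against a /dev/mapper device):

```python
import os

def zero_volume(path: str, size_bytes: int, chunk: int = 1024 * 1024) -> None:
    """Overwrite a volume's backing store with zeros, chunk by chunk.

    Hypothetical helper, not cinder's actual code: the driver shells
    out to dd instead of writing from Python.
    """
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n

# Zero-on-create: wipe the store before the volume is handed out, so
# no zeroing pass (and no dd against an LVM snapshot) is needed later
# when snapshots of it are deleted.
zero_volume("/tmp/demo-volume", 4 * 1024 * 1024)
print(os.path.getsize("/tmp/demo-volume"))
```

The volume stays in "creating" for the duration of this pass, which is why the state-name discussion above matters for large volumes.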

Chuck Short (zulcss) wrote :

When this happens I get the following in my kern.log:

Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.188153] INFO: task kworker/0:1:19 blocked for more than 120 seconds.
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.190751] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.193638] kworker/0:1 D 0000000000000000 0 19 2 0x00000000
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.193641] ffff88007b8ebab0 0000000000000046 00000000357fca28 ffffffff81c17400
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.193644] ffff88007b8ebfd8 ffff88007b8ebfd8 ffff88007b8ebfd8 00000000000136c0
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.193646] ffff88007aaf5b80 ffff88007b8d44a0 ffff88007b8eba80 ffff88007fc13f80
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.193648] Call Trace:
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.197632] [<ffffffff816508cf>] schedule+0x3f/0x60
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220403] [<ffffffff8165097f>] io_schedule+0x8f/0xd0
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220437] [<ffffffff814f66bd>] sync_io+0x8d/0x140
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220457] [<ffffffff81117067>] ? mempool_free+0x57/0xb0
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220466] [<ffffffff814f68a8>] dm_io+0x138/0x140
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220473] [<ffffffff814f60b0>] ? dp_init+0xe0/0xe0
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220480] [<ffffffff814f5f70>] ? bvec_next_page+0x10/0x10
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220495] [<ffffffffa0309601>] chunk_io+0x111/0x120 [dm_snapshot]
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220507] [<ffffffffa0308180>] ? pending_complete+0x310/0x310 [dm_snapshot]
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220519] [<ffffffffa0309867>] persistent_commit_exception+0x117/0x160 [dm_snapshot]
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220530] [<ffffffffa0308180>] ? pending_complete+0x310/0x310 [dm_snapshot]
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220540] [<ffffffffa03081b3>] copy_callback+0x33/0x50 [dm_snapshot]
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220549] [<ffffffff814f72c4>] run_complete_job+0x74/0xd0
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220557] [<ffffffff814f6f55>] process_jobs+0x75/0x110
Sep 11 19:09:40 server-3bc2ef49-4871-4dcc-a7c4-7de14141e6ec kernel: [ 1680.220564] [<ffffffff814f7250>] ? dispatch_job+0x80/0x80

Chuck Short (zulcss) wrote :

Kernel version 3.2.0-29-virtual

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1023755

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mark McLoughlin (markmc) on 2012-09-11
Changed in nova:
milestone: none → folsom-rc1
importance: Undecided → Critical

Hi Brad,
I've reverted the work-around and reproduced this quite easily. The problem is, once this happens the kernel is locked up. I don't have the ability to run apport or even pull a copy of the logs for you.

Not sure how to get you the information you need here, any suggestions?

BTW:
To reproduce, do a clean devstack install in virtual-box:
1. create a 2 gig volume
2. create a snapshot of that volume
3. delete that snapshot

Changed in cinder:
status: Fix Committed → In Progress
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Rohan (kanaderohan) wrote :

I am facing the same issue in stable/essex.
Snapshot is stuck at "deleting" status. When I run "nova volume-snapshot-delete <snapshot_id>", it gives me the following error msg:
"ERROR: Invalid volume: Volume Snapshot status must be available or error (HTTP 400) (Request-ID: req-6619b4c0-a9c3-46bf-8404-fb959400911b)"

Rohan (kanaderohan) wrote :

Forgot to add: I am not using cinder; this problem is occurring with the "nova-volume" service.

Thierry Carrez (ttx) wrote :

Should be fixed at the kernel level

summary: - Unable to delete the volume snapshot
+ Precise kernel locks up while dd to /dev/mapper files > 1Gb (was: Unable
+ to delete volume)
Changed in nova:
status: In Progress → Won't Fix
Changed in cinder:
status: In Progress → Won't Fix
Thierry Carrez (ttx) on 2012-09-13
Changed in nova:
milestone: folsom-rc1 → none
Thierry Carrez (ttx) on 2012-09-13
Changed in cinder:
milestone: folsom-rc1 → none
John Griffith (john-griffith) wrote :

Pseudo-script to reproduce outside of openstack/devstack:

*This omits the iscsi target creation/deletion steps.*

DATA_DIR=~/DATA
VOLUME_BACKING_FILE_SIZE=${VOLUME_BACKING_FILE_SIZE:-5130M}
VOLUME_GROUP=${VOLUME_GROUP:-stack-volumes}
VOLUME_BACKING_FILE=${VOLUME_BACKING_FILE:-$DATA_DIR/${VOLUME_GROUP}-backing-file}

if ! sudo vgs $VOLUME_GROUP; then
    VOLUME_BACKING_FILE=${VOLUME_BACKING_FILE:-$DATA_DIR/${VOLUME_GROUP}-backing-file}
    # Only create if the file doesn't already exist
    [[ -f $VOLUME_BACKING_FILE ]] || truncate -s $VOLUME_BACKING_FILE_SIZE $VOLUME_BACKING_FILE
    DEV=`sudo losetup -f --show $VOLUME_BACKING_FILE`
    # Only create if the loopback device doesn't contain $VOLUME_GROUP
    if ! sudo vgs $VOLUME_GROUP; then sudo vgcreate $VOLUME_GROUP $DEV; fi
fi

lvcreate -L 2G -n "test-vol" $VOLUME_GROUP
lvcreate -L 2G --name "test-snap" --snapshot "test-vol"

dd if=/dev/zero of=test-vol count=2048 bs=1M iflag=direct oflag=direct

Stefan Bader (smb) wrote :

Just while thinking about it (I have not yet run a reproduction), I think I can guess roughly where the problem might be. Both the original volume and the snapshot are set to 2G (at least in the example). The way a snapshot volume works is that it contains a mapping table. Sectors that have not changed in either the original or the snapshot refer to the original volume. Changes to the origin cause a copy of the original sector into the exception storage, and the table is updated to point to the exception storage. Writes to the snapshot go into the exception storage directly, and the table is updated as before. For this reason the snapshot volume is not a plain volume; it is used to store the mapping table and the exception sectors. The table gets bigger with the original volume's size. So my guess is that, at around this 2G size, a 2G snapshot cannot store exceptions for all sectors, which is what happens when zeroing the whole origin.
Theoretically I would think a write should return an error in that case. But maybe this is a problem caused by having the VG mapped onto a loop device. Also, the whole system hanging is unexpected. Could the lack of a propagated IO error cause dd not to see the end of the device, so that in the end all memory is exhausted by buffers that cannot be written?

This could potentially be a problem too, if the user of the device (I assume the origin is exported, allowing its user to update the contents while the snapshot retains the unmodified state) happens to modify _all_ sectors (though I think that will be quite rare). So much for theories; I need to find out whether that is actually true. It might be worth trying to make the snapshot a bit bigger and seeing what happens. Ultimately I guess one wants to convert the snapshot storage into a linear volume and then erase it (or have it erased while dropping).
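A back-of-envelope check of the hypothesis above: a COW device the same size as the origin cannot absorb a rewrite of every origin chunk, because the persistent exception store also holds metadata. The figures below are assumptions based on dm-snapshot's persistent store layout (16-byte on-disk exceptions, one metadata chunk per group of data chunks, one header chunk), not measured values:

```python
CHUNK = 4096                     # assumed COW chunk size in bytes
ORIGIN = 2 * 1024**3             # 2 GiB origin volume
COW = 2 * 1024**3                # 2 GiB snapshot (same size as origin)

exceptions_per_md_chunk = CHUNK // 16   # 256 on-disk exceptions per metadata chunk
cow_chunks = COW // CHUNK - 1           # chunks available, minus the header chunk

# Each group of exceptions_per_md_chunk data chunks needs one metadata chunk.
group = exceptions_per_md_chunk + 1
storable = (cow_chunks // group) * exceptions_per_md_chunk

origin_chunks = ORIGIN // CHUNK
# The COW falls a little short of holding an exception for every origin chunk.
print(storable, origin_chunks, storable < origin_chunks)
```

Under these assumptions the shortfall is small (a fraction of a percent), which is consistent with the snapshot filling up only when the *entire* origin is overwritten, as a full zeroing pass does.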

Stefan Bader (smb) wrote :

I was unable to reproduce this with the steps shown in comment #32 (with two changes as below). The used kernel was 3.2.0-27.43-generic running as a VM. I tried both the VG on a partition and the VG on a loop-mounted backing file.

To make the testcase work, the snapshot lvcreate needed "--snapshot <vg>/<lv>" and the dd command used was "dd if=/dev/zero of=/dev/<vg>/<lv> bs=1M oflag=direct". This would end with an IO error as soon as the snapshot exception store was full.

John Griffith (john-griffith) wrote :

Hi Stefan, can you confirm whether you did or did not successfully reproduce this issue?

Stefan Bader (smb) wrote :

John, I could *not* reproduce it. For me the dd would (expectedly) abort with an error.

Stefan Bader (smb) wrote :

I tried over the weekend with a Xen guest and 1, 2, and 4 CPUs. No luck in failing.

Just a general question for understanding: is it the original volume that gets zeroed? And out of curiosity: am I understanding this correctly, that the snapshot is kept as-is and the origin is handed out to be used in a guest? Just wondering why it is done that way instead of letting the snapshot get modified (no, this is not really related to the problem).

Stefan Bader (smb) wrote :

Oh, is there a certain amount of memory related to the failures?

John Griffith (john-griffith) wrote :

Hey Stefan,
Actually, the volume is NOT deleted/zeroed if the snapshot exists. Snapshots must be deleted first.

My setup is devstack in a VirtualBox VM with 1G of memory allocated. This really seems to be the most effective way to reproduce. I have also been successful using devstack on a physical machine, however this was less frequent.

Stefan Bader (smb) wrote :

John, would you be able to guide me through the steps to set devstack up to the point where I can use it to reproduce the problem?

John Griffith (john-griffith) wrote :

Hi Stefan,

Best way to set this up and reproduce is using virtual-box with a clean 12.04 server image installed.
Personally at this point I typically do a snapshot or clone so I can easily start over again.

After that you can follow the steps outlined on the devstack page to get things up and running: http://devstack.org/guides/single-vm.html

Once openstack is up and running, you can check its logs by entering 'screen -x'. Now you're ready to try to reproduce the issue.

I've uploaded a testing script to my github account: https://github.com/j-griffith/volume-script/tree/master/exercises, check out the README.md for info on where to place the script and how to run it.

Stefan Bader (smb) wrote :

I have been experimenting with the devstack setup on a Xen guest (if possible I want to use Xen, because I can take dumps there). The guest setup is 2 VCPUs, 1G memory, and 8G and 5G virtual disks (the second is mounted at /opt/stack to be used by the volume group of the test; this is a remainder of the other setup and I surely can reinstall with a single big enough disk). There is about 1G of swap space used on the first disk.

Running the test, the snapshot is successfully created:

#> dmsetup ls --tree
stack--volumes-_snapshot--68081c3b--d18b--48ca--8f15--e14ec8b0606e (252:1)
 ├─stack--volumes-_snapshot--68081c3b--d18b--48ca--8f15--e14ec8b0606e-cow (2...
 │ └─ (7:0)
 └─stack--volumes-volume--37759b39--408e--4f2a--a58c--9b0adf9dfa43-real (252...
    └─ (7:0)
stack--volumes-volume--37759b39--408e--4f2a--a58c--9b0adf9dfa43 (252:0)
 └─stack--volumes-volume--37759b39--408e--4f2a--a58c--9b0adf9dfa43-real (252...
    └─ (7:0)

When deleting the snapshot, the system gets under pressure (and the timeout of 30s seems too short for erasing 2G). In contrast to what the lvm commands in comment #32 said, it is the snapshot volume that gets erased:

6652 pts/14 S+ 0:00 sudo /usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf dd if=/dev/zero of=/dev/mapper/stack--volumes-_snapshot--68081c3b--d18b--48ca--8f15--e14ec8b0606e count=2048 bs=1M

This actually makes a lot of sense, because erasing the origin volume would cause much more IO in the background, as every block that gets erased is first copied into the exception storage. By erasing the snapshot instead, only the zeroed block gets written to the exception storage, and (I am not sure how often) the persistent COW table has to be updated.
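The asymmetry described above can be illustrated with a toy copy-on-write model (pure Python, no relation to the actual device-mapper code): erasing the origin forces a copy-out of every old block into the exception store before the zero is written, while erasing the snapshot writes the zeros straight into the store.

```python
class ToySnapshot:
    """Minimal copy-on-write snapshot model (illustrative only)."""

    def __init__(self, origin):
        self.origin = origin          # list of blocks
        self.exceptions = {}          # block index -> block stored for the snapshot
        self.copies = 0               # copy-outs from origin into the store

    def write_origin(self, i, data):
        # The first write to an origin block must preserve the old
        # contents for the snapshot: copy-out, then update the origin.
        if i not in self.exceptions:
            self.exceptions[i] = self.origin[i]
            self.copies += 1
        self.origin[i] = data

    def write_snapshot(self, i, data):
        # Writes to the snapshot go straight into the exception store;
        # the origin is untouched, so no copy-out is needed.
        self.exceptions[i] = data

blocks = 8

# Zeroing the origin: every block is copied out first, doubling the IO.
snap = ToySnapshot([b"x"] * blocks)
for i in range(blocks):
    snap.write_origin(i, b"\0")
print(snap.copies)    # 8

# Zeroing the snapshot: plain writes into the store, zero copy-outs.
snap2 = ToySnapshot([b"x"] * blocks)
for i in range(blocks):
    snap2.write_snapshot(i, b"\0")
print(snap2.copies)   # 0
```

On top of the copy-outs, the real persistent store also has to keep its on-disk exception table synchronized, which the model ignores.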

While for me the system does not hang, I notice that after the activity caused by the erase stops (and nova volume-snapshot-list now returns nothing), the device-mapper setup is not clean:

#> dmsetup ls --tree
stack--volumes-_snapshot--68081c3b--d18b--48ca--8f15--e14ec8b0606e-cow (252:...
 └─ (7:0)
stack--volumes-volume--37759b39--408e--4f2a--a58c--9b0adf9dfa43 (252:0)
 └─ (7:0)

#> lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  volume-37759b39-408e-4f2a-a58c-9b0adf9dfa43 stack-volumes -wi-ao 2.00g

So the snapshot was mostly disbanded, but the exception storage is still present, though I was able to remove it manually via dmsetup. After doing that, I could issue a "nova volume-delete" to get rid of the origin volume, too (note, this again does a dd, but compared to the dd on the snapshot, this one seems much faster).

Stefan Bader (smb) wrote :

Some thoughts on the previous data:
1. With devstack running it is likely that some portions of the 1G memory are used by executables.
2. Having the VG on a loop-mounted file can add some overhead on the IO, as everything has to
    go through an additional layer, and it is not guaranteed that consecutive blocks of the loop
    block device map to sequential blocks in the file system. That could make merges harder.
3. I am not sure whether for the Virtualbox VM a paravirtual device driver exists or whether block
    devices are presented through an emulated controller (often PIIX). Compared to the paravirt
    driven block devices I use, the emulated devices are much slower.
4. Another factor that is often causing problems is that with 12.04, the default IO scheduler changed
    from deadline to cfq. It might be worth a try to set "elevator=deadline" in the kernel command
    line.

Looking at this, it could be a combination of slower block devices, memory pressure, and possibly some other event that causes problems for the kernel thread that has to maintain a valid exception table on disk while writing the zero blocks. At the least this causes non-sequential access patterns, as the exception table is somewhere at the start of the snapshot COW. As those need to be synchronized, this would explain why erasing the snapshot takes much longer.

Next I would try to force the Xen guest to use emulated devices.

Stefan Bader (smb) wrote :

With emulated devices under Xen, I still get the same behaviour as before. So no lockup, but it takes long and leaves the COW volume behind. To put numbers on "long": from issuing the snapshot delete until the hdd stops writing a lot, it takes about 6 minutes; for the volume delete it is about 1m40.

Stefan Bader (smb) wrote :

Ah, it seems that starting the script a second time, after the manual cleanup seemed successful, will cause a hang. Apparently on deleting the snapshot, but there is no activity on the disk.

Stefan Bader (smb) wrote :

Now that is interesting: after getting the system into a similar hang on the second execution of the vol-test script, I took a dump and looked at it. The system was suspended for the duration of the dump process, maybe 1 minute. After a while I suddenly noticed disk activity on the host, and the guest, which before seemed unresponsive, was suddenly actually doing the snapshot delete. According to timestamps, this took about 47 minutes!

Thierry Carrez (ttx) wrote :

@Stefan: this is actually making "OpenStack on LTS" look quite bad. I was wondering if there is a way to increase the priority of this bug?

Stefan Bader (smb) wrote :

I will be back working on this (though realistically not before next week). From the various dumps I have been looking at so far, it looks like the problem is somewhere in the way the storage stack is set up, possibly combined with a slightly higher memory footprint.
Basically, CPUs are in schedule waiting for some IO to complete, and I wonder whether everything on the IO path has at least some emergency buffers in place for cases where IO has to keep running in order to make room for more IO.

John, would you be able to experiment a bit with the setup in order to narrow this down? What I would like to check is a setup where the volume group resides on a real partition or disk (so excluding the loop device and the filesystem IO).
At some point it will also be good to re-check against a recent mainline kernel. I will be able to do that; I would just likely need more time to figure out how to re-configure devstack to use a VG based on a raw block device.

John Griffith (john-griffith) wrote :

I'll see if I can find some time to spin this up on a physical LVM and let you know how it goes.

Vish Ishaya (vishvananda) wrote :

As noted in lib/cinder (or the devstack docs at http://devstack.org/lib/cinder.html), if you create a volume group called stack-volumes, devstack will use that instead of creating a loopback device. You can even use a different volume group by putting VOLUME_GROUP=xxx in your localrc.

Vish Ishaya (vishvananda) wrote :

FYI, I could reproduce a hang in dd by doing the following:

truncate -s 1G foo
sudo losetup -f --show foo
echo "0 2097152 zero" | sudo dmsetup create zero1
echo "0 2097152 snapshot /dev/mapper/zero1 /dev/loop0 n 128" | sudo dmsetup create clean1
sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024

Using strace, the command hangs partway through:

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576

I can ctrl-c out of strace, but the resulting dd process is completely hung and unkillable.

Note that this is running in a (vmware fusion) vm.
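
For reference, the `2097152` in the dmsetup tables above is the device length in 512-byte sectors (dmsetup's unit), which works out to exactly the 1 GiB created with `truncate -s 1G`:

```shell
# dmsetup table lengths are counted in 512-byte sectors,
# so a 1 GiB device is 2097152 sectors long:
echo $(( 1024 * 1024 * 1024 / 512 ))   # prints 2097152
```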

Vish Ishaya (vishvananda) wrote :

FYI, running the same command on quantal (with kernel 3.5.0-18-generic) works fine.

Stefan Bader (smb) wrote :

From what I have gathered so far from the dump I took with the devstack setup on a loop block device, there are many outstanding IO requests for the snapshot-cow device (expected, as dd is busy). But digging down to loop0 (which is backing the VG), its worker thread got into a situation where it tries to balance dirty pages and went into IO schedule for that. So it looks a bit like the write to ext4 needs some page(s) which it cannot get, because the rest of memory is filled with incomplete requests from the snapshot-cow.

In parallel I ran the same test on the same machine after adding a USB disk drive that had a stack-volumes VG on it. In that setup the test runs without problems (though 30s would be far too small a timeout; in reality it takes about 70s to delete the snapshot or the volume). Even so, the procedure seems to leak the snapshot-cow device (it still exists after deleting the snapshot and also after deleting the volume). I cannot say why this happens, but it can be removed without any issues using dmsetup commands.

Vish, with your test case, would it make a difference to use deadline as the IO scheduler (as root: "echo deadline > /sys/block/xxx/queue/scheduler") on Precise? We know that cfq seems to behave badly in certain situations, and I wonder whether changing the IO scheduler at least improves the situation.

I still need to check whether ext4 is prepared to handle low-memory situations like this (or maybe there is a fix for that in later kernels). The behaviour might also change depending on the speed of the block devices involved. So another thing to check would be to change the dd to do synchronous IO (if I find the right place to change it).
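
The synchronous-IO variant of the dd can be expressed with dd's `oflag` option (`oflag=sync` flushes each output block; `oflag=direct` bypasses the page cache entirely). A minimal sketch, run here against a temporary file instead of /dev/mapper/clean1 so it works anywhere:

```shell
# Sketch of dd with synchronous writes; the temp file stands in for
# the /dev/mapper/clean1 device from the test case above.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=4096 count=4 oflag=sync 2>/dev/null
wc -c < "$tmp"   # prints 16384 (4 blocks of 4096 bytes)
rm -f "$tmp"
```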

Vish Ishaya (vishvananda) wrote :

I tried this on real hardware and it appears to work although it is abysmally slow:

vishvananda@nebstack006:~$ echo "0 2097152 zero" | sudo dmsetup create zero1
vishvananda@nebstack006:~$ echo "0 2097152 snapshot /dev/mapper/zero1 /dev/sde n 128" | sudo dmsetup create clean1
vishvananda@nebstack006:~$ sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.1229 s, 71.0 MB/s
vishvananda@nebstack006:~$ time sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.03134 s, 134 MB/s

real 0m8.041s
user 0m0.004s
sys 0m1.208s
vishvananda@nebstack006:~$ time sudo dd if=/dev/zero of=/dev/sde bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.666776 s, 1.6 GB/s

real 0m0.677s
user 0m0.000s
sys 0m0.668s
vishvananda@nebstack006:~$ cat /sys/block/sde/queue/scheduler
noop deadline [cfq]
vishvananda@nebstack006:~$ echo deadline | sudo tee /sys/block/sde/queue/scheduler
deadline
vishvananda@nebstack006:~$ cat /sys/block/sde/queue/scheduler
noop [deadline] cfq
vishvananda@nebstack006:~$ time sudo dd if=/dev/zero of=/dev/sde bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.503988 s, 2.1 GB/s

real 0m0.513s
user 0m0.004s
sys 0m0.504s
vishvananda@nebstack006:~$ time sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.316 s, 94.9 MB/s

real 0m11.327s
user 0m0.000s
sys 0m1.184s

going to try the same on quantal to see how it looks.

Vish Ishaya (vishvananda) wrote :

same crummy performance on quantal, so the bad performance may be unrelated:

vishvananda@nebstack006:~$ sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.2082 s, 70.6 MB/s
vishvananda@nebstack006:~$ time sudo dd if=/dev/zero of=/dev/mapper/clean1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.00752 s, 134 MB/s

real 0m8.017s
user 0m0.008s
sys 0m1.116s

Stefan Bader (smb) wrote :

I am doing the tests on real hardware, too, now. And I see some issues with the Quantal kernel (at least) when it is set to cfq. So I think there are at least two issues: one which I am currently trying to bisect between 3.3-rc7 and 3.3, which would at least make things work slow(er), and another to figure out after that.

Btw, I think you get slightly better results than the devstack case because you create non-persistent snapshots with a bigger chunk size.

Stefan Bader (smb) wrote :

A bit of a status update: unfortunately I had to realize that single test runs are not at all sufficient here. With several runs of the vol-test script, the problem will trigger even with 3.6 kernels. A 3.7 kernel could not be tested because the Precise userspace had issues with the iscsitarget setup. In any case, the bisection results from the previous comment are invalid.

Fix proposed to branch: master
Review: https://review.openstack.org/16798

Changed in cinder:
assignee: John Griffith (john-griffith) → Pádraig Brady (p-draigbrady)
status: Won't Fix → In Progress
Changed in cinder:
assignee: Pádraig Brady (p-draigbrady) → nobody
status: In Progress → Invalid

Reviewed: https://review.openstack.org/16798
Committed: http://github.com/openstack/cinder/commit/1405a6440d646524d41adfed4fc1344948e2871f
Submitter: Jenkins
Branch: master

commit 1405a6440d646524d41adfed4fc1344948e2871f
Author: Pádraig Brady <email address hidden>
Date: Fri Nov 23 11:24:44 2012 +0000

    use O_DIRECT when copying from /dev/zero too

    We need to avoid trying O_DIRECT with virtual
    devices like /dev/zero and /dev/urandom etc.
    as it's not supported there, which because of
    our later check, would cause O_DIRECT to be
    not applied to the output device either.

    Related to bug 937694
    Possibly related to bug 1023755
    Change-Id: I034d8345c3d00689c1f1fffcc2c110735c49ee01

Changed in cinder:
status: Invalid → Fix Committed
Changed in cinder:
status: Fix Committed → In Progress
Thierry Carrez (ttx) wrote :

Why oh why reopen Pandora's box? A comment would have been nice :)

John Griffith (john-griffith) wrote :

This bug was marked as "Fix Committed" against Cinder, which is INCORRECT. The root cause of the issue is an Ubuntu bug, NOT a Cinder bug.

Changing it back to Invalid. If we need a new bug for a workaround, so be it, but this particular bug should be left clean for the Ubuntu issue IMO. :)

Changed in cinder:
status: In Progress → Invalid
Thiago Martins (martinx) wrote :

Just out of curiosity... isn't this related to the following bugs:

Ubuntu server uses CFQ scheduler instead of deadline:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1008400

Kernel I/O scheduling writes starving reads, local DoS:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

???

What can be done?

Edit the file: /etc/default/grub and add:

GRUB_CMDLINE_LINUX="elevator=deadline"

run:

update-grub
reboot

Test it again...

Best,
Thiago
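
After the reboot, whether the boot option took effect can be checked in /proc/cmdline. A sketch of the check, run here against a sample command-line string in place of the real /proc/cmdline (the BOOT_IMAGE and root= values are made-up examples):

```shell
# On a real system, substitute "$(cat /proc/cmdline)" for the sample.
cmdline="BOOT_IMAGE=/vmlinuz-3.2.0-23-generic root=/dev/sda1 ro elevator=deadline"
echo "$cmdline" | grep -o 'elevator=[a-z]*'   # prints elevator=deadline
```

The per-device effect can also be confirmed via `cat /sys/block/sda/queue/scheduler`, where the active scheduler is shown in brackets.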

Stefan Bader (smb) wrote :

Thiago, this plays a bit into the issue. But mainly this is a result of changes in the writeback code, which does some complex throttling based on memory limits and estimated drive speeds. That has problems with the setup used here, because we now have two backing devices that are not independent and that require further memory to make progress.
When dd writes onto the snapshot, every snapshot-chunk's worth of blocks creates an exception. That means for the first block touching a new chunk, the chunk gets copied to the snapshot COW storage and the mapping table (on the same device) needs to be updated. Then again, when the loop thread processes this, related IO gets generated (access-time updates, journal).
This can be observed even with the deadline elevator and more recent kernel versions. 3.8 seemed much better but still had it, though that could just be different timing while filling buffers, so that the dd process normally gets blocked sooner. It is hard to pin down specific changes there.
Right now I am trying to get to a stage that would make Precise at least usable. There is one change that seems to at least keep the whole system from hanging. But the dd process still will not complete, because the loop0 thread responsible for moving the requests from the LV side to the filesystem is blocked: the backing device is assumed to have no bandwidth, but it has no bandwidth because it never gets anything to do...
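
The COW amplification described above can be put into rough numbers. Assuming the 128-sector chunk size from the earlier dmsetup tables, a 1 GiB dd touches every chunk once, and each first touch costs a chunk copy plus a metadata update:

```shell
chunk_sectors=128                          # from the dmsetup table above
chunk_bytes=$(( chunk_sectors * 512 ))     # 65536 bytes (64 KiB) per chunk
write_bytes=$(( 1024 * 1024 * 1024 ))      # the 1 GiB dd
echo $(( write_bytes / chunk_bytes ))      # chunks copied: prints 16384
```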

Thiago Martins (martinx) wrote :

Thank you for your explanation Stefan!

Mate Lakat (mate-lakat) wrote :

We had the same issue with devstack running in a XenServer domU. After we changed to the -virtual kernel, these problems were gone. See: https://review.openstack.org/#/c/24297/

Mate Lakat (mate-lakat) wrote :

Let me amend my previous comment: the blocking still shows up during more stressful tests, even with the -virtual kernel.
