Volume manager loses track of exception

Bug #1197648 reported by Vincent Hou on 2013-07-04
This bug affects 2 people
Affects: Cinder
Importance: High
Assigned to: Huang Zhiteng

Bug Description

Environment:
Ubuntu 12.10
latest devstack and latest cinder.
volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
iscsi_helper = tgtadm

Steps to reproduce:
1. Create a volume with cinder create 1.
2. When the volume is available, clone a volume from this one with cinder create --source-volid $vol_id --display_name=clone 1.

After this, I need to wait for a while...

In the end I get the following error log:
=====================================output of cinder-volume========================================
2013-07-04 13:47:09.408 ERROR cinder.volume.manager [req-a9ec0d24-3552-4b73-bf4c-59e4c176aeb6 c310c0a65c844b7cbf3af8c2b43a1ed1 0d136b6ab5cd4593bdefeb5cd589e6e4] volume volume-9e244264-6137-41b8-9793-379962543e1e: create failed
2013-07-04 13:47:09.410 ERROR cinder.openstack.common.rpc.amqp [req-a9ec0d24-3552-4b73-bf4c-59e4c176aeb6 c310c0a65c844b7cbf3af8c2b43a1ed1 0d136b6ab5cd4593bdefeb5cd589e6e4] Exception during message handling
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp Traceback (most recent call last):
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/amqp.py", line 433, in _process_data
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp **args)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/dispatcher.py", line 148, in dispatch
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 307, in create_volume
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp "create.end")
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp self.gen.next()
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 283, in create_volume
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp allow_reschedule)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 366, in _reschedule_or_reraise
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp raise exc_info[0], exc_info[1], exc_info[2]
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp
==============================================================================================
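For context, the TypeError at the bottom of the trace comes from Python 2's three-argument raise being handed a (None, None, None) tuple. A minimal sketch of the same failure (Python 3 spelling, since the three-argument raise is Python 2 only; the helper name is illustrative, not cinder code):

```python
import sys

def raise_from_exc_info(exc_info):
    # Python 2 wrote ``raise exc_info[0], exc_info[1], exc_info[2]``;
    # the closest Python 3 spelling re-raises the exception instance.
    if exc_info[1] is not None:
        raise exc_info[1].with_traceback(exc_info[2])
    # With a cleared tuple this raises None -- the TypeError in the log.
    raise exc_info[0]

# Normal path: capture inside an except block, re-raise later.
try:
    1 / 0
except ZeroDivisionError:
    captured = sys.exc_info()
```

Calling raise_from_exc_info(captured) re-raises the ZeroDivisionError; calling it with (None, None, None) reproduces the "exceptions must derive from BaseException" TypeError.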

John Griffith (john-griffith) wrote :

A couple of things need to be cleaned up in the manager code there. For starters, we should probably do "except Exception as ex" and pass ex into the retry method along with sys.exc_info, and the retry method signature should give sys.exc_info a default of None.

And finally, based on what we're seeing here, check sys.exc_info for None before trying to extract the tuple in the retry method.

Changed in cinder:
status: New → Triaged
importance: Undecided → High
Rongze Zhu (zrzhit) on 2013-07-05
Changed in cinder:
assignee: nobody → Rongze Zhu (zrzhit)
Changed in cinder:
status: Triaged → Incomplete
Huang Zhiteng (zhiteng-huang) wrote :

Vincent, John,

The error log looks like a side effect of the clone failure. If this bug is about reschedule not handling the previous exception information well, I'm fine with that, but it seems to me that Vincent wanted to file a bug for the failure of creating a volume clone with LVM itself? Right? Even if it is the latter case (I haven't tried the same test), I don't think the reproduction steps provide enough information, unless it is a bug in the LVM driver.

So Vincent, could you please clarify what this bug report is about and provide more information?

And John, I agree we should refactor create_volume(); it's a huge function that has become too complex.

Rongze Zhu (zrzhit) wrote :

I added some code to cinder/cinder/volume/manager.py and reproduced it:

diff --git a/cinder/volume/manager.py b/cinder/volume/manager.py
index d8e23c2..6331165 100644
--- a/cinder/volume/manager.py
+++ b/cinder/volume/manager.py
@@ -270,12 +270,16 @@ class VolumeManager(manager.SchedulerDependentManager):
                                       volume_ref['id'],
                                       {'status': 'error'})
                 return
- except Exception:
+ except Exception as ex:
                 # restore source volume status before reschedule
                 if sourcevol_ref is not None:
                     self.db.volume_update(context, sourcevol_ref['id'],
                                           {'status': sourcevol_ref['status']})
+ print "==="*20
+ print ex
+ print "==="*20
                 exc_info = sys.exc_info()
+ print exc_info
                 # try to re-schedule volume:
                 self._reschedule_or_reraise(context, volume_id, exc_info,
                                             snapshot_id, image_id,

error log:

============================================================
Unexpected error while running command.
Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf dd if=/dev/zero of=/dev/mapper/stack--volumes-clone--snap--12b110c4--01f2--4221--a06f--3883cb3a3d0a count=1024 bs=1M conv=fdatasync
Exit code: 1
Stdout: ''
Stderr: "/bin/dd: fdatasync failed for `/dev/mapper/stack--volumes-clone--snap--12b110c4--01f2--4221--a06f--3883cb3a3d0a': Input/output error\n1024+0 records in\n1024+0 records out\n1073741824 bytes (1.1 GB) copied, 466.395 s, 2.3 MB/s\n"
============================================================
(None, None, None)
2013-07-04 21:07:23.759 ERROR cinder.volume.manager [req-781185b1-59ef-4d32-a3c5-1ff4c4f36769 d18561d834914ffba68fbb7a0c869ecf 827e49101c9b48e8998360d1e6abb346] volume volume-8419dd91-957b-4759-9ce4-78b08e5c0bd4: create failed
2013-07-04 21:07:23.760 ERROR cinder.openstack.common.rpc.amqp [req-781185b1-59ef-4d32-a3c5-1ff4c4f36769 d18561d834914ffba68fbb7a0c869ecf 827e49101c9b48e8998360d1e6abb346] Exception during message handling
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp Traceback (most recent call last):
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/amqp.py", line 433, in _process_data
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp **args)
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/dispatcher.py", line 148, in dispatch
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 311, in create_volume
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp "create.end")
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-07-04 21:07:23.760 TRACE cinde...


Rongze Zhu (zrzhit) wrote :

This is dmesg:

[ 5754.760185] INFO: task kworker/1:1:3801 blocked for more than 120 seconds.
[ 5754.761229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5754.764108] kworker/1:1 D 0000000000000001 0 3801 2 0x00000000
[ 5754.764114] ffff8800386dbaa0 0000000000000046 ffff8800386dbad0 ffffffff8152c397
[ 5754.764119] ffff8800386dbfd8 ffff8800386dbfd8 ffff8800386dbfd8 00000000000139c0
[ 5754.764159] ffff88003864ae00 ffff8800385dc500 ffff8800386dba70 ffff88007f634258
[ 5754.764163] Call Trace:
[ 5754.764295] [<ffffffff8152c397>] ? dispatch_io+0x117/0x130
[ 5754.764329] [<ffffffff8169d8b9>] schedule+0x29/0x70
[ 5754.764335] [<ffffffff8169d98f>] io_schedule+0x8f/0xd0
[ 5754.764338] [<ffffffff8152c43d>] sync_io+0x8d/0x140
[ 5754.764342] [<ffffffff8152c43d>] ? sync_io+0x8d/0x140
[ 5754.764404] [<ffffffff8117268a>] ? discard_slab+0x3a/0x50
[ 5754.764409] [<ffffffff8152c628>] dm_io+0x138/0x140
[ 5754.764412] [<ffffffff8152c628>] ? dm_io+0x138/0x140
[ 5754.764415] [<ffffffff8152be30>] ? dp_init+0xe0/0xe0
[ 5754.764418] [<ffffffff8152bcf0>] ? bvec_next_page+0x10/0x10
[ 5754.764445] [<ffffffffa04df551>] chunk_io+0x111/0x120 [dm_snapshot]
[ 5754.764450] [<ffffffffa04df551>] ? chunk_io+0x111/0x120 [dm_snapshot]
[ 5754.764454] [<ffffffffa04de160>] ? pending_complete+0x310/0x310 [dm_snapshot]
[ 5754.764458] [<ffffffffa04df5f6>] area_io+0x26/0x30 [dm_snapshot]
[ 5754.764462] [<ffffffffa04df77d>] persistent_commit_exception+0xdd/0x160 [dm_snapshot]
[ 5754.764465] [<ffffffffa04de160>] ? pending_complete+0x310/0x310 [dm_snapshot]
[ 5754.764469] [<ffffffffa04de193>] copy_callback+0x33/0x50 [dm_snapshot]
[ 5754.764472] [<ffffffff8152d054>] run_complete_job+0x74/0xd0
[ 5754.764475] [<ffffffff8152cce5>] process_jobs+0x75/0x110
[ 5754.764478] [<ffffffff8152cfe0>] ? dispatch_job+0x80/0x80
[ 5754.764481] [<ffffffff8152cd80>] ? process_jobs+0x110/0x110
[ 5754.764484] [<ffffffff8152cdb3>] do_work+0x33/0x70
[ 5754.764496] [<ffffffff81071bba>] process_one_work+0x11a/0x480
[ 5754.764500] [<ffffffff81072bd5>] worker_thread+0x165/0x370
[ 5754.764503] [<ffffffff81072a70>] ? manage_workers.isra.29+0x130/0x130
[ 5754.764506] [<ffffffff81077b73>] kthread+0x93/0xa0
[ 5754.764515] [<ffffffff816a8324>] kernel_thread_helper+0x4/0x10
[ 5754.764519] [<ffffffff81077ae0>] ? flush_kthread_worker+0xb0/0xb0
[ 5754.764522] [<ffffffff816a8320>] ? gs_change+0x13/0x13
[ 5754.764563] INFO: task vgs:5495 blocked for more than 120 seconds.
[ 5754.765987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5754.767635] vgs D ffffffff8180cbe0 0 5495 5494 0x00000000
[ 5754.767640] ffff88006d02faa8 0000000000000086 0000000000000000 ffff88007b30f300
[ 5754.767645] ffff88006d02ffd8 ffff88006d02ffd8 ffff88006d02ffd8 00000000000139c0
[ 5754.767648] ffff88007c05c500 ffff880038621700 ffff88006d02fa78 ffff88007f654258
[ 5754.767652] Call Trace:
[ 5754.767659] [<ffffffff8169d8b9>] schedule+0x29/0x70
[ 5754.767662] [<ffffffff8169d98f>] io_schedule+0x8f/0xd0
[ 5754.767741] [<ffffffff811c1234>] dio_await_completion+0x54/0xd0
[ 5754.767751] [<ffffffff811c36f4>] do_blockdev_direct_IO+0x964/0...


summary: - Unable to clone a volume
+ LVM Unable to clone a volume
Changed in cinder:
status: Incomplete → Confirmed
tags: added: driver lvm

After we create the clone volume, we need to zero the temp_snapshot; cinder writes the snapshot full, and an error occurs in device-mapper.

We can reproduce it.

OS: Ubuntu 12.04.2
Kernel: 3.2.0-40-virtual
LVM:
  LVM version: 2.02.66(2) (2010-05-20)
  Library version: 1.02.48 (2010-05-20)
  Driver version: 4.22.0

### 1. create vg

ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo truncate -s 10G stack-volumes
ubuntu@galaxy2:/mnt$ ll
total 24
drwxr-xr-x 3 root root 4096 Jul 5 07:09 ./
drwxr-xr-x 23 root root 4096 Jul 5 04:36 ../
drwx------ 2 root root 16384 Jun 11 08:29 lost+found/
-rw-r--r-- 1 root root 10737418240 Jul 5 07:09 stack-volumes
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo losetup -f --show stack-volumes
/dev/loop0
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo pvcreate /dev/loop0
  Physical volume "/dev/loop0" successfully created
ubuntu@galaxy2:/mnt$ sudo vgcreate vg0 /dev/loop0
  Volume group "vg0" successfully created

### 2. create the origin volume and snapshots of the origin volume

ubuntu@galaxy2:/mnt$ sudo lvcreate -L 1G -n lv0 vg0
  Logical volume "lv0" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 -wi-a- 1.00g
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvcreate -L 2G -n lv0_snapshot_0 --snapshot vg0/lv0
  Logical volume "lv0_snapshot_0" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvcreate -L 1G -n lv0_snapshot_1 --snapshot vg0/lv0
  Logical volume "lv0_snapshot_1" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-a- 1.00g
  lv0_snapshot_0 vg0 swi-a- 2.00g lv0 0.00
  lv0_snapshot_1 vg0 swi-a- 1.00g lv0 0.00
ubuntu@galaxy2:/mnt$

### 3. zero the first snapshot

ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo dd if=/dev/zero of=/dev/mapper/vg0-lv0_snapshot_0 bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 124.915 s, 8.6 MB/s
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-a- 1.00g
  lv0_snapshot_0 vg0 swi-a- 2.00g lv0 50.20
  lv0_snapshot_1 vg0 swi-a- 1.00g lv0 0.00
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvdisplay vg0/lv0_snapshot_0
  --- Logical volume ---
  LV Name /dev/vg0/lv0_snapshot_0
  VG Name vg0
  LV UUID KE7i1E-ZNCR-mc2Z-XNtQ-wiO3-tNjE-a8p2pT
  LV Write Access read/write
  LV snapshot status active destination for /dev/vg0/lv0
  LV Status available
  # open 0
  LV Size 1.00 GiB
  Current LE 256
  COW-table size 2.00 GiB
  COW-table LE 512
  Allocated to snapshot 50.20%
  Snapshot chunk size ...
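As a sanity check on the lvs output above: 1 GiB zeroed into a 2 GiB COW table should consume about half of it, and the real figure (50.20%) is slightly higher because each copied-out chunk also costs a little metadata. A rough estimate:

```python
def cow_usage_pct(written_mib, cow_table_mib):
    # Fraction of the COW table consumed by copied-out data chunks,
    # ignoring the per-chunk metadata that pushes the real number
    # slightly above this (50.20% in the lvs output above).
    return 100.0 * written_mib / cow_table_mib

print(cow_usage_pct(1024, 2048))  # 50.0
```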


Huang Zhiteng (zhiteng-huang) wrote :

Rongze, thanks for following up. I've updated the bug summary to reflect the real case. I've also proposed a fix in the volume manager to handle the case when exc_info is a tuple of Nones.

Vincent, Rongze, please try out the volume manager fix and see how it works.

Fix proposed to branch: master
Review: https://review.openstack.org/35771

Changed in cinder:
assignee: Rongze Zhu (zrzhit) → Huang Zhiteng (zhiteng-huang)
status: Confirmed → In Progress

self.db.volume_update() will clear the exc_info. I traced the code and found that the invoked function engine.connect() clears the exc_info.

The call chain is: volume_update -> get_session() -> get_engine() -> create_engine() -> engine.connect().

I think sqlalchemy has a side effect: it clears exc_info.
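On Python 2, sys.exc_info() was thread-global state, so any callee that caught and swallowed its own exception (as engine.connect() apparently does internally) clobbered the caller's view of it. A minimal sketch of the defensive pattern: capture the tuple before touching the db (stand-in function names, not the real cinder code):

```python
import sys

def volume_update():
    # Stand-in for self.db.volume_update(): a library call that
    # internally raises and swallows its own exception. On Python 2
    # this clobbered the caller's thread-global sys.exc_info().
    try:
        raise IOError("transient connect error")
    except IOError:
        pass

def create_volume():
    try:
        raise RuntimeError("clone failed")
    except Exception:
        exc_info = sys.exc_info()  # capture BEFORE touching the db
        volume_update()
        raise exc_info[1].with_traceback(exc_info[2])
```

Capturing the tuple first pins the original exception, so the later re-raise sees "clone failed" rather than a cleared tuple.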

Eric Harney (eharney) wrote :

The LVM failure here is already being tracked in https://bugs.launchpad.net/cinder/+bug/1191812 .

I'd suggest using this bug for the exception fixup and handling the LVM issue in that bug.

Huang Zhiteng (zhiteng-huang) wrote :

Rongze, thanks for helping track down the source of clearing exc_info! I've updated the patch to address this.

Eric, bug description updated. Thanks!

summary: - LVM Unable to clone a volume
+ Volume manager loses track of exception
tags: added: exception
Edward Hope-Morley (hopem) wrote :

This is actually caused by a bug in python-eventlet < 0.13 and has now been fixed. See:

https://bitbucket.org/eventlet/eventlet/issue/149/yield-in-except-clause-with-wilcard-raise

It is also an issue that affects all services in OpenStack, so rather than patching every occurrence of the issue, upgrade python-eventlet to the latest version and your problem will hopefully go away.
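The eventlet issue linked above boils down to a bare raise inside an except clause no longer seeing the original exception after a greenlet switch (a yield) on Python 2, because the exception state was thread-global. Binding the exception to a name sidesteps that state entirely; a minimal generator sketch of the safe pattern:

```python
def gen_safe():
    try:
        raise ValueError("boom")
    except Exception as exc:
        # A context switch at the yield could clobber the thread-global
        # exception state on old Python 2 / eventlet; ``raise exc`` is
        # immune to that, unlike a bare ``raise``.
        yield "handling"
        raise exc
```

Resuming the generator past the yield re-raises the original ValueError regardless of what ran in between.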

Huang Zhiteng (zhiteng-huang) wrote :

Edward, thanks for the info about the eventlet bug. I agree that upgrading to the latest eventlet may well fix other similar bugs too, but this bug also affects the Grizzly stable release. It would be easier to cherry-pick the proposed fix to the stable/grizzly branch than to require a dependency version bump from distro packagers, I guess?

tags: added: backup-service
tags: added: grizzly-backport-potential
removed: backup-service
Changed in cinder:
milestone: none → havana-2

Reviewed: https://review.openstack.org/35771
Committed: http://github.com/openstack/cinder/commit/40aef764bb24330221b03bafc0e6d85fa493e383
Submitter: Jenkins
Branch: master

commit 40aef764bb24330221b03bafc0e6d85fa493e383
Author: Zhiteng Huang <email address hidden>
Date: Fri Jul 5 01:46:34 2013 -0700

    Refactor reschedule in exception handling of volume manager

    Previous exception handling has a pitfall that may potentially clear
    the sys.exc_info() (by calling SQLAlchemy to update the db). This patch
    refactors part of the exception handling in create_volume() to make
    sure sys.exc_info() is retrieved so that we can log and reraise it;
    we also make sure the exception is reraised at the end of exception
    handling no matter whether the request is rescheduled or not.

    As a side effect, we fixed a bug in a unittest which didn't provide
    the correct argument to the db API; previously this exception was
    wrongly consumed by the volume manager's exception handling (which
    did not reraise the exception when the request was rescheduled).

    fix bug: 1197648

    Change-Id: Idce5d06f8be1fb6018012503ec7f844898a21b25
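The shape of the refactor the commit message describes might look roughly like this (hypothetical simplified names, not the actual patch; see the review linked above for the real change):

```python
import sys

rescheduled = []

def do_create():
    # Stand-in for the driver call that fails.
    raise RuntimeError("driver failed")

def reschedule(exc_info):
    # Stand-in for the reschedule RPC; records what it was given.
    rescheduled.append(exc_info[0].__name__)

def create_volume(allow_reschedule=True):
    try:
        do_create()
    except Exception:
        # Capture sys.exc_info() first, before any db or RPC call can
        # clear it, then reraise unconditionally at the end -- whether
        # or not the request was rescheduled.
        exc_info = sys.exc_info()
        try:
            if allow_reschedule:
                reschedule(exc_info)
        finally:
            raise exc_info[1].with_traceback(exc_info[2])
```

The finally clause is what guarantees the "reraise no matter what" behavior: the original exception propagates even after a successful reschedule.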

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-07-17
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-10-17
Changed in cinder:
milestone: havana-2 → 2013.2
Alan Pevec (apevec) on 2014-03-31
tags: removed: grizzly-backport-potential
no longer affects: cinder/havana