Volume manager loses track of exception

Bug #1197648 reported by Vincent Hou on 2013-07-04
This bug affects 2 people
Affects: Cinder
Importance: High
Assigned to: Huang Zhiteng

Bug Description

Environment:
Ubuntu 12.10
latest devstack and latest cinder.
volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
iscsi_helper = tgtadm

Steps to reproduce:
1. Create a volume with cinder create 1.
2. When the volume is available, clone a volume from this one with cinder create --source-volid $vol_id --display_name=clone 1.

After this, I need to wait for a while...

In the end I get the following error log:
=====================================output of cinder-volume========================================
2013-07-04 13:47:09.408 ERROR cinder.volume.manager [req-a9ec0d24-3552-4b73-bf4c-59e4c176aeb6 c310c0a65c844b7cbf3af8c2b43a1ed1 0d136b6ab5cd4593bdefeb5cd589e6e4] volume volume-9e244264-6137-41b8-9793-379962543e1e: create failed
2013-07-04 13:47:09.410 ERROR cinder.openstack.common.rpc.amqp [req-a9ec0d24-3552-4b73-bf4c-59e4c176aeb6 c310c0a65c844b7cbf3af8c2b43a1ed1 0d136b6ab5cd4593bdefeb5cd589e6e4] Exception during message handling
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp Traceback (most recent call last):
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/amqp.py", line 433, in _process_data
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp **args)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/dispatcher.py", line 148, in dispatch
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 307, in create_volume
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp "create.end")
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp self.gen.next()
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 283, in create_volume
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp allow_reschedule)
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 366, in _reschedule_or_reraise
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp raise exc_info[0], exc_info[1], exc_info[2]
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
2013-07-04 13:47:09.410 TRACE cinder.openstack.common.rpc.amqp
==============================================================================================
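For context, the TypeError at the bottom of the trace comes from Python 2's three-argument raise being handed a (None, None, None) tuple. A minimal sketch of the same failure (Python 3 spelling, since the three-argument raise is Python 2 only; the helper name is illustrative, not cinder code):

```python
import sys

def raise_from_exc_info(exc_info):
    # Python 2 wrote ``raise exc_info[0], exc_info[1], exc_info[2]``;
    # the closest Python 3 spelling re-raises the exception instance.
    if exc_info[1] is not None:
        raise exc_info[1].with_traceback(exc_info[2])
    # With a cleared tuple this raises None -- the TypeError in the log.
    raise exc_info[0]

# Normal path: capture inside an except block, re-raise later.
try:
    1 / 0
except ZeroDivisionError:
    captured = sys.exc_info()
```

Calling raise_from_exc_info(captured) re-raises the ZeroDivisionError; calling it with (None, None, None) reproduces the "exceptions must derive from BaseException" TypeError.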

John Griffith (john-griffith) wrote :

A couple of things need to be cleaned up in the manager code there. For starters, we should probably do "except Exception as ex" and pass ex into the retry method along with sys.exc_info, and the retry method signature should give sys.exc_info a default of None.

And finally, based on what we're seeing here, check sys.exc_info for None before trying to extract the tuple in the retry method.

Changed in cinder:
status: New → Triaged
importance: Undecided → High
Rongze Zhu (zrzhit) on 2013-07-05
Changed in cinder:
assignee: nobody → Rongze Zhu (zrzhit)
Changed in cinder:
status: Triaged → Incomplete
Huang Zhiteng (zhiteng-huang) wrote :

Vincent, John,

The error log looks like a side effect of the clone failure. If this bug is about reschedule not handling the previous exception information well, I'm fine with that, but it seems to me that Vincent wanted to file a bug for the failure of creating a volume clone with LVM itself? Right? Even if it is the latter case (I haven't tried the same test), I don't think the reproduction steps provide enough information, unless it is a bug in the LVM driver.

So Vincent, could you please clarify what this bug report is about and provide more information?

And John, I agree we should refactor create_volume(); it's a huge function that has become too complex.

Rongze Zhu (zrzhit) wrote :

I added some code to cinder/cinder/volume/manager.py and reproduced it:

diff --git a/cinder/volume/manager.py b/cinder/volume/manager.py
index d8e23c2..6331165 100644
--- a/cinder/volume/manager.py
+++ b/cinder/volume/manager.py
@@ -270,12 +270,16 @@ class VolumeManager(manager.SchedulerDependentManager):
                                       volume_ref['id'],
                                       {'status': 'error'})
                 return
- except Exception:
+ except Exception as ex:
                 # restore source volume status before reschedule
                 if sourcevol_ref is not None:
                     self.db.volume_update(context, sourcevol_ref['id'],
                                           {'status': sourcevol_ref['status']})
+ print "==="*20
+ print ex
+ print "==="*20
                 exc_info = sys.exc_info()
+ print exc_info
                 # try to re-schedule volume:
                 self._reschedule_or_reraise(context, volume_id, exc_info,
                                             snapshot_id, image_id,

error log:

============================================================
Unexpected error while running command.
Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf dd if=/dev/zero of=/dev/mapper/stack--volumes-clone--snap--12b110c4--01f2--4221--a06f--3883cb3a3d0a count=1024 bs=1M conv=fdatasync
Exit code: 1
Stdout: ''
Stderr: "/bin/dd: fdatasync failed for `/dev/mapper/stack--volumes-clone--snap--12b110c4--01f2--4221--a06f--3883cb3a3d0a': Input/output error\n1024+0 records in\n1024+0 records out\n1073741824 bytes (1.1 GB) copied, 466.395 s, 2.3 MB/s\n"
============================================================
(None, None, None)
2013-07-04 21:07:23.759 ERROR cinder.volume.manager [req-781185b1-59ef-4d32-a3c5-1ff4c4f36769 d18561d834914ffba68fbb7a0c869ecf 827e49101c9b48e8998360d1e6abb346] volume volume-8419dd91-957b-4759-9ce4-78b08e5c0bd4: create failed
2013-07-04 21:07:23.760 ERROR cinder.openstack.common.rpc.amqp [req-781185b1-59ef-4d32-a3c5-1ff4c4f36769 d18561d834914ffba68fbb7a0c869ecf 827e49101c9b48e8998360d1e6abb346] Exception during message handling
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp Traceback (most recent call last):
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/amqp.py", line 433, in _process_data
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp **args)
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/openstack/common/rpc/dispatcher.py", line 148, in dispatch
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/opt/stack/cinder/cinder/volume/manager.py", line 311, in create_volume
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp "create.end")
2013-07-04 21:07:23.760 TRACE cinder.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-07-04 21:07:23.760 TRACE cinde...


Rongze Zhu (zrzhit) wrote :

This is dmesg:

[ 5754.760185] INFO: task kworker/1:1:3801 blocked for more than 120 seconds.
[ 5754.761229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5754.764108] kworker/1:1 D 0000000000000001 0 3801 2 0x00000000
[ 5754.764114] ffff8800386dbaa0 0000000000000046 ffff8800386dbad0 ffffffff8152c397
[ 5754.764119] ffff8800386dbfd8 ffff8800386dbfd8 ffff8800386dbfd8 00000000000139c0
[ 5754.764159] ffff88003864ae00 ffff8800385dc500 ffff8800386dba70 ffff88007f634258
[ 5754.764163] Call Trace:
[ 5754.764295] [<ffffffff8152c397>] ? dispatch_io+0x117/0x130
[ 5754.764329] [<ffffffff8169d8b9>] schedule+0x29/0x70
[ 5754.764335] [<ffffffff8169d98f>] io_schedule+0x8f/0xd0
[ 5754.764338] [<ffffffff8152c43d>] sync_io+0x8d/0x140
[ 5754.764342] [<ffffffff8152c43d>] ? sync_io+0x8d/0x140
[ 5754.764404] [<ffffffff8117268a>] ? discard_slab+0x3a/0x50
[ 5754.764409] [<ffffffff8152c628>] dm_io+0x138/0x140
[ 5754.764412] [<ffffffff8152c628>] ? dm_io+0x138/0x140
[ 5754.764415] [<ffffffff8152be30>] ? dp_init+0xe0/0xe0
[ 5754.764418] [<ffffffff8152bcf0>] ? bvec_next_page+0x10/0x10
[ 5754.764445] [<ffffffffa04df551>] chunk_io+0x111/0x120 [dm_snapshot]
[ 5754.764450] [<ffffffffa04df551>] ? chunk_io+0x111/0x120 [dm_snapshot]
[ 5754.764454] [<ffffffffa04de160>] ? pending_complete+0x310/0x310 [dm_snapshot]
[ 5754.764458] [<ffffffffa04df5f6>] area_io+0x26/0x30 [dm_snapshot]
[ 5754.764462] [<ffffffffa04df77d>] persistent_commit_exception+0xdd/0x160 [dm_snapshot]
[ 5754.764465] [<ffffffffa04de160>] ? pending_complete+0x310/0x310 [dm_snapshot]
[ 5754.764469] [<ffffffffa04de193>] copy_callback+0x33/0x50 [dm_snapshot]
[ 5754.764472] [<ffffffff8152d054>] run_complete_job+0x74/0xd0
[ 5754.764475] [<ffffffff8152cce5>] process_jobs+0x75/0x110
[ 5754.764478] [<ffffffff8152cfe0>] ? dispatch_job+0x80/0x80
[ 5754.764481] [<ffffffff8152cd80>] ? process_jobs+0x110/0x110
[ 5754.764484] [<ffffffff8152cdb3>] do_work+0x33/0x70
[ 5754.764496] [<ffffffff81071bba>] process_one_work+0x11a/0x480
[ 5754.764500] [<ffffffff81072bd5>] worker_thread+0x165/0x370
[ 5754.764503] [<ffffffff81072a70>] ? manage_workers.isra.29+0x130/0x130
[ 5754.764506] [<ffffffff81077b73>] kthread+0x93/0xa0
[ 5754.764515] [<ffffffff816a8324>] kernel_thread_helper+0x4/0x10
[ 5754.764519] [<ffffffff81077ae0>] ? flush_kthread_worker+0xb0/0xb0
[ 5754.764522] [<ffffffff816a8320>] ? gs_change+0x13/0x13
[ 5754.764563] INFO: task vgs:5495 blocked for more than 120 seconds.
[ 5754.765987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5754.767635] vgs D ffffffff8180cbe0 0 5495 5494 0x00000000
[ 5754.767640] ffff88006d02faa8 0000000000000086 0000000000000000 ffff88007b30f300
[ 5754.767645] ffff88006d02ffd8 ffff88006d02ffd8 ffff88006d02ffd8 00000000000139c0
[ 5754.767648] ffff88007c05c500 ffff880038621700 ffff88006d02fa78 ffff88007f654258
[ 5754.767652] Call Trace:
[ 5754.767659] [<ffffffff8169d8b9>] schedule+0x29/0x70
[ 5754.767662] [<ffffffff8169d98f>] io_schedule+0x8f/0xd0
[ 5754.767741] [<ffffffff811c1234>] dio_await_completion+0x54/0xd0
[ 5754.767751] [<ffffffff811c36f4>] do_blockdev_direct_IO+0x964/0...


summary: - Unable to clone a volume
+ LVM Unable to clone a volume
Changed in cinder:
status: Incomplete → Confirmed
tags: added: driver lvm

After we create the clone volume, we need to zero the temp_snapshot; cinder writes the snapshot full, and an error occurs in device-mapper.

We can reproduce it.

OS: Ubuntu 12.04.2
Kernel: 3.2.0-40-virtual
LVM:
  LVM version: 2.02.66(2) (2010-05-20)
  Library version: 1.02.48 (2010-05-20)
  Driver version: 4.22.0

### 1. create vg

ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo truncate -s 10G stack-volumes
ubuntu@galaxy2:/mnt$ ll
total 24
drwxr-xr-x 3 root root 4096 Jul 5 07:09 ./
drwxr-xr-x 23 root root 4096 Jul 5 04:36 ../
drwx------ 2 root root 16384 Jun 11 08:29 lost+found/
-rw-r--r-- 1 root root 10737418240 Jul 5 07:09 stack-volumes
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo losetup -f --show stack-volumes
/dev/loop0
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo pvcreate /dev/loop0
  Physical volume "/dev/loop0" successfully created
ubuntu@galaxy2:/mnt$ sudo vgcreate vg0 /dev/loop0
  Volume group "vg0" successfully created

### 2. create the origin volume and snapshots of the origin volume

ubuntu@galaxy2:/mnt$ sudo lvcreate -L 1G -n lv0 vg0
  Logical volume "lv0" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 -wi-a- 1.00g
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvcreate -L 2G -n lv0_snapshot_0 --snapshot vg0/lv0
  Logical volume "lv0_snapshot_0" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvcreate -L 1G -n lv0_snapshot_1 --snapshot vg0/lv0
  Logical volume "lv0_snapshot_1" created
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-a- 1.00g
  lv0_snapshot_0 vg0 swi-a- 2.00g lv0 0.00
  lv0_snapshot_1 vg0 swi-a- 1.00g lv0 0.00
ubuntu@galaxy2:/mnt$

### 3. zero the first snapshot

ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo dd if=/dev/zero of=/dev/mapper/vg0-lv0_snapshot_0 bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 124.915 s, 8.6 MB/s
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  lv0 vg0 owi-a- 1.00g
  lv0_snapshot_0 vg0 swi-a- 2.00g lv0 50.20
  lv0_snapshot_1 vg0 swi-a- 1.00g lv0 0.00
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$
ubuntu@galaxy2:/mnt$ sudo lvdisplay vg0/lv0_snapshot_0
  --- Logical volume ---
  LV Name /dev/vg0/lv0_snapshot_0
  VG Name vg0
  LV UUID KE7i1E-ZNCR-mc2Z-XNtQ-wiO3-tNjE-a8p2pT
  LV Write Access read/write
  LV snapshot status active destination for /dev/vg0/lv0
  LV Status available
  # open 0
  LV Size 1.00 GiB
  Current LE 256
  COW-table size 2.00 GiB
  COW-table LE 512
  Allocated to snapshot 50.20%
  Snapshot chunk size ...
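As a sanity check on the lvs output above: 1 GiB zeroed into a 2 GiB COW table should consume about half of it, and the real figure (50.20%) is slightly higher because each copied-out chunk also costs a little metadata. A rough estimate:

```python
def cow_usage_pct(written_mib, cow_table_mib):
    # Fraction of the COW table consumed by copied-out data chunks,
    # ignoring the per-chunk metadata that pushes the real number
    # slightly above this (50.20% in the lvs output above).
    return 100.0 * written_mib / cow_table_mib

print(cow_usage_pct(1024, 2048))  # 50.0
```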


Huang Zhiteng (zhiteng-huang) wrote :

Rongze, thanks for following up. I've updated the bug summary to reflect the real case. I've also proposed a fix in the volume manager to handle the case when exc_info is a tuple of Nones.

Vincent, Rongze, please try out the volume manager fix and see how it works.

Fix proposed to branch: master
Review: https://review.openstack.org/35771

Changed in cinder:
assignee: Rongze Zhu (zrzhit) → Huang Zhiteng (zhiteng-huang)
status: Confirmed → In Progress

self.db.volume_update() will clear the exc_info. I traced the code and found that the invoked function engine.connect() clears the exc_info.

The call chain is: volume_update -> get_session() -> get_engine() -> create_engine() -> engine.connect().

I think sqlalchemy has a side effect: it clears exc_info.
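On Python 2, sys.exc_info() was thread-global state, so any callee that caught and swallowed its own exception (as engine.connect() apparently does internally) clobbered the caller's view of it. A minimal sketch of the defensive pattern: capture the tuple before touching the db (stand-in function names, not the real cinder code):

```python
import sys

def volume_update():
    # Stand-in for self.db.volume_update(): a library call that
    # internally raises and swallows its own exception. On Python 2
    # this clobbered the caller's thread-global sys.exc_info().
    try:
        raise IOError("transient connect error")
    except IOError:
        pass

def create_volume():
    try:
        raise RuntimeError("clone failed")
    except Exception:
        exc_info = sys.exc_info()  # capture BEFORE touching the db
        volume_update()
        raise exc_info[1].with_traceback(exc_info[2])
```

Capturing the tuple first pins the original exception, so the later re-raise sees "clone failed" rather than a cleared tuple.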

Eric Harney (eharney) wrote :

The LVM failure here is already being tracked in https://bugs.launchpad.net/cinder/+bug/1191812 .

I'd suggest using this bug for the exception fixup and handling the LVM issue in that bug.

Huang Zhiteng (zhiteng-huang) wrote :

Rongze, thanks for helping track down the source of clearing exc_info! I've updated the patch to address this.

Eric, bug description updated. Thanks!

summary: - LVM Unable to clone a volume
+ Volume manager loses track of exception
tags: added: exception
Edward Hope-Morley (hopem) wrote :

This is actually caused by a bug in python-eventlet < 0.13 and has now been fixed. See:

https://bitbucket.org/eventlet/eventlet/issue/149/yield-in-except-clause-with-wilcard-raise

It is also an issue that affects all services in OpenStack, so rather than patching every occurrence of the issue, upgrade python-eventlet to the latest version and your problem will hopefully go away.
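The eventlet issue linked above boils down to a bare raise inside an except clause no longer seeing the original exception after a greenlet switch (a yield) on Python 2, because the exception state was thread-global. Binding the exception to a name sidesteps that state entirely; a minimal generator sketch of the safe pattern:

```python
def gen_safe():
    try:
        raise ValueError("boom")
    except Exception as exc:
        # A context switch at the yield could clobber the thread-global
        # exception state on old Python 2 / eventlet; ``raise exc`` is
        # immune to that, unlike a bare ``raise``.
        yield "handling"
        raise exc
```

Resuming the generator past the yield re-raises the original ValueError regardless of what ran in between.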

Huang Zhiteng (zhiteng-huang) wrote :

Edward, thanks for the info about the eventlet bug. I agree that upgrading to the latest eventlet may well fix other similar bugs too, but this bug also affects the Grizzly stable release. It would be easier to cherry-pick the proposed fix to the stable/grizzly branch than to require a dependency version bump from distro packagers, I guess?

tags: added: backup-service
tags: added: grizzly-backport-potential
removed: backup-service
Changed in cinder:
milestone: none → havana-2

Reviewed: https://review.openstack.org/35771
Committed: http://github.com/openstack/cinder/commit/40aef764bb24330221b03bafc0e6d85fa493e383
Submitter: Jenkins
Branch: master

commit 40aef764bb24330221b03bafc0e6d85fa493e383
Author: Zhiteng Huang <email address hidden>
Date: Fri Jul 5 01:46:34 2013 -0700

    Refactor reschedule in exception handling of volume manager

    Previous exception handling has a pitfall that may potentially clear
    the sys.exc_info() (by calling SQLAlchemy to update the db). This patch
    refactors part of the exception handling in create_volume() to make
    sure sys.exc_info() is retrieved so that we can log and reraise it;
    we also make sure the exception is reraised at the end of exception
    handling no matter whether the request is rescheduled or not.

    As a side effect, we fixed a bug in a unittest which didn't provide
    the correct argument to the db API; previously this exception was
    wrongly consumed by the volume manager's exception handling (which
    did not reraise the exception when the request was rescheduled).

    fix bug: 1197648

    Change-Id: Idce5d06f8be1fb6018012503ec7f844898a21b25
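The shape of the refactor the commit message describes might look roughly like this (hypothetical simplified names, not the actual patch; see the review linked above for the real change):

```python
import sys

rescheduled = []

def do_create():
    # Stand-in for the driver call that fails.
    raise RuntimeError("driver failed")

def reschedule(exc_info):
    # Stand-in for the reschedule RPC; records what it was given.
    rescheduled.append(exc_info[0].__name__)

def create_volume(allow_reschedule=True):
    try:
        do_create()
    except Exception:
        # Capture sys.exc_info() first, before any db or RPC call can
        # clear it, then reraise unconditionally at the end -- whether
        # or not the request was rescheduled.
        exc_info = sys.exc_info()
        try:
            if allow_reschedule:
                reschedule(exc_info)
        finally:
            raise exc_info[1].with_traceback(exc_info[2])
```

The finally clause is what guarantees the "reraise no matter what" behavior: the original exception propagates even after a successful reschedule.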

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-07-17
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-10-17
Changed in cinder:
milestone: havana-2 → 2013.2
Alan Pevec (apevec) on 2014-03-31
tags: removed: grizzly-backport-potential
no longer affects: cinder/havana