Ceph SEGV error while extracting cloned images from RBD

Bug #1260911 reported by Andrew Woodward
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Dmitry Borodaenko

Bug Description

After deploying on CentOS with Neutron VLAN from ISO #139:

{"release": "4.0", "nailgun_sha": "e5c0e954d6d03ba9fd857f1f7a662a6386099232", "ostf_sha": "c1c353909cd1a0af018bbe89fb12570db6b09969", "astute_sha": "75aa0877cba772f409d3cef4f36ba2ec1b8b603b", "fuellib_sha": "912a1c77d0506115452bb455b476d74b927f1153"}

The CirrOS image looks like this:
+-------------------+--------------------------------------+
| Property          | Value                                |
+-------------------+--------------------------------------+
| Property 'distro' | cirros                               |
| checksum          | aa5f78bc34691a1fd48adc40fad067f6     |
| container_format  | ovf                                  |
| created_at        | 2013-12-14T01:15:52                  |
| deleted           | False                                |
| disk_format       | raw                                  |
| id                | 7f051bed-05b2-4f30-bfb9-0cef8495b7f8 |
| is_public         | True                                 |
| min_disk          | 0                                    |
| min_ram           | 0                                    |
| name              | TestVM                               |
| owner             | dab00b4423814be2984898bee3461801     |
| protected         | False                                |
| size              | 14221312                             |
| status            | active                               |
| updated_at        | 2013-12-14T01:15:54                  |
+-------------------+--------------------------------------+

but it is invalid:

the console reports no boot device (see attachment).

P.S. Ryan saw it too.

Tags: ceph
Andrew Woodward (xarses) wrote :
Vladimir Kuklin (vkuklin) wrote :

This bug is specific to Ceph ephemeral storage. Disabling the RBD driver for nova-compute makes everything work.

Vladimir Kuklin (vkuklin) wrote :
summary: - cirros TestVM image does not boot
+ Ephemeral storage broken
Changed in fuel:
status: Confirmed → Triaged
Andrey Korolyov (xdeller) wrote : Re: Ephemeral storage broken

Dumpling bug. In the meantime, please remove CoW copying in Glance (the crash happens only when exporting a cloned and not-yet-flattened image) and report it to ceph-devel with me in CC.
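
To make the "cloned and not-yet-flattened" condition concrete, here is a minimal sketch. It assumes that `rbd info` prints a `parent:` line for a clone that still depends on its parent snapshot; the helper name `is_unflattened_clone` is hypothetical and not part of any Ceph tool.

```shell
# Hypothetical helper: succeed if `rbd info` lists a parent for the image,
# i.e. the image is a clone that has not been flattened yet.
# Assumption: clones show a line starting with "parent:" in `rbd info` output.
is_unflattened_clone() {
    pool=$1
    image=$2
    rbd -p "$pool" info "$image" | grep -q '^[[:space:]]*parent:'
}
```

For example, `is_unflattened_clone compute <instance-id>_disk && echo "still a clone"` would flag images that fall into the affected category. Whether flattening such an image avoids the crash is not established in this thread.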

Changed in fuel:
assignee: nobody → Dmitry Borodaenko (dborodaenko)
Dmitry Borodaenko (angdraug) wrote :

The problem is that our cirros packages keep the image in QCOW2 format, even though they tell Glance that it's RAW:

[root@node-1 ~]# file /opt/vm/cirros-0.3.1-x86_64-disk.img
/opt/vm/cirros-0.3.1-x86_64-disk.img: Qemu Image, Format: Qcow , Version: 2

File backend can deal with such inconsistency, but it doesn't work with Ceph RBD.
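
The mismatch described above can be checked before upload. The following is a rough sketch, assuming only that QCOW images begin with the four magic bytes `QFI\xfb`; the helper names `detect_image_format` and `check_declared_format` are hypothetical.

```shell
# Hypothetical helper: report the on-disk format suggested by the first four
# bytes of an image file. QCOW images start with the magic bytes "QFI\xfb"
# (hex 51 46 49 fb); anything else is reported as "raw-or-unknown".
detect_image_format() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "514649fb" ]; then
        echo "qcow"
    else
        echo "raw-or-unknown"
    fi
}

# Hypothetical helper: compare the detected format with the disk_format
# declared to Glance, and return non-zero when "raw" is declared for a file
# that carries a QCOW header.
check_declared_format() {
    declared=$1
    path=$2
    actual=$(detect_image_format "$path")
    if [ "$declared" = "raw" ] && [ "$actual" = "qcow" ]; then
        echo "mismatch: declared raw, but $path has a QCOW header" >&2
        return 1
    fi
    return 0
}
```

Running `check_declared_format raw /opt/vm/cirros-0.3.1-x86_64-disk.img` on the file shown above would be expected to report a mismatch. Converting such a file with `qemu-img convert -f qcow2 -O raw` before upload is one way to make the content match the declared format.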

Vladimir Kuklin (vkuklin) wrote :

The problem is that Ceph fails with a SEGFAULT while exporting the image to the compute RAM, leaving an inconsistent ephemeral disk image. We need to disable cloning of images and use copying instead in the Glance code.

Dmitry Borodaenko (angdraug) wrote :

According to Ceph developers, the attached stack trace matches http://tracker.ceph.com/issues/5426.

Andrey Korolyov (xdeller) wrote :

Please poke ceph-devel with a real-world case as soon as you've been able to reproduce it on Dumpling.

Dmitry Borodaenko (angdraug) wrote : Re: Ephemeral storage broken

Looking closer at the original bug report from Andrew, there's simply no way the issue he encountered could be related to the ephemeral RBD backend: ISO #139 didn't include the Puppet manifest changes to enable the RBD backend for Nova, so he was using the file backend when the bug occurred.

Can we untangle the Ceph crash when using RBD backend for Nova ephemeral storage from the Cirros image problem Andrew has originally reported?

Dmitry Borodaenko (angdraug) wrote : Re: Ceph SEGV error while cloning images

Unable to reproduce the Ceph SEGV problem in a clean environment based on ISO #146 (CentOS, Nova Network, Ephemeral RBD enabled).

Please provide exact steps sufficient to reproduce this problem, including nova and glance command lines.

Changed in fuel:
status: Triaged → Incomplete
assignee: Dmitry Borodaenko (dborodaenko) → Vladimir Kuklin (vkuklin)
summary: - Ephemeral storage broken
+ Ceph SEGV error while cloning images
Dmitry Borodaenko (angdraug) wrote :

The original CirrOS image problem also doesn't reproduce with ISO #146.

Andrey Korolyov (xdeller) wrote :

Not reproducible.

Changed in fuel:
importance: Critical → Low
Andrew Woodward (xarses)
tags: added: ceph
Dmitry Borodaenko (angdraug) wrote :

Reproduced with non-HA Ubuntu 12.04, Ceph enabled for Glance, Cinder, and Nova (and rados-gw). After converting and uploading a RAW CirrOS image, all you have to do is launch a VM from that image, then run rbd -p compute export <instance-id>_disk - > /dev/null.
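
The export step of this reproduction can be wrapped so that a crash is easy to tell apart from other failures. This is only a sketch: the helper name `try_export` is hypothetical, and it relies on the shell convention that a process killed by SIGSEGV (signal 11) is reported with exit status 128 + 11 = 139.

```shell
# Hypothetical helper: export an instance's ephemeral disk to /dev/null using
# the command from the comment above, then classify the exit status.
# A shell reports a process killed by SIGSEGV (signal 11) as status 139.
try_export() {
    pool=$1
    instance_id=$2
    status=0
    rbd -p "$pool" export "${instance_id}_disk" - > /dev/null || status=$?
    case $status in
        0)   echo "ok" ;;
        139) echo "segv" ;;
        *)   echo "failed ($status)" ;;
    esac
    return $status
}
```

Calling `try_export compute <instance-id>` on an affected deployment would be expected to print `segv`.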

Changed in fuel:
status: Incomplete → Confirmed
importance: Low → High
Dmitry Borodaenko (angdraug) wrote :

A WIP fix is being worked on by Inktank on this branch:
https://github.com/ceph/ceph/commits/wip-rbd-deadlock-lockdep-dumpling

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Dmitry Borodaenko (dborodaenko)
Dmitry Borodaenko (angdraug) wrote :

The fix mentioned above didn't make the segfault disappear:
http://paste.openstack.org/show/55391/

summary: - Ceph SEGV error while cloning images
+ Ceph SEGV error while extracting cloned images from RBD
Changed in fuel:
milestone: 4.0 → 4.1
Dmitry Borodaenko (angdraug) wrote :

A very simple fix was published by Josh Durgin:
https://github.com/ceph/ceph/pull/1000

Binary packages are available from:
http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/dumpling-5426/

Dmitry Borodaenko (angdraug) wrote :

Fix confirmed: with Ceph upgraded to the version from the gitbuilder repository above, the SEGV is no longer reproducible:

root@node-1:~# rbd -p compute ls
b750dd39-dbbf-4d55-b42e-34e5812f6c58_disk
root@node-1:~# rbd -p compute export b750dd39-dbbf-4d55-b42e-34e5812f6c58_disk - > /dev/null
Exporting image: 100% complete...done.
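
The same check can be repeated for every image in the pool after an upgrade. The sketch below assumes image names contain no whitespace; the helper name `check_pool_exports` is hypothetical, and only the `rbd ls` and `rbd export` invocations shown in the transcript are used.

```shell
# Hypothetical helper: export every image in a pool to /dev/null and print the
# names of images whose export fails. Returns non-zero if any export failed.
# Assumption: image names contain no whitespace.
check_pool_exports() {
    pool=$1
    failed=0
    for image in $(rbd -p "$pool" ls); do
        if ! rbd -p "$pool" export "$image" - > /dev/null; then
            echo "export failed: $image"
            failed=1
        fi
    done
    return $failed
}
```

With the fixed packages installed, `check_pool_exports compute` should print nothing and return zero.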

Dmitry Borodaenko (angdraug) wrote :

Jira OSCI-990 created to update Ceph packages in Fuel 4.0 and 4.1 to include this fix.

Changed in fuel:
status: Confirmed → In Progress
Dmitry Borodaenko (angdraug) wrote :

Updated Ceph packages are available in Fuel 4.1 repositories for CentOS (0.67.5-16.g69a99e6.el6) and Ubuntu (0.67.5-ubuntu0).

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released