Ceph OSD is down on one node after deployment, image removing from glance hangs

Bug #1590824 reported by Artem Panchenko
This bug report is a duplicate of: Bug #1587427: Ceph OSD is down after deployment.
This bug affects 1 person
Affects             Status  Importance  Assigned to        Milestone
Fuel for OpenStack  New     Undecided   Oleksiy Molchanov
Mitaka              New     Undecided   Oleksiy Molchanov

Bug Description

Fuel version info (9.0 mos build #458): http://paste.openstack.org/show/509227/

After a successful deployment, OSTF tests that create or remove images (or instance snapshots) fail because image deletion hangs in Glance. The cause is a Ceph OSD failure on one of the 5 nodes (osd.0 on node-1 is down):

root@node-1:~# ceph osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.34991 root default
-2 0.04999     host node-1
 0 0.04999         osd.0      down        0          1.00000
-3 0.04999     host node-3
 1 0.04999         osd.1        up  1.00000          1.00000
-4 0.04999     host node-2
 2 0.04999         osd.2        up  1.00000          1.00000
-5 0.09998     host node-4
 3 0.04999         osd.3        up  1.00000          1.00000
 6 0.04999         osd.6        up  1.00000          1.00000
-6 0.09998     host node-5
 4 0.04999         osd.4        up  1.00000          1.00000
 5 0.04999         osd.5        up  1.00000          1.00000
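
For triage, the down OSDs can be picked out of this output mechanically. The following is a minimal sketch, assuming the plain-text `ceph osd tree` layout shown above (OSD rows have the name in the third whitespace-separated field and the up/down state in the fourth). The function name `find_down_osds` and the `SAMPLE` constant are illustrative and not part of Ceph or Fuel.

```python
def find_down_osds(tree_text):
    """Return the names of OSDs whose UP/DOWN column reads 'down'.

    Assumes the plain-text layout of `ceph osd tree` shown in this report:
    OSD rows look like `ID WEIGHT osd.N STATE REWEIGHT PRIMARY-AFFINITY`.
    """
    down = []
    for line in tree_text.splitlines():
        fields = line.split()
        # Root/host rows and the header have no "osd.N" in the third field.
        if len(fields) >= 4 and fields[2].startswith("osd.") and fields[3] == "down":
            down.append(fields[2])
    return down


# Excerpt of the output from node-1, copied from this report.
SAMPLE = """\
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.34991 root default
-2 0.04999 host node-1
 0 0.04999 osd.0 down 0 1.00000
-3 0.04999 host node-3
 1 0.04999 osd.1 up 1.00000 1.00000
"""

if __name__ == "__main__":
    print(find_down_osds(SAMPLE))
```

Because the parser splits on whitespace, it does not depend on the column alignment of the output.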

The same issue was reported for Kilo: https://ask.openstack.org/en/question/69313/glance-image-delete-problem-with-ceph-backend/

It looks like ceph-osd died or was killed during the deployment. The upstart jobs still report start/running, but only ceph-mon is actually running:

root@node-1:~# service ceph-osd-all status
ceph-osd-all start/running
root@node-1:~# service ceph-all status
ceph-all start/running
root@node-1:~# ps aux | grep [c]eph
root 22524 0.1 1.1 279320 35900 ? Ssl 09:43 0:20 /usr/bin/ceph-mon --cluster=ceph -i node-1 -f
root@node-1:~#
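
The mismatch above (job reported as running, no matching process) can be checked automatically. This is a minimal sketch, assuming the `service ... status` and `ps aux` output formats shown in this report. The helper names (`job_reports_running`, `has_process`, `job_is_stale`) are hypothetical and not part of upstart or Ceph.

```python
import os


def job_reports_running(status_line):
    """True if an upstart status line claims the job is in start/running."""
    return "start/running" in status_line


def has_process(ps_text, binary):
    """True if `ps aux` output contains a process whose executable is `binary`.

    `ps aux` prints 10 columns before the command, so the command line is
    everything after the tenth whitespace-separated field.
    """
    for line in ps_text.splitlines():
        fields = line.split(None, 10)
        if len(fields) == 11:
            executable = fields[10].split()[0]
            if os.path.basename(executable) == binary:
                return True
    return False


def job_is_stale(status_line, ps_text, binary):
    """True if the job claims to be running but no such process exists."""
    return job_reports_running(status_line) and not has_process(ps_text, binary)


# Process list from node-1, copied from this report.
PS_SAMPLE = (
    "root 22524 0.1 1.1 279320 35900 ? Ssl 09:43 0:20 "
    "/usr/bin/ceph-mon --cluster=ceph -i node-1 -f\n"
)

if __name__ == "__main__":
    print(job_is_stale("ceph-osd-all start/running", PS_SAMPLE, "ceph-osd"))
```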

Logs from node-1:
  * ceph osd http://paste.openstack.org/show/509216/
  * ceph mon http://paste.openstack.org/show/509218/
  * puppet http://paste.openstack.org/show/509220
  * upstart-osd http://paste.openstack.org/show/509221/
  * upstart-osd-all http://paste.openstack.org/show/509222/

After I manually restarted the stopped OSD, image deletion from Glance started working again:

http://paste.openstack.org/

I think there are 3 problems here:

1) the upstart ceph-* jobs don't detect OSD failures
2) the Puppet task finished OK even though the OSD wasn't running
3) Glance operations (image deletion) just hung without reporting any errors
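
As an illustration of what a fix for problem 2 might look like, a deployment step could wait for all OSDs to come up and fail explicitly if they do not. This is only a sketch under stated assumptions, not how Fuel or its Puppet manifests work: `wait_for_osds_up` and `OsdsDownError` are hypothetical names, and `get_down_osds` stands for any caller-supplied function that returns the list of OSDs currently down.

```python
import time


class OsdsDownError(RuntimeError):
    """Raised when some OSDs are still down after the timeout."""


def wait_for_osds_up(get_down_osds, timeout=60.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until `get_down_osds()` returns an empty list, or fail loudly.

    `get_down_osds` is a hypothetical callable supplied by the caller;
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        down = get_down_osds()
        if not down:
            return
        if clock() >= deadline:
            raise OsdsDownError(
                "OSDs still down after %.0fs: %s" % (timeout, ", ".join(down))
            )
        sleep(interval)


if __name__ == "__main__":
    # Simulated cluster where osd.0 comes up on the third poll.
    states = iter([["osd.0"], ["osd.0"], []])
    wait_for_osds_up(lambda: next(states), timeout=5, interval=0)
    print("all OSDs up")
```

Raising an exception instead of returning a status makes it harder for the calling task to finish OK while an OSD is still down.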

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkQkJRQlZQeUtWRUE/view?usp=sharing

Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data for the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)