Ceph OSD is down after deployment

Bug #1587427 reported by Volodymyr Shypyguzov
This bug affects 5 people
Affects             Status     Importance  Assigned to           Milestone
Fuel for OpenStack  Confirmed  Medium      Volodymyr Shypyguzov
Mitaka              Won't Fix  Medium      Volodymyr Shypyguzov
Newton              Confirmed  Medium      Volodymyr Shypyguzov

Bug Description

Steps to reproduce:
1. Create cluster
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD roles
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster
6. Check ceph osd tree <<< Fail

Expected result: All OSDs are up
Actual result: OSD-2 is down
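
The check in step 6 could be automated roughly as follows. This is a minimal sketch, not the actual fuel-qa test: it assumes the tree is obtained with `ceph osd tree -f json` and that each OSD entry carries `name`, `type` and `status` fields. The helper name `down_osds` and the sample data are hypothetical and are not taken from the failing cluster.

```python
import json
from typing import List


def down_osds(osd_tree_json: str) -> List[str]:
    """Return the names of OSDs whose status is not 'up'.

    Expects the JSON text printed by `ceph osd tree -f json`
    (assumed shape: {"nodes": [{"name": ..., "type": ..., "status": ...}]}).
    """
    tree = json.loads(osd_tree_json)
    return sorted(
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") != "up"
    )


# Hypothetical sample shaped like the JSON output, for illustration only.
SAMPLE = json.dumps({"nodes": [
    {"id": -1, "name": "default", "type": "root"},
    {"id": 1, "name": "osd.1", "type": "osd", "status": "up"},
    {"id": 2, "name": "osd.2", "type": "osd", "status": "down"},
]})

if __name__ == "__main__":
    print(down_osds(SAMPLE))
```

On a live node the input would come from the command output instead of `SAMPLE`; an empty result would correspond to the expected "all OSDs are up" state.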

From osd-2 log:
2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***
2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down
2016-05-31 00:55:29.850002 7f52e1ac6700 0 monclient: hunting for new mon
2016-05-31 00:55:34.845536 7f52ce29f700 0 osd.2 9 prepare_to_stop starting shutdown
2016-05-31 00:55:34.845574 7f52ce29f700 -1 osd.2 9 shutdown

cat /etc/fuel_build_id:
 420
cat /etc/fuel_build_number:
 420
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6347.noarch
 fuel-bootstrap-cli-9.0.0-1.mos284.noarch
 fuel-migrate-9.0.0-1.mos8405.noarch
 rubygem-astute-9.0.0-1.mos746.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8709.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos137.noarch
 fuel-openstack-metadata-9.0.0-1.mos8709.noarch
 fuel-notify-9.0.0-1.mos8405.noarch
 nailgun-mcagents-9.0.0-1.mos746.noarch
 python-fuelclient-9.0.0-1.mos317.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6347.noarch
 fuel-utils-9.0.0-1.mos8405.noarch
 fuel-setup-9.0.0-1.mos6347.noarch
 fuel-library9.0-9.0.0-1.mos8405.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-agent-9.0.0-1.mos284.noarch
 fuel-ui-9.0.0-1.mos2706.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 fuel-misc-9.0.0-1.mos8405.noarch
 python-packetary-9.0.0-1.mos137.noarch
 fuel-nailgun-9.0.0-1.mos8709.noarch

Tags: swarm-fail
Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :
description: updated
Changed in fuel:
assignee: nobody → Oleksiy Molchanov (omolchanov)
importance: Undecided → High
status: New → Confirmed
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → MOS Ceph (mos-ceph)
tags: added: swarm-fail
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Is it reproducible? The log shows that the OSD received SIGTERM. Was it sent by the upstart script, or is it a random SIGTERM?

Changed in fuel:
assignee: MOS Ceph (mos-ceph) → Volodymyr Shypyguzov (vshypyguzov)
Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :
Changed in fuel:
assignee: Volodymyr Shypyguzov (vshypyguzov) → Kostiantyn Danylov (kdanylov)
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> Add 3 nodes with controller and ceph OSD roles

Deploying OSDs and monitors on the same node is not supported. Please don't do that.

Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :

> Deploying OSDs and monitors on the same node is not supported. Please don't do that.

Could you please provide documentation regarding this? These tests have been running successfully in this configuration for at least a year.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> 2. Add 3 nodes with controller and ceph OSD roles

Deploying OSDs and monitors on the same host is not supported, please don't do that.

Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> 2016-05-31 00:55:24.159444 7f52f2df0800 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process ceph-osd, pid 28419
[skipped]
> 2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***
> 2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down

The test starts an OSD only to shut it down 5 seconds later. It looks weird.
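
The interval can be checked directly from the two quoted timestamps (process start and SIGTERM). A minimal sketch, assuming the timestamp format shown in the log lines above; the helper name `seconds_between` is hypothetical.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted ceph-osd log lines.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return delta.total_seconds()


# Start of ceph-osd (pid 28419) and the SIGTERM line, both quoted above.
uptime = seconds_between("2016-05-31 00:55:24.159444",
                         "2016-05-31 00:55:29.845358")

if __name__ == "__main__":
    print(f"OSD ran for {uptime:.3f} s before SIGTERM")
```

For these two lines the difference is about 5.7 seconds.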

Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

1) Running MON and OSD on the same node is not a recommended configuration.
2) There is no Ceph error in the log. The OSD received SIGTERM and shut down gracefully. The log has no further records, which means it was shut down and never started again. So it has to be down.
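
The reasoning in point 2 can be expressed as a small log check. This is only a sketch under assumptions: it treats a line containing `process ceph-osd, pid` as an OSD start (as in the startup line quoted earlier in this thread) and `*** Got signal Terminated ***` as the SIGTERM marker. The function name `stayed_down` is hypothetical.

```python
from typing import Iterable

SIGTERM_MARKER = "*** Got signal Terminated ***"
# Assumption: a startup line looks like the "ceph version ..., process ceph-osd, pid N" line.
START_MARKER = "process ceph-osd, pid"


def stayed_down(log_lines: Iterable[str]) -> bool:
    """Return True if the log shows a SIGTERM with no OSD start after it."""
    last_term = -1
    last_start = -1
    for index, line in enumerate(log_lines):
        if SIGTERM_MARKER in line:
            last_term = index
        elif START_MARKER in line:
            last_start = index
    return last_term >= 0 and last_start < last_term


# Lines quoted in this bug report.
LOG = [
    "2016-05-31 00:55:24.159444 7f52f2df0800 0 ceph version 0.94.6 "
    "(e832001feaf8c176593e0325c8298e3f16dfb403), process ceph-osd, pid 28419",
    "2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***",
    "2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down",
    "2016-05-31 00:55:34.845574 7f52ce29f700 -1 osd.2 9 shutdown",
]

if __name__ == "__main__":
    print(stayed_down(LOG))
```

A check like this only tells that the daemon was not restarted; it does not identify which process sent the signal.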

Changed in fuel:
assignee: Kostiantyn Danylov (kdanylov) → nobody
importance: High → Medium
Revision history for this message
Kostiantyn Danylov (kdanylov) wrote :

Dropping importance to Medium, as the configuration is not recommended.

Changed in fuel:
assignee: nobody → Volodymyr Shypyguzov (vshypyguzov)
Revision history for this message
Dina Belova (dbelova) wrote :

Marking as won't fix for 9.0 so far (due to the Medium priority)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-qa (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/335501

Revision history for this message
Alexey. Kalashnikov (akalashnikov) wrote :

Reproduced on swarm 9.1 snapshot #237
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_7/53/testReport/(root)/deploy_ceph_ha_nodegroups/deploy_ceph_ha_nodegroups/

The OSD goes down and does not start again.

The logs contain the same lines as before, showing the OSD going down and not starting again:
http://paste.openstack.org/show/570194/
...
2016-09-08 22:35:15.631031 7f1d8fb67700 -1 osd.2 16 *** Got signal Terminated ***
2016-09-08 22:35:15.631293 7f1d8fb67700 0 osd.2 16 prepare_to_stop telling mon we are shutting down

It happened during the deployment stage.

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :