Fuel for OpenStack

Ceph OSD is down after deployment

Series newton

Bug #1587427 reported by Volodymyr Shypyguzov on 2016-05-31

This bug affects 5 people

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Confirmed	Medium	Volodymyr Shypyguzov	Fuel for OpenStack 10.0
Mitaka	Won't Fix	Medium	Volodymyr Shypyguzov	Fuel for OpenStack 9.0
Newton	Confirmed	Medium	Volodymyr Shypyguzov	Fuel for OpenStack 10.0

Bug Description

Steps to reproduce:
1. Create cluster
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD role
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster
6. Check ceph osd tree <<< Fail

Expected result: All OSDs are up
Actual result: OSD-2 is down
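
The check in step 6 can be automated. Below is a minimal sketch, not part of the original test suite, that lists OSDs reported as not up. It assumes the JSON form of the command (`ceph osd tree --format json`) contains a `nodes` list whose entries carry `type`, `name`, and `status` fields; the function name `down_osds` and the sample data are hypothetical and only illustrate the result described above.

```python
import json


def down_osds(tree_json):
    """Return the names of OSDs whose status is not 'up'.

    Assumes the JSON layout of `ceph osd tree --format json`:
    a top-level "nodes" list with "type", "name" and "status" keys.
    """
    tree = json.loads(tree_json)
    return sorted(
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") != "up"
    )


# Hypothetical, trimmed sample mirroring the reported state (osd.2 down).
SAMPLE = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "up"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "down"},
    ]
})

if __name__ == "__main__":
    print(down_osds(SAMPLE))
```

On a real node the input string would come from running the command on a controller. An empty list corresponds to the expected result.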

From osd-2 log:
2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***
2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down
2016-05-31 00:55:29.850002 7f52e1ac6700 0 monclient: hunting for new mon
2016-05-31 00:55:34.845536 7f52ce29f700 0 osd.2 9 prepare_to_stop starting shutdown
2016-05-31 00:55:34.845574 7f52ce29f700 -1 osd.2 9 shutdown

cat /etc/fuel_build_id:
420
cat /etc/fuel_build_number:
420
cat /etc/fuel_release:
9.0
cat /etc/fuel_openstack_version:
mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
fuel-release-9.0.0-1.mos6347.noarch
fuel-bootstrap-cli-9.0.0-1.mos284.noarch
fuel-migrate-9.0.0-1.mos8405.noarch
rubygem-astute-9.0.0-1.mos746.noarch
fuel-provisioning-scripts-9.0.0-1.mos8709.noarch
network-checker-9.0.0-1.mos72.x86_64
fuel-mirror-9.0.0-1.mos137.noarch
fuel-openstack-metadata-9.0.0-1.mos8709.noarch
fuel-notify-9.0.0-1.mos8405.noarch
nailgun-mcagents-9.0.0-1.mos746.noarch
python-fuelclient-9.0.0-1.mos317.noarch
fuelmenu-9.0.0-1.mos270.noarch
fuel-9.0.0-1.mos6347.noarch
fuel-utils-9.0.0-1.mos8405.noarch
fuel-setup-9.0.0-1.mos6347.noarch
fuel-library9.0-9.0.0-1.mos8405.noarch
shotgun-9.0.0-1.mos90.noarch
fuel-agent-9.0.0-1.mos284.noarch
fuel-ui-9.0.0-1.mos2706.noarch
fuel-ostf-9.0.0-1.mos934.noarch
fuel-misc-9.0.0-1.mos8405.noarch
python-packetary-9.0.0-1.mos137.noarch
fuel-nailgun-9.0.0-1.mos8709.noarch

Volodymyr Shypyguzov (vshypyguzov) wrote on 2016-05-31:

fail_error_check_ceph_ha-fuel-snapshot-2016-05-31_01-20-19.tar.gz (48.5 MiB, application/x-tar)

description:	updated

Oleksiy Molchanov (omolchanov) on 2016-05-31

Changed in fuel:
assignee:	nobody → Oleksiy Molchanov (omolchanov)
importance:	Undecided → High
status:	New → Confirmed

Oleksiy Molchanov (omolchanov) on 2016-06-01

Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → MOS Ceph (mos-ceph)

Sergey Shevorakov (sshevorakov) on 2016-06-01

tags:	added: swarm-fail

Kostiantyn Danylov (kdanylov) wrote on 2016-06-02:

Is it reproducible? The log shows that the OSD received SIGTERM. Was it sent by the upstart script, or is it a random SIGTERM?

Changed in fuel:
assignee:	MOS Ceph (mos-ceph) → Volodymyr Shypyguzov (vshypyguzov)

Volodymyr Shypyguzov (vshypyguzov) wrote on 2016-06-02:

Since then it has passed twice and failed once with the same error on a custom ISO.

Failed job: https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.thread_3/127/testReport/%28root%29/check_ceph_ha/check_ceph_ha/
Custom iso: https://custom-ci.infra.mirantis.net/view/9.0/job/9.0.custom.iso/193/

Changed in fuel:
assignee:	Volodymyr Shypyguzov (vshypyguzov) → Kostiantyn Danylov (kdanylov)

Alexei Sheplyakov (asheplyakov) wrote on 2016-06-02:

> Add 3 nodes with controller and ceph OSD roles

Deploying OSDs and monitors on the same node is not supported. Please don't do that.

Volodymyr Shypyguzov (vshypyguzov) wrote on 2016-06-02:

> Deploying OSDs and monitors on the same node is not supported. Please don't do that.

Could you please point to documentation that says so? These tests have been running successfully in this configuration for at least a year.

Alexei Sheplyakov (asheplyakov) wrote on 2016-06-02:

> 2. Add 3 nodes with controller and ceph OSD roles

Deploying OSDs and monitors on the same host is not supported, please don't do that.

Alexei Sheplyakov (asheplyakov) wrote on 2016-06-02:

> 2016-05-31 00:55:24.159444 7f52f2df0800 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process ceph-osd, pid 28419
[skipped]
> 2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***
> 2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down

The test starts an OSD only to shut it down 5 seconds later. It looks weird.
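
The interval can be computed directly from the two quoted log lines. The sketch below is only an illustration: the helper names `parse_ts` and `seconds_between` are hypothetical, and it assumes every log line starts with a 26-character `YYYY-MM-DD HH:MM:SS.ffffff` timestamp, as the quoted lines do.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 26  # len("2016-05-31 00:55:24.159444")


def parse_ts(line):
    """Parse the timestamp at the start of a Ceph OSD log line."""
    return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)


def seconds_between(first_line, second_line):
    """Return the elapsed seconds between two log lines."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()


# The two lines quoted above, without the "> " quote prefix.
START_LINE = (
    "2016-05-31 00:55:24.159444 7f52f2df0800 0 ceph version 0.94.6 "
    "(e832001feaf8c176593e0325c8298e3f16dfb403), process ceph-osd, pid 28419"
)
SIGTERM_LINE = (
    "2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 "
    "*** Got signal Terminated ***"
)

if __name__ == "__main__":
    print(f"{seconds_between(START_LINE, SIGTERM_LINE):.3f} s")
```

For these two lines the difference is about 5.7 seconds, which matches the "5 seconds later" observation.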

Kostiantyn Danylov (kdanylov) wrote on 2016-06-02:

1) Running MON and OSD on the same node is not a recommended configuration.
2) There is no Ceph error in the log. The OSD received SIGTERM and shut down gracefully. The log has no further records, which means it was shut down and never started again. So it is expected to be down.
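
The reasoning in point 2 can be expressed as a small log scan. This is a hedged sketch, not an existing tool: the function name `osd_final_state` is hypothetical, and the start marker (`process ceph-osd, pid`) is assumed from the single startup line quoted earlier in this thread, so other Ceph versions may log startup differently.

```python
START_MARKER = "process ceph-osd, pid"          # assumed startup line marker
SIGTERM_MARKER = "*** Got signal Terminated ***"


def osd_final_state(log_lines):
    """Return 'started', 'terminated' or 'unknown' based on the last marker seen."""
    state = "unknown"
    for line in log_lines:
        if START_MARKER in line:
            state = "started"
        elif SIGTERM_MARKER in line:
            state = "terminated"
    return state


# Lines taken from the bug description and the quoted startup line.
LOG = [
    "2016-05-31 00:55:24.159444 7f52f2df0800 0 ceph version 0.94.6 "
    "(e832001feaf8c176593e0325c8298e3f16dfb403), process ceph-osd, pid 28419",
    "2016-05-31 00:55:29.845358 7f52ce29f700 -1 osd.2 9 *** Got signal Terminated ***",
    "2016-05-31 00:55:29.845396 7f52ce29f700 0 osd.2 9 prepare_to_stop telling mon we are shutting down",
    "2016-05-31 00:55:34.845536 7f52ce29f700 0 osd.2 9 prepare_to_stop starting shutdown",
    "2016-05-31 00:55:34.845574 7f52ce29f700 -1 osd.2 9 shutdown",
]

if __name__ == "__main__":
    print(osd_final_state(LOG))
```

A final state of `terminated` means that no startup line follows the last SIGTERM, which is the situation described here.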

Changed in fuel:
assignee:	Kostiantyn Danylov (kdanylov) → nobody
importance:	High → Medium

Kostiantyn Danylov (kdanylov) wrote on 2016-06-02:

Dropping importance to Medium, as the configuration is not recommended.

Changed in fuel:
assignee:	nobody → Volodymyr Shypyguzov (vshypyguzov)

Dina Belova (dbelova) wrote on 2016-06-02:

#10

Marking as Won't Fix for 9.0 for now (due to the Medium importance).

OpenStack Infra (hudson-openstack) wrote on 2016-06-29: Related fix proposed to fuel-qa (master)

#11

Related fix proposed to branch: master
Review: https://review.openstack.org/335501

Alexey. Kalashnikov (akalashnikov) wrote on 2016-09-09:

#12

Reproduced on swarm 9.1 snapshot #237
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_7/53/testReport/(root)/deploy_ceph_ha_nodegroups/deploy_ceph_ha_nodegroups/

The OSD goes down and does not start again.

The logs contain the same lines, showing the OSD going down and not starting again:
http://paste.openstack.org/show/570194/
...
2016-09-08 22:35:15.631031 7f1d8fb67700 -1 osd.2 16 *** Got signal Terminated ***
2016-09-08 22:35:15.631293 7f1d8fb67700 0 osd.2 16 prepare_to_stop telling mon we are shutting down

It happened during the deployment stage.

Dmitry Belyaninov (dbelyaninov) wrote on 2016-09-15:

#13

Reproduced again. Snapshot #264

https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.ha_destructive_ceph_neutron/60/testReport/(root)/ha_ceph_neutron_rabbit_master_destroy/ha_ceph_neutron_rabbit_master_destroy/