osd process failed to recover after being killed

Bug #1807444 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Wei Zhou

Bug Description

Brief Description
-----------------
After the ceph osd process was killed, it failed to recover.

Severity
--------
Major

Steps to Reproduce
------------------
1. Randomly pick one OSD and get the PID of its ceph-osd process
2. Kill that ceph-osd process using kill -9 <pid>
3. Check that the OSD process is re-spawned with a different PID and validate alarms (the OSD process was not observed to recover, which was confirmed by the alarms present on the system); see the sketch after this list
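A minimal shell sketch of this check, assuming it is run as root on the storage host and using the pid-file path (/var/run/ceph/osd.<N>.pid) seen in the logs below; osd.11 is used here only as an example:

    #!/bin/bash
    # Sketch: kill one ceph-osd and verify it is re-spawned with a new PID.
    # Assumes osd.11 and the pid-file location shown in the logs below.
    OSD_ID=11
    PID_FILE=/var/run/ceph/osd.${OSD_ID}.pid

    OLD_PID=$(cat "${PID_FILE}")
    echo "ceph-osd for osd.${OSD_ID} currently running as pid ${OLD_PID}"

    kill -9 "${OLD_PID}"

    # Allow time for the monitoring/respawn mechanism to restart the daemon.
    sleep 30

    NEW_PID=$(cat "${PID_FILE}")
    if kill -0 "${NEW_PID}" 2>/dev/null && [ "${NEW_PID}" != "${OLD_PID}" ]; then
        echo "PASS: osd.${OSD_ID} re-spawned with new pid ${NEW_PID}"
    else
        echo "FAIL: osd.${OSD_ID} did not recover (pid file still reports ${NEW_PID})"
    fi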

Expected Behaviour
------------------
ceph-osd process is running again with a different PID

Actual Behaviour
----------------
ceph-osd process does not recover after being killed.

Reproducibility
---------------
Reproducible
10/10

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
master as of 2018-12-06_20-18-00

Timestamp/Logs
--------------
1. Process killed
[2018-12-07 15:21:31,288] 125 INFO MainThread storage_helper.kill_process:: kill -9 33089
[2018-12-07 15:21:31,289] 426 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-12-07 15:21:31,289] 263 DEBUG MainThread ssh.send :: Send 'whoami'
[2018-12-07 15:21:31,393] 389 DEBUG MainThread ssh.expect :: Output:
root
storage-0:~#
2. Confirmation of process kill
[2018-12-07 15:21:31,505] 263 DEBUG MainThread ssh.send :: Send 'kill -0 33089'
[2018-12-07 15:21:31,608] 389 DEBUG MainThread ssh.expect :: Output:
-sh: kill: (33089) - No such process
storage-0:~$
3. Process not re-spawned
[2018-12-07 15:21:35,233] 263 DEBUG MainThread ssh.send :: Send 'cat /var/run/ceph/osd.11.pid'
[2018-12-07 15:21:35,336] 389 DEBUG MainThread ssh.expect :: Output:
33089
storage-0:~$
4. Alarms reported:
[2018-12-07 15:26:39,479] 263 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-12-07 15:26:40,976] 389 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| 9c43f0ad-4824-4ea6-8b9c-942925fd1b07 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=ff089d70-94ab-42e9-97ce-32fcfbd475f5.peergroup=group-0.host=storage-0 | major | 2018-12-07T15:22:06.933823 |
| ba5404da-0374-4ff9-8f1f-83e82fcd45ad | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized;recovery 60/969 objects degraded (6.192%)]. Please check 'ceph -s' for more details. | cluster=ff089d70-94ab-42e9-97ce-32fcfbd475f5 | warning | 2018-12-07T15:22:06.641279 |
| 20f05871-ae14-4339-af43-32a3d15d8fbe | 100.114 | NTP address 142.93.85.177 is not a valid or a reachable NTP server. | host=controller-1.ntp=142.93.85.177 | minor | 2018-12-07T13:33:48.833721 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$
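For completeness, the OSD-down condition behind the 800.011/800.001 alarms above can also be confirmed with standard ceph CLI commands on the controller (a sketch, not part of the original test output):

    ceph -s          # overall health; expect HEALTH_WARN with degraded/undersized PGs
    ceph osd tree    # the killed OSD (osd.11 in this run) should be reported as down
    ceph osd stat    # summary count of OSDs up/in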

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating until further investigation

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Wei Zhou (wzhou007)
tags: added: stx.2019.03 stx.config
Ghada Khalil (gkhalil)
summary: - STX: Storage, failed to kill ceph osd process
+ osd process failed to recover after being killed
Maria Yousaf (myousaf)
description: updated
Wei Zhou (wzhou007)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Wei Zhou (wzhou007) wrote :

This is the pull request for the fix: https://github.com/starlingx-staging/stx-ceph/pull/19

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05