osd process failed to recover after being killed

Bug #1807444 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Wei Zhou

Bug Description

Brief Description
-----------------
After the ceph osd process was killed, it failed to recover.

Severity
--------
Major

Steps to Reproduce
------------------
1. Randomly pick one OSD and get the PID of its ceph-osd process
2. Kill that ceph-osd process using kill -9 <pid>
3. Check that the OSD process is re-spawned with a different PID and validate alarms (the OSD process was not observed to recover, which was confirmed by the alarms present on the system); see the sketch after this list
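A minimal shell sketch of this check, assuming it is run as root on the storage host and using the pid-file path (/var/run/ceph/osd.<N>.pid) seen in the logs below; osd.11 is used here only as an example:

    #!/bin/bash
    # Sketch: kill one ceph-osd and verify it is re-spawned with a new PID.
    # Assumes osd.11 and the pid-file location shown in the logs below.
    OSD_ID=11
    PID_FILE=/var/run/ceph/osd.${OSD_ID}.pid

    OLD_PID=$(cat "${PID_FILE}")
    echo "ceph-osd for osd.${OSD_ID} currently running as pid ${OLD_PID}"

    kill -9 "${OLD_PID}"

    # Allow time for the monitoring/respawn mechanism to restart the daemon.
    sleep 30

    NEW_PID=$(cat "${PID_FILE}")
    if kill -0 "${NEW_PID}" 2>/dev/null && [ "${NEW_PID}" != "${OLD_PID}" ]; then
        echo "PASS: osd.${OSD_ID} re-spawned with new pid ${NEW_PID}"
    else
        echo "FAIL: osd.${OSD_ID} did not recover (pid file still reports ${NEW_PID})"
    fi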

Expected Behaviour
------------------
ceph-osd process is running again with a different PID

Actual Behaviour
----------------
ceph-osd process does not recover after being killed.

Reproducibility
---------------
Reproducible
10/10

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
master as of 2018-12-06_20-18-00

Timestamp/Logs
--------------
1. Process killed
[2018-12-07 15:21:31,288] 125 INFO MainThread storage_helper.kill_process:: kill -9 33089
[2018-12-07 15:21:31,289] 426 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-12-07 15:21:31,289] 263 DEBUG MainThread ssh.send :: Send 'whoami'
[2018-12-07 15:21:31,393] 389 DEBUG MainThread ssh.expect :: Output:
root
storage-0:~#
2. Confirmation of process kill
[2018-12-07 15:21:31,505] 263 DEBUG MainThread ssh.send :: Send 'kill -0 33089'
[2018-12-07 15:21:31,608] 389 DEBUG MainThread ssh.expect :: Output:
-sh: kill: (33089) - No such process
storage-0:~$
3. Process not re-spawned
[2018-12-07 15:21:35,233] 263 DEBUG MainThread ssh.send :: Send 'cat /var/run/ceph/osd.11.pid'
[2018-12-07 15:21:35,336] 389 DEBUG MainThread ssh.expect :: Output:
33089
storage-0:~$
4. Alarms reported:
[2018-12-07 15:26:39,479] 263 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-12-07 15:26:40,976] 389 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| 9c43f0ad-4824-4ea6-8b9c-942925fd1b07 | 800.011 | Loss of replication in replication group group-0: OSDs are down | cluster=ff089d70-94ab-42e9-97ce-32fcfbd475f5.peergroup=group-0.host=storage-0 | major | 2018-12-07T15:22:06.933823 |
| ba5404da-0374-4ff9-8f1f-83e82fcd45ad | 800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized;recovery 60/969 objects degraded (6.192%)]. Please check 'ceph -s' for more details. | cluster=ff089d70-94ab-42e9-97ce-32fcfbd475f5 | warning | 2018-12-07T15:22:06.641279 |
| 20f05871-ae14-4339-af43-32a3d15d8fbe | 100.114 | NTP address 142.93.85.177 is not a valid or a reachable NTP server. | host=controller-1.ntp=142.93.85.177 | minor | 2018-12-07T13:33:48.833721 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$
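For completeness, the OSD-down condition behind the 800.011/800.001 alarms above can also be confirmed with standard ceph CLI commands on the controller (a sketch, not part of the original test output):

    ceph -s          # overall health; expect HEALTH_WARN with degraded/undersized PGs
    ceph osd tree    # the killed OSD (osd.11 in this run) should be reported as down
    ceph osd stat    # summary count of OSDs up/in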

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating until further investigation

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Wei Zhou (wzhou007)
tags: added: stx.2019.03 stx.config
Ghada Khalil (gkhalil)
summary: - STX: Storage, failed to kill ceph osd process
+ osd process failed to recover after being killed
Maria Yousaf (myousaf)
description: updated
Wei Zhou (wzhou007)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Wei Zhou (wzhou007) wrote :

This is the pull request for the fix: https://github.com/starlingx-staging/stx-ceph/pull/19

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05