osd process failed to recover after being killed
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Wei Zhou | |
Bug Description
Brief Description
-----------------
After the ceph osd process was killed, it failed to recover.
Severity
--------
Major
Steps to Reproduce
------------------
1. Randomly pick one OSD and get the pid of the OSD process
2. Kill the ceph-osd PID for that OSD using kill -9
3. Check that the OSD process is re-spawned with a different PID and validate alarms. (The OSD process was not observed to recover; this was confirmed by the alarms present on the system.)
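The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code of the test framework that produced the logs: it assumes it runs as root directly on the storage node, and the helper names (`pid_alive`, `read_pid_file`, `wait_for_respawn`, `kill_and_wait`) and the `pid_file` argument are hypothetical. The actual OSD pid file path is truncated in the logs, so it is left as a parameter.

```python
import os
import signal
import time
from typing import Callable, Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (same idea as `kill -0 PID`)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid_file(path: str) -> Optional[int]:
    """Read a PID from a pid file; return None if the file is missing or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def wait_for_respawn(read_pid: Callable[[], Optional[int]], old_pid: int,
                     timeout: float = 60.0, interval: float = 1.0) -> Optional[int]:
    """Poll until read_pid() reports a live PID different from old_pid.

    Returns the new PID, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_pid = read_pid()
        if new_pid is not None and new_pid != old_pid and pid_alive(new_pid):
            return new_pid
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def kill_and_wait(pid_file: str, timeout: float = 60.0) -> Optional[int]:
    """Steps 1-3: read the OSD PID, send SIGKILL (kill -9), wait for a new PID."""
    old_pid = read_pid_file(pid_file)
    if old_pid is None:
        raise RuntimeError("could not read a PID from " + pid_file)
    os.kill(old_pid, signal.SIGKILL)
    return wait_for_respawn(lambda: read_pid_file(pid_file), old_pid, timeout)
```

With the behaviour reported in this bug, `kill_and_wait` would return `None`: the pid file still holds the old PID and no process with that PID exists.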
Expected Behaviour
------------------
ceph-osd process is running again with a different PID
Actual Behaviour
----------------
ceph-osd process does not recover after being killed.
Reproducibility
---------------
Reproducible
10/10
System Configuration
-------
Dedicated storage
Branch/Pull Time/Commit
-------
master as of 2018-12-06_20-18-00
Timestamp/Logs
--------------
1. Process killed
[2018-12-07 15:21:31,288] 125 INFO MainThread storage_
[2018-12-07 15:21:31,289] 426 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2018-12-07 15:21:31,289] 263 DEBUG MainThread ssh.send :: Send 'whoami'
[2018-12-07 15:21:31,393] 389 DEBUG MainThread ssh.expect :: Output:
root
]0;root@
2. Confirmation of process kill
[2018-12-07 15:21:31,505] 263 DEBUG MainThread ssh.send :: Send 'kill -0 33089'
[2018-12-07 15:21:31,608] 389 DEBUG MainThread ssh.expect :: Output:
-sh: kill: (33089) - No such process
storage-0:~$
3. Process not re-spawned
[2018-12-07 15:21:35,233] 263 DEBUG MainThread ssh.send :: Send 'cat /var/run/
[2018-12-07 15:21:35,336] 389 DEBUG MainThread ssh.expect :: Output:
33089
storage-0:~$
4. Alarms reported:
[2018-12-07 15:26:39,479] 263 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
[2018-12-07 15:26:40,976] 389 DEBUG MainThread ssh.expect :: Output:
+------
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 9c43f0ad-
| ba5404da-
| 20f05871-
+------
controller-1:~$
summary:
- STX: Storage, failed to kill ceph osd process
+ osd process failed to recover after being killed
description: updated
Changed in starlingx:
status: Triaged → In Progress
tags: added: stx.2019.05; removed: stx.2019.03
tags: added: stx.2.0; removed: stx.2019.05
Marking as release gating until further investigation