ceph-osd process kill on AIO-SX resulted in kernel hang

Bug #1827936 reported by Maria Yousaf on 2019-05-06
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Tingjie Chen

Bug Description

Brief Description
-----------------
The ceph-osd process was killed on an AIO-SX system. The target did not recover afterwards, and a kernel hang was observed on the console.

Severity
--------
Major

Steps to Reproduce
------------------
1. ps -ef | grep ceph-osd. Make note of the pid.
2. Issue a kill -9 <pid>
3. Observe the console output and note the hung-task kernel messages (steps 1-3 are consolidated in the shell sketch after this list)
4. System recovery was only possible by rebooting the target
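
For convenience, steps 1-3 can be run as the following shell sketch (the pgrep usage and the /var/log/kern.log path are assumptions about a typical controller, not part of the original test run):

---------------------------------------------------
OSD_PID=$(pgrep -o ceph-osd)      # step 1: note the pid of the (oldest) ceph-osd
sudo kill -9 "${OSD_PID}"         # step 2: kill the daemon
# step 3: watch the kernel log for hung-task messages
tail -f /var/log/kern.log | grep --line-buffered 'blocked for more than'
---------------------------------------------------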

Expected Behavior
------------------
The process is killed, and is then restarted.

Actual Behavior
----------------
The process is killed and then kernel hang is observed on the console. Command line is no longer responsive. System must be rebooted.

Reproducibility
---------------
Intermittent. The OSD process was killed a number of times and the issue was seen only once: 1 out of 4 attempts so far.

System Configuration
--------------------
One-node system (AIO-SX)

Branch/Pull Time/Commit
-----------------------
Master as of 20190505T233000Z

Last Pass
---------
First time testing on StarlingX with Ceph Mimic

Timestamp/Logs
--------------
From kern.log:

2019-05-06T18:30:05.512 controller-0 kernel: warning [ 4538.906310] libceph: osd0 192.168.204.3:6800 socket error on write
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.830603] INFO: task jbd2/rbd0-8:126770 blocked for more than 120 seconds.
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.838488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-05-06T18:30:29.450 controller-0 kernel: info [ 4562.847368] jbd2/rbd0-8 D ffff97948d0222e0 0 126770 2 0x00000000
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847372] Call Trace:
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847381] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847385] [<ffffffffb3812709>] schedule+0x29/0x70
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847391] [<ffffffffb380f4f9>] schedule_timeout+0x289/0x330
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847397] [<ffffffffb3316e49>] ? __blk_mq_run_hw_queue+0x69/0xa0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847404] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847408] [<ffffffffb312ff9f>] ? delayacct_end+0x8f/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847411] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847413] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847415] [<ffffffffb381121d>] io_schedule_timeout+0xad/0x130
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847416] [<ffffffffb38112b8>] io_schedule+0x18/0x20
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847418] [<ffffffffb380fb41>] bit_wait_io+0x11/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847420] [<ffffffffb380f667>] __wait_on_bit+0x67/0x90
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847423] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847427] [<ffffffffb380f7d1>] out_of_line_wait_on_bit+0x81/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847434] [<ffffffffb30a7060>] ? wake_bit_function+0x40/0x40
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847445] [<ffffffffb323979a>] __wait_on_buffer+0x2a/0x30
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847474] [<ffffffffc06d58a6>] jbd2_journal_commit_transaction+0xfd6/0x1c90 [jbd2]
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847479] [<ffffffffb30b355e>] ? finish_task_switch+0x4e/0x1b0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847488] [<ffffffffc06dba9d>] kjournald2+0xcd/0x260 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847491] [<ffffffffb30a6fa0>] ? wake_up_atomic_t+0x30/0x30
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847498] [<ffffffffc06db9d0>] ? commit_timeout+0x10/0x10 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847501] [<ffffffffb30a5ee1>] kthread+0xd1/0xe0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847505] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847509] [<ffffffffb3816e1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847513] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.847516] INFO: task jbd2/rbd1-8:126832 blocked for more than 120 seconds.
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.855438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
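
For triage of similar reports, a minimal sketch of the standard hung-task diagnostics (assuming sysrq is enabled on the node; these commands are illustrative and were not run as part of this report):

---------------------------------------------------
cat /proc/sys/kernel/hung_task_timeout_secs    # the 120 s threshold seen above
# dump stacks of all uninterruptible (D-state) tasks into the kernel log
echo w | sudo tee /proc/sysrq-trigger
dmesg | tail -n 100
---------------------------------------------------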

Test Activity
-------------
Feature

Maria Yousaf (myousaf) wrote :

Same behavior seen on a bare metal 2+2+2 system.

---------------------------------------------------
storage-1:~$ date
Tue May 7 06:20:58 CDT 2019
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ date
Tue May 7 08:43:50 CDT 2019
---------------------------------------------------
[wrsroot@controller-0 bin(keystone_admin)]$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.87097 root storage-tier
-3 0.87097 chassis group-0
-4 0.43549 host storage-0
 0 ssd 0.43549 osd.0 up 1.00000 1.00000
-5 0.43549 host storage-1
 1 ssd 0.43549 osd.1 down 0 1.00000
[wrsroot@controller-0 bin(keystone_admin)]$ ceph osd tree
---------------------------------------------------
[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 85261611-1245-4b21-bd0b-cc9afdc26cff
    health: HEALTH_WARN
            Degraded data redundancy: 1719/3438 objects degraded (50.000%), 147 pgs degraded, 856 pgs undersized

  services:
    mon: 3 daemons, quorum controller-0,controller-1,storage-0
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 1 up, 1 in
    rgw: 1 daemon active

  data:
    pools: 9 pools, 856 pgs
    objects: 1.72 k objects, 1.3 GiB
    usage: 1.3 GiB used, 445 GiB / 446 GiB avail
    pgs: 1719/3438 objects degraded (50.000%)
             709 active+undersized
             147 active+undersized+degraded

  io:
    client: 6.5 KiB/s rd, 245 KiB/s wr, 8 op/s rd, 63 op/s wr
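
A minimal sketch of follow-up checks to confirm that osd.1 (shown down in the tree above) stayed down; these are standard Ceph/ps commands given for illustration, not taken from the attached logs:

---------------------------------------------------
ceph health detail            # lists the degraded/undersized PGs
ceph osd tree | grep down     # confirms osd.1 is still down
ceph osd find 1               # reports the host (storage-1) carrying osd.1
ps -ef | grep '[c]eph-osd'    # on storage-1: confirm no ceph-osd is running
---------------------------------------------------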

Changed in starlingx:
assignee: nobody → Tingjie Chen (silverhandy)
Daniel Badea (daniel.badea) wrote :

Kernel warnings about jbd2/rbd0-* tasks being blocked are to be expected, because this system is running a single OSD with replication 1 and killing ceph-osd blocks any client using Ceph, including the mariadb container:
- rbd0 is mounted to /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/kube-rbd-image-kubernetes-dynamic-pvc-765f2082-701c-11e9-8184-9662bc166b57.
- rbd1 is mounted to /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/kube-rbd-image-kubernetes-dynamic-pvc-a312923d-701c-11e9-8184-9662bc166b57
- mysqld (mariadb) is also blocked:
    2019-05-06T18:30:29.604 controller-0 kernel: err [ 4562.873872] INFO: task mysqld:249372 blocked for more than 120 seconds.
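
A minimal sketch for cross-checking which RBD images back those mounts (standard rbd/mount commands shown for illustration; not taken from the attached logs):

---------------------------------------------------
rbd showmapped            # which RBD images are mapped to /dev/rbd0, /dev/rbd1
mount | grep '/dev/rbd'   # where each mapped device is mounted
---------------------------------------------------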

There are some questions here:

1. Why is the system unresponsive after killing ceph-osd?

2. What is the exact timestamp when this happened? There are kernel log messages from the claimed hang at 2019-05-06T18:30:29.467 up to 2019-05-06T19:17:06.914, when another kernel instance starts to boot (the system was restarted), which means the kernel was not really hung (see the sketch after this list for checking this from kern.log).

3. After the reboot, ceph-osd was killed again 3 times, then logs were collected without restarting the system again. Why did it not hang this time?
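
A minimal sketch for checking question 2 against the collected kern.log (the file path/location is an assumption; adjust to wherever the logs were collected):

---------------------------------------------------
grep 'blocked for more than' kern.log | tail -n 5   # last hung-task reports before the reboot
grep -n 'Linux version' kern.log                    # each match marks the start of a kernel boot
---------------------------------------------------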

Numan Waheed (nwaheed) on 2019-05-08
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as release gating; a process kill should not result in a kernel hang. This may be related to the new Ceph version.

tags: added: stx.distro.other
tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
tags: added: stx.storage
removed: stx.2.0 stx.distro.other stx.retestneeded
Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Tingjie, Please don't remove tags added by me. They are added as part of initial bug screening. If you feel a bug should not be gating, please add a clear justification as to why. This then needs to be reviewed with the sub-project lead and agreed to before the release tag is removed.

tags: added: stx.2.0 stx.distro.other
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

For stx.retestneeded, this label should only be removed by the reporter of the bug when they've tested the provided fix.

Tingjie Chen (silverhandy) wrote :

It is hard for me to reproduce the kernel hang issue; I have killed the ceph-osd process many times, but it recovers after a few seconds.
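
For reference, a hypothetical retest loop of this kind (the 30-second wait and the pgrep usage are assumptions, not the exact procedure used here):

---------------------------------------------------
for i in 1 2 3 4; do
    pid=$(pgrep -o ceph-osd) || break    # oldest ceph-osd pid, if any
    sudo kill -9 "${pid}"
    sleep 30                             # give the process monitor time to restart it
    pgrep ceph-osd >/dev/null && echo "attempt ${i}: ceph-osd recovered" \
                              || echo "attempt ${i}: ceph-osd did NOT recover"
done
---------------------------------------------------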

The log:
--------------------
2019-05-06T18:30:05.512 controller-0 kernel: warning [ 4538.906310] libceph: osd0 192.168.204.3:6800 socket error on write
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.830603] INFO: task jbd2/rbd0-8:126770 blocked for more than 120 seconds.
...
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.847516] INFO: task jbd2/rbd1-8:126832 blocked for more than 120 seconds.

This log means the process jbd2/rbd0-8 did not leave uninterruptible sleep after this mark. That state can be caused by waiting for disk IO, by vfork(), and many other cases.
jbd2/rbd0-8 is the ext4 journal thread for the Ceph RADOS block device /dev/rbd0, so it was most likely stuck waiting for IO after the ceph-osd process was killed.
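
A minimal sketch for inspecting such D-state tasks on a live system (illustrative; pid 126770 is the jbd2/rbd0-8 thread from the log above):

---------------------------------------------------
# tasks in uninterruptible sleep (state D) and the kernel function they wait in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# kernel stack of a specific blocked task, e.g. jbd2/rbd0-8
sudo cat /proc/126770/stack
---------------------------------------------------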

Maria, when you hit the hang, had you deployed the containerized OpenStack application helm tarball?

BTW, for Khalil: I did not actually remove your tags. I probably added the stx.storage tag without refreshing the page, by which time you had already added several tags :)

Cindy Xie (xxie1) on 2019-05-22
tags: removed: stx.distro.other
Cindy Xie (xxie1) wrote :

@Tingjie to add the build # for the one instance where the issue was reproduced. It may be related to a kernel driver.
@Ovidiu updated the analysis results: ceph-osd was killed multiple times, and this needs to be retested. If the user has more than 1 OSD in the system, the issue will not occur.

tags: removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

Since Cindy removed the stx.2.0 release tag, I'm setting the priority to Low as this bug is no longer considered gating.

Changed in starlingx:
importance: Medium → Low
Cindy Xie (xxie1) on 2019-06-10
Changed in starlingx:
status: In Progress → Triaged