ceph-osd process kill on AIO-SX resulted in kernel hang

Bug #1827936 reported by Maria Yousaf on 2019-05-06
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Tingjie Chen

Bug Description

Brief Description
-----------------
The ceph-osd process was killed on an AIO-SX system. The target did not recover afterwards, and a kernel hang was observed on the console.

Severity
--------
Major

Steps to Reproduce
------------------
1. ps -ef | grep ceph-osd. Make note of the pid.
2. Issue a kill -9 <pid>
3. Observe the console output and note the hung-task kernel messages (steps 1-3 are consolidated in the shell sketch after this list)
4. System recovery was only possible by rebooting the target
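
For convenience, steps 1-3 can be run as the following shell sketch (the pgrep usage and the /var/log/kern.log path are assumptions about a typical controller, not part of the original test run):

---------------------------------------------------
OSD_PID=$(pgrep -o ceph-osd)      # step 1: note the pid of the (oldest) ceph-osd
sudo kill -9 "${OSD_PID}"         # step 2: kill the daemon
# step 3: watch the kernel log for hung-task messages
tail -f /var/log/kern.log | grep --line-buffered 'blocked for more than'
---------------------------------------------------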

Expected Behavior
------------------
The process is killed, and is then restarted.

Actual Behavior
----------------
The process is killed and then kernel hang is observed on the console. Command line is no longer responsive. System must be rebooted.

Reproducibility
---------------
Intermittent. The OSD process was killed a number of times and the issue was seen only once: 1 out of 4 attempts so far.

System Configuration
--------------------
One-node system (AIO-SX)

Branch/Pull Time/Commit
-----------------------
Master as of 20190505T233000Z

Last Pass
---------
First time testing on StarlingX with Ceph Mimic

Timestamp/Logs
--------------
From kern.log:

2019-05-06T18:30:05.512 controller-0 kernel: warning [ 4538.906310] libceph: osd0 192.168.204.3:6800 socket error on write
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.830603] INFO: task jbd2/rbd0-8:126770 blocked for more than 120 seconds.
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.838488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-05-06T18:30:29.450 controller-0 kernel: info [ 4562.847368] jbd2/rbd0-8 D ffff97948d0222e0 0 126770 2 0x00000000
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847372] Call Trace:
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847381] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847385] [<ffffffffb3812709>] schedule+0x29/0x70
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847391] [<ffffffffb380f4f9>] schedule_timeout+0x289/0x330
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847397] [<ffffffffb3316e49>] ? __blk_mq_run_hw_queue+0x69/0xa0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847404] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847408] [<ffffffffb312ff9f>] ? delayacct_end+0x8f/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847411] [<ffffffffb30e5cc2>] ? ktime_get_ts64+0x52/0xf0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847413] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847415] [<ffffffffb381121d>] io_schedule_timeout+0xad/0x130
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847416] [<ffffffffb38112b8>] io_schedule+0x18/0x20
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847418] [<ffffffffb380fb41>] bit_wait_io+0x11/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847420] [<ffffffffb380f667>] __wait_on_bit+0x67/0x90
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847423] [<ffffffffb380fb30>] ? bit_wait+0x50/0x50
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847427] [<ffffffffb380f7d1>] out_of_line_wait_on_bit+0x81/0xb0
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847434] [<ffffffffb30a7060>] ? wake_bit_function+0x40/0x40
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847445] [<ffffffffb323979a>] __wait_on_buffer+0x2a/0x30
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847474] [<ffffffffc06d58a6>] jbd2_journal_commit_transaction+0xfd6/0x1c90 [jbd2]
2019-05-06T18:30:29.450 controller-0 kernel: warning [ 4562.847479] [<ffffffffb30b355e>] ? finish_task_switch+0x4e/0x1b0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847488] [<ffffffffc06dba9d>] kjournald2+0xcd/0x260 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847491] [<ffffffffb30a6fa0>] ? wake_up_atomic_t+0x30/0x30
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847498] [<ffffffffc06db9d0>] ? commit_timeout+0x10/0x10 [jbd2]
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847501] [<ffffffffb30a5ee1>] kthread+0xd1/0xe0
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847505] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847509] [<ffffffffb3816e1d>] ret_from_fork_nospec_begin+0x7/0x21
2019-05-06T18:30:29.467 controller-0 kernel: warning [ 4562.847513] [<ffffffffb30a5e10>] ? kthread_create_on_node+0x140/0x140
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.847516] INFO: task jbd2/rbd1-8:126832 blocked for more than 120 seconds.
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.855438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
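
For triage of similar reports, a minimal sketch of the standard hung-task diagnostics (assuming sysrq is enabled on the node; these commands are illustrative and were not run as part of this report):

---------------------------------------------------
cat /proc/sys/kernel/hung_task_timeout_secs    # the 120 s threshold seen above
# dump stacks of all uninterruptible (D-state) tasks into the kernel log
echo w | sudo tee /proc/sysrq-trigger
dmesg | tail -n 100
---------------------------------------------------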

Test Activity
-------------
Feature

Maria Yousaf (myousaf) wrote :

Same behavior seen on a bare metal 2+2+2 system.

---------------------------------------------------
storage-1:~$ date
Tue May 7 06:20:58 CDT 2019
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ ps -A | grep ceph*
storage-1:~$ date
Tue May 7 08:43:50 CDT 2019
---------------------------------------------------
[wrsroot@controller-0 bin(keystone_admin)]$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.87097 root storage-tier
-3 0.87097 chassis group-0
-4 0.43549 host storage-0
 0 ssd 0.43549 osd.0 up 1.00000 1.00000
-5 0.43549 host storage-1
 1 ssd 0.43549 osd.1 down 0 1.00000
[wrsroot@controller-0 bin(keystone_admin)]$ ceph osd tree
---------------------------------------------------
[wrsroot@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 85261611-1245-4b21-bd0b-cc9afdc26cff
    health: HEALTH_WARN
            Degraded data redundancy: 1719/3438 objects degraded (50.000%), 147 pgs degraded, 856 pgs undersized

  services:
    mon: 3 daemons, quorum controller-0,controller-1,storage-0
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 1 up, 1 in
    rgw: 1 daemon active

  data:
    pools: 9 pools, 856 pgs
    objects: 1.72 k objects, 1.3 GiB
    usage: 1.3 GiB used, 445 GiB / 446 GiB avail
    pgs: 1719/3438 objects degraded (50.000%)
             709 active+undersized
             147 active+undersized+degraded

  io:
    client: 6.5 KiB/s rd, 245 KiB/s wr, 8 op/s rd, 63 op/s wr
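
A minimal sketch of follow-up checks to confirm that osd.1 (shown down in the tree above) stayed down; these are standard Ceph/ps commands given for illustration, not taken from the attached logs:

---------------------------------------------------
ceph health detail            # lists the degraded/undersized PGs
ceph osd tree | grep down     # confirms osd.1 is still down
ceph osd find 1               # reports the host (storage-1) carrying osd.1
ps -ef | grep '[c]eph-osd'    # on storage-1: confirm no ceph-osd is running
---------------------------------------------------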

Changed in starlingx:
assignee: nobody → Tingjie Chen (silverhandy)
Daniel Badea (daniel.badea) wrote :

Kernel warnings about jbd2/rbd0-* tasks being blocked are to be expected, because this system is running a single OSD with replication 1 and killing ceph-osd blocks any client using Ceph, including the mariadb container:
- rbd0 is mounted to /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/kube-rbd-image-kubernetes-dynamic-pvc-765f2082-701c-11e9-8184-9662bc166b57.
- rbd1 is mounted to /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/kube-rbd-image-kubernetes-dynamic-pvc-a312923d-701c-11e9-8184-9662bc166b57
- mysqld (mariadb) is also blocked:
    2019-05-06T18:30:29.604 controller-0 kernel: err [ 4562.873872] INFO: task mysqld:249372 blocked for more than 120 seconds.
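
A minimal sketch for cross-checking which RBD images back those mounts (standard rbd/mount commands shown for illustration; not taken from the attached logs):

---------------------------------------------------
rbd showmapped            # which RBD images are mapped to /dev/rbd0, /dev/rbd1
mount | grep '/dev/rbd'   # where each mapped device is mounted
---------------------------------------------------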

There are some questions here:

1. Why is the system unresponsive after killing ceph-osd?

2. What is the exact timestamp when this happened? There are kernel log messages from the claimed hang at 2019-05-06T18:30:29.467 up to 2019-05-06T19:17:06.914, when another kernel instance starts to boot (the system was restarted), which means the kernel was not really hung (see the sketch after this list for checking this from kern.log).

3. After the reboot, ceph-osd was killed again 3 times, then logs were collected without restarting the system again. Why did it not hang this time?
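
A minimal sketch for checking question 2 against the collected kern.log (the file path/location is an assumption; adjust to wherever the logs were collected):

---------------------------------------------------
grep 'blocked for more than' kern.log | tail -n 5   # last hung-task reports before the reboot
grep -n 'Linux version' kern.log                    # each match marks the start of a kernel boot
---------------------------------------------------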

Numan Waheed (nwaheed) on 2019-05-08
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as release gating; a process kill should not result in a kernel hang. This may be related to the new Ceph version.

tags: added: stx.distro.other
tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
tags: added: stx.storage
removed: stx.2.0 stx.distro.other stx.retestneeded
Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Tingjie, Please don't remove tags added by me. They are added as part of initial bug screening. If you feel a bug should not be gating, please add a clear justification as to why. This then needs to be reviewed with the sub-project lead and agreed to before the release tag is removed.

tags: added: stx.2.0 stx.distro.other
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

For stx.retestneeded, this label should only be removed by the reporter of the bug when they've tested the provided fix.

Tingjie Chen (silverhandy) wrote :

It is hard for me to reproduce the kernel hang issue; I have killed the ceph-osd process many times, but it recovers after a few seconds.
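
For reference, a hypothetical retest loop of this kind (the 30-second wait and the pgrep usage are assumptions, not the exact procedure used here):

---------------------------------------------------
for i in 1 2 3 4; do
    pid=$(pgrep -o ceph-osd) || break    # oldest ceph-osd pid, if any
    sudo kill -9 "${pid}"
    sleep 30                             # give the process monitor time to restart it
    pgrep ceph-osd >/dev/null && echo "attempt ${i}: ceph-osd recovered" \
                              || echo "attempt ${i}: ceph-osd did NOT recover"
done
---------------------------------------------------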

The log:
--------------------
2019-05-06T18:30:05.512 controller-0 kernel: warning [ 4538.906310] libceph: osd0 192.168.204.3:6800 socket error on write
2019-05-06T18:30:29.450 controller-0 kernel: err [ 4562.830603] INFO: task jbd2/rbd0-8:126770 blocked for more than 120 seconds.
...
2019-05-06T18:30:29.467 controller-0 kernel: err [ 4562.847516] INFO: task jbd2/rbd1-8:126832 blocked for more than 120 seconds.

This log means the process jbd2/rbd0-8 did not leave uninterruptible sleep after this mark. That state can be caused by waiting for disk IO, by vfork(), and many other cases.
jbd2/rbd0-8 is the ext4 journal thread for the Ceph RADOS block device /dev/rbd0, so it was most likely stuck waiting for IO after the ceph-osd process was killed.
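
A minimal sketch for inspecting such D-state tasks on a live system (illustrative; pid 126770 is the jbd2/rbd0-8 thread from the log above):

---------------------------------------------------
# tasks in uninterruptible sleep (state D) and the kernel function they wait in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# kernel stack of a specific blocked task, e.g. jbd2/rbd0-8
sudo cat /proc/126770/stack
---------------------------------------------------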

Maria, when you hit the hang, had you deployed the containerized OpenStack application helm tarball?

BTW, for Khalil: I did not actually remove your tags. I probably added the stx.storage tag without refreshing the page, by which time you had already added several tags :)

Cindy Xie (xxie1) on 2019-05-22
tags: removed: stx.distro.other
Cindy Xie (xxie1) wrote :

@Tingjie to add the build # for the one instance where the issue was reproduced. It may be related to a kernel driver.
@Ovidiu updated the analysis results: ceph-osd was killed multiple times, and this needs to be retested. If the user has more than 1 OSD in the system, the issue will not occur.

tags: removed: stx.2.0
Ghada Khalil (gkhalil) wrote :

Since Cindy removed the stx.2.0 release tag, I'm setting the priority to Low as this bug is no longer considered gating.

Changed in starlingx:
importance: Medium → Low
Cindy Xie (xxie1) on 2019-06-10
Changed in starlingx:
status: In Progress → Triaged