ceph-osd process hung and blocked ps listings
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ceph (Ubuntu) | Expired | Low | Unassigned |
linux (Ubuntu) | Expired | Undecided | Unassigned |
Bug Description
Over the past couple of days, 2 different ceph-osd nodes crashed in such a way that ps listings hung when enumerating the affected process. Both had a call trace associated with them:
Node 1:
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.493011] ceph-osd D ffff882029a67b90 0 5312 1 0x00000004
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.590564] ffff882029a67b90 ffff881037cb8000 ffff8820284f3700 ffff882029a68000
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.688603] ffff88203296e5a8 ffff88203296e5c0 0000000000000015 ffff8820284f3700
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.789329] ffff882029a67ba8 ffffffff817ec495 ffff8820284f3700 ffff882029a67bf8
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.891376] Call Trace:
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.939271] [<ffffffff817ec
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.989957] [<ffffffff817ee
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.041502] [<ffffffff813db
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.092616] [<ffffffff813db
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.141510] [<ffffffff813db
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.189877] [<ffffffff817ee
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.237513] [<ffffffff81067
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.285588] [<ffffffff81067
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.332936] [<ffffffff817f1
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.379495] [<ffffffff813db
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.426400] [<ffffffff81039
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.474904] [<ffffffff8102d
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.563204] [<ffffffff8102d
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.609783] [<ffffffff816d1
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.656633] [<ffffffff811b2
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.703785] [<ffffffff81078
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.751367] [<ffffffff81003
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.799369] [<ffffffff817ef
Node 2:
[733869.727139] CPU: 17 PID: 1735127 Comm: ceph-osd Not tainted 4.4.0-15-generic #31-Ubuntu
[733869.796954] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.3.6 06/03/2015
[733869.927182] task: ffff881841dc6e00 ti: ffff8810cc0a0000 task.ti: ffff8810cc0a0000
[733870.059139] RIP: 0010:[<
[733870.192753] RSP: 0000:ffff8810cc
[733870.260298] RAX: 0000000000000000 RBX: ffff8810cc0a3c78 RCX: 0000000000000012
[733870.389322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8810210a0e00
[733870.517883] RBP: ffff8810cc0a3c40 R08: 0000000000000006 R09: 000000000000013e
[733870.646335] R10: 00000000000003b4 R11: 000000000000001f R12: ffff881018118000
[733870.774514] R13: 0000000000000006 R14: ffff8810210a0e00 R15: 0000000000000379
[733870.902262] FS: 00007fdcfab0370
[733871.031347] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[733871.097820] CR2: 00007fdcfab02c20 CR3: 0000001029204000 CR4: 00000000001406e0
[733871.223381] Stack:
[733871.282453] ffff8810cc0a3c40 ffffffff811f04ce ffff88102f6e9680 0000000000000012
[733871.404947] 0000000000000077 000000000000008f 0000000000016d40 0000000000000006
[733871.527250] ffff881841dc6e00 ffff8810cc0a3c78 00000000000001ac 00000000000001b8
[733871.649648] Call Trace:
[733871.707884] [<ffffffff811f0
[733871.770946] [<ffffffff810b5
[733871.832808] [<ffffffff811c9
[733871.893897] [<ffffffff810b5
[733871.954283] [<ffffffff810b9
[733872.013128] [<ffffffff811bd
[733872.071309] [<ffffffff81101
[733872.128149] [<ffffffff81228
[733872.184044] [<ffffffff8106a
[733872.239300] [<ffffffff8106a
[733872.293001] [<ffffffff81824
[733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff
[733872.507088] RIP [<ffffffff810b4
[733872.559965] RSP <ffff8810cc0a3bd8>
[733872.673773] ---[ end trace aec37273a19e57dc ]---
In the ceph logs for node 1 there is:
./include/
ceph version 0.94.7 (d56bdf93ced6b8
1: (ceph::
2: (()+0x4892b8) [0x56042e9512b8]
3: (boost:
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost:
0>::react_
4: (boost:
d_events()+0x127) [0x56042e9646e7]
5: (boost:
(boost:
6: (ReplicatedPG:
7: (OSD::SnapTrimW
8: (ThreadPool:
9: (ThreadPool:
10: (()+0x8184) [0x7f27ecc66184]
11: (clone()+0x6d) [0x7f27eb1d137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Unfortunately the only way we could get the processes to respond again was to reboot the systems.
Is there any way of figuring out what went wrong here?
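One thing that might help narrow this down on a node that is still in the hung state is to list tasks stuck in uninterruptible sleep (the `D` state shown in the Node 1 trace) without going through ps. The sketch below is only an illustration: it assumes ps is blocking while reading per-process details such as `/proc/<pid>/cmdline`, and that `/proc/<pid>/stat` stays readable, neither of which has been confirmed for this bug. The function names are made up for this example.

```python
import os


def parse_stat_line(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren].strip())
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    # Hypothetical usage: print every task in uninterruptible sleep.
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If a ceph-osd PID shows up here, its kernel-side stack (where readable) and the hung-task messages in the kernel log would be the next things to compare against the traces above.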
$ lsb_release -rd
Description: Ubuntu 14.04.4 LTS
Release: 14.04
$ dpkg-query -W ceph
ceph 0.94.7-
Status changed to 'Confirmed' because the bug affects multiple users.