ceph-osd process hung and blocked ps listings

Bug #1599681 reported by Brad Marshall
This bug affects 3 people
Affects          Status    Importance  Assigned to  Milestone
ceph (Ubuntu)    Expired   Low         Unassigned
linux (Ubuntu)   Expired   Undecided   Unassigned

Bug Description

Over the past couple of days we have had two different ceph-osd nodes crash in such a way that ps listings hang when they try to enumerate the affected process. Both nodes logged a call trace:

Node 1:
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.493011] ceph-osd D ffff882029a67b90 0 5312 1 0x00000004
Jul 4 07:46:15 provider-cs-03 kernel: [4188396.590564] ffff882029a67b90 ffff881037cb8000 ffff8820284f3700 ffff882029a68000
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.688603] ffff88203296e5a8 ffff88203296e5c0 0000000000000015 ffff8820284f3700
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.789329] ffff882029a67ba8 ffffffff817ec495 ffff8820284f3700 ffff882029a67bf8
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.891376] Call Trace:
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.939271] [<ffffffff817ec495>] schedule+0x35/0x80
Jul 4 07:46:16 provider-cs-03 kernel: [4188396.989957] [<ffffffff817eeb6a>] rwsem_down_read_failed+0xea/0x120
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.041502] [<ffffffff813dbd84>] call_rwsem_down_read_failed+0x14/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.092616] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.141510] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.189877] [<ffffffff817ee1f0>] ? down_read+0x20/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.237513] [<ffffffff81067f18>] __do_page_fault+0x398/0x430
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.285588] [<ffffffff81067fd2>] do_page_fault+0x22/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.332936] [<ffffffff817f1e78>] page_fault+0x28/0x30
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.379495] [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.426400] [<ffffffff81039c58>] copy_fpstate_to_sigframe+0x118/0x1d0
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.474904] [<ffffffff8102d1fd>] get_sigframe.isra.7.constprop.9+0x12d/0x150
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.563204] [<ffffffff8102d698>] do_signal+0x1e8/0x6d0
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.609783] [<ffffffff816d19f2>] ? __sys_sendmsg+0x42/0x80
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.656633] [<ffffffff811b2ed0>] ? handle_mm_fault+0x250/0x540
Jul 4 07:46:16 provider-cs-03 kernel: [4188397.703785] [<ffffffff8107884c>] exit_to_usermode_loop+0x59/0xa2
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.751367] [<ffffffff81003a6e>] syscall_return_slowpath+0x4e/0x60
Jul 4 07:46:17 provider-cs-03 kernel: [4188397.799369] [<ffffffff817efe58>] int_ret_from_sys_call+0x25/0x8f

Node 2:
[733869.727139] CPU: 17 PID: 1735127 Comm: ceph-osd Not tainted 4.4.0-15-generic #31-Ubuntu
[733869.796954] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.3.6 06/03/2015
[733869.927182] task: ffff881841dc6e00 ti: ffff8810cc0a0000 task.ti: ffff8810cc0a0000
[733870.059139] RIP: 0010:[<ffffffff810b479d>] [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733870.192753] RSP: 0000:ffff8810cc0a3bd8 EFLAGS: 00010257
[733870.260298] RAX: 0000000000000000 RBX: ffff8810cc0a3c78 RCX: 0000000000000012
[733870.389322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8810210a0e00
[733870.517883] RBP: ffff8810cc0a3c40 R08: 0000000000000006 R09: 000000000000013e
[733870.646335] R10: 00000000000003b4 R11: 000000000000001f R12: ffff881018118000
[733870.774514] R13: 0000000000000006 R14: ffff8810210a0e00 R15: 0000000000000379
[733870.902262] FS: 00007fdcfab03700(0000) GS:ffff88203e600000(0000) knlGS:0000000000000000
[733871.031347] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[733871.097820] CR2: 00007fdcfab02c20 CR3: 0000001029204000 CR4: 00000000001406e0
[733871.223381] Stack:
[733871.282453] ffff8810cc0a3c40 ffffffff811f04ce ffff88102f6e9680 0000000000000012
[733871.404947] 0000000000000077 000000000000008f 0000000000016d40 0000000000000006
[733871.527250] ffff881841dc6e00 ffff8810cc0a3c78 00000000000001ac 00000000000001b8
[733871.649648] Call Trace:
[733871.707884] [<ffffffff811f04ce>] ? migrate_page_copy+0x21e/0x530
[733871.770946] [<ffffffff810b501e>] task_numa_migrate+0x43e/0x9b0
[733871.832808] [<ffffffff811c9700>] ? page_add_anon_rmap+0x10/0x20
[733871.893897] [<ffffffff810b5609>] numa_migrate_preferred+0x79/0x80
[733871.954283] [<ffffffff810b9c24>] task_numa_fault+0x7f4/0xd40
[733872.013128] [<ffffffff811bdf90>] handle_mm_fault+0xbc0/0x1820
[733872.071309] [<ffffffff81101420>] ? do_futex+0x120/0x500
[733872.128149] [<ffffffff812288c5>] ? __fget_light+0x25/0x60
[733872.184044] [<ffffffff8106a537>] __do_page_fault+0x197/0x400
[733872.239300] [<ffffffff8106a7c2>] do_page_fault+0x22/0x30
[733872.293001] [<ffffffff81824178>] page_fault+0x28/0x30
[733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff
[733872.507088] RIP [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
[733872.559965] RSP <ffff8810cc0a3bd8>
[733872.673773] ---[ end trace aec37273a19e57dc ]---

In the ceph logs for node 1 there is:

./include/interval_set.h: 340: FAILED assert(0)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x56042ebdeceb]
 2: (()+0x4892b8) [0x56042e9512b8]
 3: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb2) [0x56042e97a8d2]
 4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x127) [0x56042e9646e7]
 5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x56042e9648b4]
 6: (ReplicatedPG::snap_trimmer()+0x52c) [0x56042e8eb5dc]
 7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x56042e7807da]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x56042ebcf8d6]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x56042ebd0980]
 10: (()+0x8184) [0x7f27ecc66184]
 11: (clone()+0x6d) [0x7f27eb1d137d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
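
For what it's worth, frame 2 above is only a raw offset. If anyone has the matching 0.94.7 ceph-osd binary, something along these lines should turn it back into a symbol (a rough sketch; /usr/bin/ceph-osd is just where we'd expect the binary to be, and without debug symbols this gives at best a function name, not a source line):

$ objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis    # disassemble the binary, as the NOTE above suggests
$ addr2line -Cfe /usr/bin/ceph-osd 0x4892b8        # frame 2's offset; may need adjusting for the load address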

Unfortunately the only way we could get the processes to respond again was to reboot the systems.
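
If this happens again we'll try to grab more state before rebooting; something like the following should dump the blocked tasks and CPU backtraces to dmesg (a rough sketch, assuming sysrq isn't restricted on these hosts):

$ echo 1 | sudo tee /proc/sys/kernel/sysrq    # enable all sysrq functions if they are restricted
$ echo w | sudo tee /proc/sysrq-trigger       # dump tasks stuck in uninterruptible (D) state
$ echo l | sudo tee /proc/sysrq-trigger       # backtrace all active CPUs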

Is there any way of figuring out what went wrong here?

$ lsb_release -rd
Description: Ubuntu 14.04.4 LTS
Release: 14.04

$ dpkg-query -W ceph
ceph 0.94.7-0ubuntu0.15.04.1~cloud0
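
For the record, the stuck tasks could still be spotted while ps was wedged by reading /proc/<pid>/stat directly; our assumption is that ps blocks reading /proc/<pid>/cmdline, which needs the hung task's mmap_sem, whereas stat does not. A minimal sketch:

$ for f in /proc/[0-9]*/stat; do awk '$3=="D" {print $1, $2}' "$f" 2>/dev/null; done    # field 3 is the task state; D means uninterruptible sleep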

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

Tricky; is this something you've seen regularly? Some googling suggests this can be indicative of other issues on the server causing problems for the ceph processes, but that's only a hint.

We don't publish debug symbols (yet) for the UCA, so getting a complete stack trace is a non-starter.

Changed in ceph (Ubuntu):
importance: Undecided → Low
Revision history for this message
James Page (james-page) wrote :

Marking 'Low' for now as I've not seen either a) any other bug reports of this type or b) any reported bugs upstream which are similar.

Revision history for this message
James Page (james-page) wrote :

It does look like there was some sort of underlying kernel fault that might be the cause of the issue.

Changed in ceph (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1599681

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for ceph (Ubuntu) because there has been no activity for 60 days.]

Changed in ceph (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired