ceph-osd early start interferes with kdump-tools during kernel dump

Bug #1461429 reported by Louis Bouchard on 2015-06-03
This bug affects 2 people
Affects                Status        Importance  Assigned to  Milestone
ceph (Ubuntu)          Invalid       Medium      Unassigned
  Trusty               Fix Released  Medium      Unassigned

Bug Description

When a kernel crash occurs on a system with kdump-tools configured and enabled, kexec boots the preloaded crash (capture) kernel, which starts kdump-tools to capture the kernel dump.

On systems running Ceph OSDs, the configured OSDs start even when kdump-tools is set up to start very early in the boot sequence. Even replacing the kdump-tools SysV init script with an upstart job that runs before the runlevel event is not sufficient.

The /etc/init/ceph-osd.conf job still runs while kdump is capturing the dump, triggering the OOM killer in the capture kernel, which only has the small amount of memory reserved with the crashkernel= boot parameter.
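A start hook that wanted to hold OSDs back would first need to recognise that it is running in the kdump capture kernel. A conventional indicator is the presence of /proc/vmcore, which a normal boot does not expose, or an elfcorehdr= parameter on the kernel command line. The sketch below assumes those two checks are sufficient; the function name in_kdump_kernel and its use as an OSD start guard are hypothetical, not part of ceph or kdump-tools.

```python
from pathlib import Path


def in_kdump_kernel(proc_root: str = "/proc") -> bool:
    """Return True if this boot looks like a kdump capture kernel.

    Assumption: the capture kernel exposes <proc_root>/vmcore and is booted
    with an elfcorehdr= parameter; a normal boot has neither.
    """
    proc = Path(proc_root)
    if (proc / "vmcore").exists():
        return True
    try:
        cmdline = (proc / "cmdline").read_text().split()
    except OSError:
        # No readable command line: treat the boot as a normal one.
        return False
    return any(arg.startswith("elfcorehdr=") for arg in cmdline)


if __name__ == "__main__":
    # Hypothetical guard an OSD start hook could call before launching daemons.
    print("kdump capture kernel" if in_kdump_kernel() else "normal boot")
```

Taking proc_root as a parameter keeps the check testable against a fake directory instead of the live /proc.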

Louis Bouchard (louis) on 2015-06-03
Changed in ceph (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Louis Bouchard (louis-bouchard)
Louis Bouchard (louis) wrote :

Here is an example of a captured session:

[ 399.597207] SysRq : Trigger a crash
[ 399.599050] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 399.600421] IP: [<ffffffff81457fc6>] sysrq_handle_crash+0x16/0x20
[ 399.600745] PGD 3758e067 PUD 3d616067 PMD 0
[ 399.600745] Oops: 0002 [#1] SMP
...
 * Starting enable remaining boot-time encrypted block devices [ OK ]
Cloud-init v. 0.7.5 running 'init' at Wed, 03 Jun 2015 07:33:31 +0000. Up 8.83 seconds.
ci-info: ++++++++++++++++++++++++Net device info++++++++++++++++++++++++
ci-info: +--------+------+-----------+-------------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+-------------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | 10.5.1.39 | 255.255.0.0 | fa:16:3e:09:27:0b |
ci-info: +--------+------+-----------+-------------+-------------------+
ci-info: +++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++
ci-info: +-------+-------------+----------+-------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+----------+-------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.5.0.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.5.0.0 | 0.0.0.0 | 255.255.0.0 | eth0 | U |
ci-info: +-------+-------------+----------+-------------+-----------+-------+
 * Starting Ceph OSD [ OK ] <<<<<<<<<<<<<<<
 * Starting Signal sysvinit that local filesystems are mounted [ OK ]
 * Starting configure network device security [ OK ]
 * Starting flush early job output to logs [ OK ]
...
 * Starting Ceph OSD (start all instances) [ OK ]
 * Starting regular background program processing daemon [ OK ]
 * Starting deferred execution scheduler [ OK ]
Starting kdump-tools: * Starting Ceph MON (start all instances) [ OK ]
 * Stopping save kernel messages [ OK ]
 * Starting Ceph MON [ OK ]
 * Starting automatic crash report generation [ OK ]
 * Stopping CPU interrupts balancing daemon [ OK ]
 * running makedumpfile -c -d 31 /proc/vmcore /var/crash/201506030733/dump-incomplete
 * Stopping Ceph MON (start all instances) [ OK ]
 * Starting Ceph monitor (all instances) [ OK ]
The kernel version is not supported.
The created dumpfile may be incomplete.
cyclic buffer size has been changed: 32767 => 32640
Excluding unnecessary pages : [ 0.0 %] /
* Starting Create Ceph client.admin key when possible [ OK ]
Excluding unnecessary pages : [100.0 %] |
Excluding unnecessary pages : [100.0 %] \
Excluding unnecessary pages : [100.0 %] -
Excluding unnecessary pages : [100.0 %] /
Excluding unnecessary pages : [100.0 %] |
 * Starting OpenSSH server [ OK ]
Copying data : [ 9.4 %] \
Copying data : [ 25.1 %] -
Copying data : [ 47.2 %] /
Copying data : [...

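In the makedumpfile command line in the log, -c compresses the dumped pages with zlib and -d 31 is the dump level, which selects the page types to leave out of the dump. As documented in the makedumpfile man page, level 1 excludes zero pages, 2 cache pages without private data, 4 all cache pages (with or without private data), 8 user process data pages and 16 free pages; levels combine by adding them, so 31 = 1+2+4+8+16 excludes every listed type. The decoder below is an illustrative sketch of that table; the helper name excluded_page_types is hypothetical.

```python
# Dump-level bits as listed in the makedumpfile man page.
ZERO, CACHE, CACHE_PRIVATE, USER_DATA, FREE = 1, 2, 4, 8, 16


def excluded_page_types(level: int) -> list:
    """Return the page types a makedumpfile -d level leaves out of the dump."""
    if not 0 <= level <= 31:
        raise ValueError(f"dump level must be in 0..31, got {level}")
    excluded = []
    if level & ZERO:
        excluded.append("zero pages")
    # Bit 4 drops all cache pages, so it also covers the non-private ones.
    if level & (CACHE | CACHE_PRIVATE):
        excluded.append("cache pages without private data")
    if level & CACHE_PRIVATE:
        excluded.append("cache pages with private data")
    if level & USER_DATA:
        excluded.append("user process data pages")
    if level & FREE:
        excluded.append("free pages")
    return excluded


if __name__ == "__main__":
    for page_type in excluded_page_types(31):
        print("excluded:", page_type)
```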

Changed in ceph (Ubuntu Trusty):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Louis Bouchard (louis-bouchard)
Changed in ceph (Ubuntu):
assignee: Louis Bouchard (louis-bouchard) → nobody
status: Confirmed → Invalid
Louis Bouchard (louis) wrote :

Marking the development release as Invalid, since Wily uses systemd.

Chris J Arges (arges) on 2015-06-03
summary: - ceph-osd early start intefere with kdump-tools during kernel dump
+ ceph-osd early start interferes with kdump-tools during kernel dump
Louis Bouchard (louis) wrote :

This happens because Ceph OSDs are started by the udev rule /lib/udev/rules.d/95-ceph-osd.rules. This rule emits the ceph-osd signal that triggers the job to start.

The problem is not with Ceph but with how kdump-tools relies on SysV init script ordering to run early: the OSD start is driven by udev, so it is not held back by where kdump-tools sits in that ordering.
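The decision to start an OSD depends only on the udev event for the OSD's disk partition, not on runlevel or init-script order. The sketch below models that decision in Python. It assumes the rule matches block partitions whose GPT partition type GUID is the one ceph-disk assigns to OSD data partitions (4fbd7e29-9d25-41b8-afd0-062c0ceff05d); the function osd_rule_matches and its in_crash_kernel flag are hypothetical illustrations of where a guard could sit, not the contents of the shipped rule.

```python
from typing import Mapping

# GPT partition type GUID ceph-disk uses for OSD data partitions (assumption).
CEPH_OSD_DATA_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"


def osd_rule_matches(env: Mapping[str, str], in_crash_kernel: bool = False) -> bool:
    """Model whether an OSD-activation udev rule would fire for this event.

    env holds udev properties such as ACTION, SUBSYSTEM, DEVTYPE and
    ID_PART_ENTRY_TYPE. Nothing here looks at runlevel or init-script order,
    which is why ordering kdump-tools early does not hold the OSD back.
    in_crash_kernel is a hypothetical guard, not part of the real rule.
    """
    if in_crash_kernel:
        return False
    return (
        env.get("ACTION") == "add"
        and env.get("SUBSYSTEM") == "block"
        and env.get("DEVTYPE") == "partition"
        and env.get("ID_PART_ENTRY_TYPE", "").lower() == CEPH_OSD_DATA_GUID
    )


if __name__ == "__main__":
    event = {
        "ACTION": "add",
        "SUBSYSTEM": "block",
        "DEVTYPE": "partition",
        "ID_PART_ENTRY_TYPE": CEPH_OSD_DATA_GUID,
    }
    print(osd_rule_matches(event))  # fires on a normal boot
    print(osd_rule_matches(event, in_crash_kernel=True))  # suppressed by the guard
```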

Louis Bouchard (louis) on 2015-11-27
Changed in ceph (Ubuntu Trusty):
status: Confirmed → Fix Released
assignee: Louis Bouchard (louis-bouchard) → nobody