Comment 0 for bug 1572630

Revision history for this message
Dale Hamel (daleha-gmail) wrote :

We discovered a pretty serious regression introduced in 4.4.0-18.

At boot-time, the kernel will panic somewhere in 'blk_mq_register_disk', a snippet of the track is below, and full panic dump is attached. The panic dump was collected via serial console, as the kernel panics so early that we cannot kdump it.

[ 2.650512] [<ffffffff813ac8a6>] blk_mq_register_disk+0xa6/0x160
[ 2.656675] [<ffffffff813a1b44>] blk_register_queue+0xb4/0x160
[ 2.662661] [<ffffffff813af53e>] add_disk+0x1ce/0x490
[ 2.667869] [<ffffffff815477e0>] loop_add+0x1f0/0x270

This seems somewhat similar to https://lkml.org/lkml/2016/3/16/40, but the trace is not identical.

We discovered this issue when we were experimenting with linux-generic-lts-xenial from trusty-updates on a 14.04 installation. When we installed it, 4.4.0-15 was the current package, and it worked fine and provided a large amount of improvements for us. Background security updates installed 4.4.0-18, and this updated and grub and became the default kernel. On a reboot, the node panics about 2 seconds in, resulting in a machine in a dead state. We were able to boot a rescue image and roll bac kto 4.4.0-15, which works nicely. We currently have pinning on 4.4.0-15 to prevent this problem from coming back, but would prefer to see the problem fixed.

I'll attach lspci, lshw, and dmidecode for our hardware as well, but this is happening on pretty vanilla supermicro nodes. We are able to consistently reproduce it on our hardware. It is not reproducible in EC2, only on metal.