Bionic Server ISO soft lockup on Dell C6420 in swapper - Xenial appears OK

Bug #1773100 reported by A. Karl Kornel on 2018-05-24
This bug affects 1 person
Affects           Importance   Assigned to
linux (Ubuntu)    High         Unassigned
Bionic            High         Unassigned

Bug Description

Hello!

We are experiencing a soft lockup when booting the Bionic Server installer amd64 ISO (kernel 4.15.0-20). The installer seems to hang on boot: after GRUB, nothing ever appears on the display, and we only managed to get kernel logs through Serial-Over-LAN. The 16.04.4 Server installer ISO (kernel 4.4.0-16) does not seem to have this problem.

Our hardware is a Dell C6420. The C6420 is one node in a four-node, 2U chassis (the C6400 chassis). The node has two Intel Xeon Gold 6134 CPUs (8 cores @ 3.2 GHz/core) and 96 GB of RAM, as twelve 8 GB DIMMs. We have tested with Hyperthreading on and off, and with UEFI and BIOS boot modes, and we get the same results in every case.

Right now, we have Hyperthreading on, and we are booting in UEFI mode. If you need us to change that for testing, let us know!

Here are the first 25 lines of the soft lockup:

[ 40.544002] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [swapper/0:1]
[ 40.628002] Modules linked in:
[ 40.664000] CPU: 24 PID: 1 Comm: swapper/0 Not tainted 4.15.0-20-generic #21-Ubuntu
[ 40.756002] Hardware name: Dell Inc. PowerEdge C6420/0K2TT6, BIOS 1.3.7 02/09/2018
[ 40.844002] RIP: 0010:smp_call_function_many+0x229/0x250
[ 40.908002] RSP: 0000:ffffaa61000eb868 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 41.000000] RAX: 0000000000000004 RBX: ffff8eba9f5238c0 RCX: 0000000000000001
[ 41.084002] RDX: ffff8eba9f2a8f60 RSI: 0000000000000000 RDI: ffff8eba96754de0
[ 41.168002] RBP: ffffaa61000eb8a0 R08: fffffffffffffff0 R09: 00000000feffffff
[ 41.252003] R10: ffffecb61f56b380 R11: 0000000000000004 R12: 0000000000000100
[ 41.340002] R13: 0000000000023880 R14: ffffffffb4035030 R15: 0000000000000000
[ 41.424002] FS: 0000000000000000(0000) GS:ffff8eba9f500000(0000) knlGS:0000000000000000
[ 41.520002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.588002] CR2: 0000000000000000 CR3: 000000018c00a001 CR4: 00000000007606e0
[ 41.676002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.760004] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.844002] PKRU: 00000000
[ 41.876002] Call Trace:
[ 41.908002] ? nsio_rw_bytes+0xd9/0x250
[ 41.952002] ? cpumask_weight+0x20/0x20
[ 42.000002] ? nsio_rw_bytes+0xda/0x250
[ 42.044003] ? quirk_intel_brickland_xeon_ras_cap+0x60/0x60
[ 42.112002] on_each_cpu+0x2d/0x60
[ 42.152000] ? nsio_rw_bytes+0xd9/0x250
[ 42.196003] text_poke_bp+0x6a/0xf0
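
In case it helps anyone triaging a long Serial-Over-LAN capture, here is a minimal Python sketch that pulls the watchdog header lines out of a saved log. It is not part of the original report: the function name `find_soft_lockups` and the regular expression are illustrative assumptions, based only on the format of the first line of the excerpt above.

```python
import re

# Assumed format, taken from the first line of the excerpt above:
# [ 40.544002] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [swapper/0:1]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup header line found in log_text.

    Hypothetical helper for illustration; it only recognises the
    header line, not the register dump or call trace that follow it.
    """
    reports = []
    for line in log_text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            reports.append({
                "timestamp": float(match["ts"]),
                "cpu": int(match["cpu"]),
                "stuck_seconds": int(match["secs"]),
                "comm": match["comm"],
                "pid": int(match["pid"]),
            })
    return reports


sample = (
    "[ 40.544002] watchdog: BUG: soft lockup - CPU#24 stuck for 22s! "
    "[swapper/0:1]\n"
    "[ 40.628002] Modules linked in:\n"
)
lockups = find_soft_lockups(sample)
print(lockups)
```

Running it over a full capture makes it easy to see whether the same CPU and task keep getting reported, or whether the lockup moves around.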

The full output is attached as "kernel 4.15.0-20 boot.txt". We used the following kernel command line:

BOOT_IMAGE=/install/vmlinuz file=/cdrom/preseed/ubuntu-server.seed console=tty0 console=ttyS0 console=ttyS1 debug ---

As I mentioned up top, we have started installing from the Ubuntu Server 16.04.4 ISO, and thus far the kernel is booting and we have gotten through to the text-based installer.

I'm sorry that I couldn't use the normal kernel bug-reporting process, but since we're hitting this problem in the installer, I'm not sure what else we can do.

I also apologize that I don't have any more info, since it's after midnight local time, but I wanted to make sure I got this information through to you ASAP.

At this point, we're probably going to move forward with the 16.04.4 Server installer for now, but we do have an identical C6420 node (in a different chassis), and we will probably be able to use that for testing for at least a little while! For example, we were thinking of trying the 16.04 HWE kernel, as an additional data point.

So, please let us know what additional information you would like, and what else we can try. Thanks very much!

affects: linux-meta (Ubuntu) → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1773100

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Adam Seishas (aseishas) wrote :

The lockup occurs too early in the boot process to use apport-collect.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Can you see if this issue also happens with the latest daily image? It can be downloaded from:
Desktop:
http://cdimage.ubuntu.com/daily-live/current/
Server:
http://cdimage.ubuntu.com/ubuntu-server/daily/current/

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Confirmed
tags: added: kernel-da-key
Adam Seishas (aseishas) wrote :

Same behavior using the latest cosmic-server-amd64.iso daily.

Adam Seishas (aseishas) wrote :

This bug appears to have been fixed in a recent kernel, perhaps 4.15.0-33 or 4.15.0-34.
