Random kernel panics in QEMU instances when running on EPYC architecture

Bug #1834505 reported by Louis Bouchard
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We have been seeing many kernel panics in QEMU instances on newly deployed servers, all running on the EPYC architecture. Many of the panics occur shortly after the QEMU process starts or within a few hours.

All the servers are running an up-to-date Bionic (Ubuntu 18.04). After the first few panics on 4.15 kernels, we upgraded to 4.18 and still had panics.

The only workloads running on the underlying servers are QEMU processes. Here is a typical backtrace of a kernel panic:

[58034.598930] BUG: unable to handle kernel paging request at ffff943276c49f64
[58034.612039] PGD 8363067 P4D 8363067 PUD 8367067 PMD 759ac063 PTE 8000000076c49163
[58034.612992] Oops: 0000 [#1] SMP NOPTI
[58034.613462] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-20-generic #21~18.04.1-Ubuntu
[58034.614685] Hardware name: Scaleway Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
[58034.616385] Code: 4b a1 93 00 41 83 a4 24 98 09 00 00 03 4c 89 e7 48 89 45 d8 c7 45 e0 00 00 00 00 e8 ef ca ff ff 48 8d 73 d0 48 83 fe d0 74 2c <0f> b6 96 64 08 00 00 48 8b 46 30 48 8d 4d d8 4c 89 e7 48 8d 5>
[58034.618537] RSP: 0018:ffff94327f803f90 EFLAGS: 00010087
[58034.619134] RAX: 000034c83ba48a83 RBX: ffff943276c49730 RCX: 00000000ffffffff
[58034.619992] RDX: 0000000000002a12 RSI: ffff943276c49700 RDI: ffff94327fe3e000
[58034.620913] RBP: ffff94327f803fb8 R08: 00003753c7b87c93 R09: 0000000000000000
[58034.621835] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94327f822c00
[58034.622723] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[58034.623539] FS: 0000000000000000(0000) GS:ffff94327f800000(0000) knlGS:0000000000000000
[58034.624513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58034.625214] CR2: ffff943276c49f64 CR3: 0000000076544000 CR4: 00000000003406f0
[58034.626268] Call Trace:
[58034.626605] <IRQ>
[58034.626924] scheduler_ipi+0xa9/0x130
[58034.627400] smp_reschedule_interrupt+0x39/0xe0
[58034.627925] reschedule_interrupt+0xf/0x20
[58034.628398] </IRQ>
[58034.628659] RIP: 0010:rcu_idle_exit+0x40/0x70
[58034.629181] Code: fa 66 0f 1f 44 00 00 48 c7 c3 80 a6 01 00 65 48 03 1d ec 4b 50 4e 48 8b 03 48 85 c0 74 16 48 83 c0 01 48 89 03 4c 89 e7 57 9d <0f> 1f 44 00 00 5b 41 5c 5d c3 e8 a1 c6 ff ff 48 b8 00 00 00 0>
[58034.631577] RSP: 0018:ffffffffb2e03e48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[58034.632566] RAX: 4000000000000000 RBX: ffff94327f81a680 RCX: ffff94327f81a680
[58034.633557] RDX: 0000000000000000 RSI: ffff94327f81a680 RDI: 0000000000000202
[58034.634416] RBP: ffffffffb2e03e58 R08: 000000000001cc00 R09: 0000000000000001
[58034.635292] R10: ffffffffb2e03d98 R11: 0000000000000000 R12: 0000000000000202
[58034.636107] R13: 0000000000000000 R14: 0000000000000000 R15: 000000007e369c93
[58034.637052] do_idle+0x13f/0x280
[58034.637453] cpu_startup_entry+0x73/0x80
[58034.638067] rest_init+0xae/0xb0
[58034.638506] start_kernel+0x539/0x55a
[58034.638938] x86_64_start_reservations+0x24/0x26
[58034.639470] x86_64_start_kernel+0x74/0x77
[58034.639958] secondary_startup_64+0xa5/0xb0
[58034.640441] Modules linked in: cfg80211 veth xt_nat xt_tcpudp ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fi>
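
As a sanity check on the trace above: the bytes at the faulting RIP (`<0f> b6 96 64 08 00 00`) decode to a one-byte load from offset 0x864 relative to RSI, and RSI + 0x864 matches the address reported in CR2. The sketch below redoes that arithmetic and decodes the `Oops: 0000` error code using the standard x86 page-fault bit layout. The function and constant names are illustrative and do not come from the report.

```python
# Values copied from the oops above.
RSI = 0xFFFF943276C49700
CR2 = 0xFFFF943276C49F64
# Displacement taken from the opcode bytes "0f b6 96 64 08 00 00",
# i.e. movzbl 0x864(%rsi),%edx.
DISPLACEMENT = 0x864


def decode_page_fault_error(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 0x01),  # False: page not present
        "write": bool(code & 0x02),                 # False: read access
        "user_mode": bool(code & 0x04),             # False: kernel mode
        "reserved_bit": bool(code & 0x08),
        "instruction_fetch": bool(code & 0x10),
    }


def faulting_address(base: int, displacement: int) -> int:
    """Effective address of a base+displacement operand, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    print(hex(faulting_address(RSI, DISPLACEMENT)))
    print(decode_page_fault_error(0x0000))
```

Read this way, the oops describes a kernel-mode read of a page reported as not present, although the PTE value in the page-table dump has its present bit (bit 0) set. This may be worth noting for whoever triages the report.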

None of the host servers themselves has experienced a kernel panic.

After analysing many of the crashes and seeing that many of them occurred while handling interrupts, we decided to reboot the instances with the noapic kernel parameter.

No kernel panic has been seen since, but this is only a workaround until a proper fix is found.
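
One way to confirm that a guest actually booted with the workaround is to look for `noapic` in `/proc/cmdline`. A minimal Python sketch, where `has_kernel_param` is a hypothetical helper name and not part of any tool mentioned in this report:

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a whole boot parameter.

    Matches both the bare form ("noapic") and the "name=value" form,
    but not longer parameters that merely start with the same letters.
    """
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    cmdline = Path("/proc/cmdline").read_text()
    print("noapic set:", has_kernel_param(cmdline, "noapic"))
```

On Ubuntu the parameter is usually made persistent by adding it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running update-grub. The exact procedure used on the affected instances is not stated in the report.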

AMD has been informed of the issue.

TIA,

...Louis (aka Caribou)

Tags: cosmic
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1834505

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: cosmic
Louis Bouchard (louis)
description: updated
Louis Bouchard (louis)
Changed in linux (Ubuntu):
status: Incomplete → Opinion
status: Opinion → Confirmed