Activity log for bug #1834505

Date Who What changed Old value New value Message
2019-06-27 15:59:15 Louis Bouchard bug added bug
2019-06-27 16:00:07 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2019-06-27 16:00:08 Ubuntu Kernel Bot tags cosmic
2019-06-27 16:00:21 Louis Bouchard description

Old value:

We have been seeing many kernel panic in QEMU instances on newly deployed servers all running on the EPYC architecture. Many of the KP occur early after the start of the QEMU process or within a few hours.

The only thing running on the underlying server is QEMU processes.

Here is a typical backtrace of a kernel panic :

[58034.598930] BUG: unable to handle kernel paging request at ffff943276c49f64
[58034.612039] PGD 8363067 P4D 8363067 PUD 8367067 PMD 759ac063 PTE 8000000076c49163
[58034.612992] Oops: 0000 [#1] SMP NOPTI
[58034.613462] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-20-generic #21~18.04.1-Ubuntu
[58034.614685] Hardware name: Scaleway Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
[58034.616385] Code: 4b a1 93 00 41 83 a4 24 98 09 00 00 03 4c 89 e7 48 89 45 d8 c7 45 e0 00 00 00 00 e8 ef ca ff ff 48 8d 73 d0 48 83 fe d0 74 2c <0f> b6 96 64 08 00 00 48 8b 46 30 48 8d 4d d8 4c 89 e7 48 8d 5>
[58034.618537] RSP: 0018:ffff94327f803f90 EFLAGS: 00010087
[58034.619134] RAX: 000034c83ba48a83 RBX: ffff943276c49730 RCX: 00000000ffffffff
[58034.619992] RDX: 0000000000002a12 RSI: ffff943276c49700 RDI: ffff94327fe3e000
[58034.620913] RBP: ffff94327f803fb8 R08: 00003753c7b87c93 R09: 0000000000000000
[58034.621835] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94327f822c00
[58034.622723] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[58034.623539] FS: 0000000000000000(0000) GS:ffff94327f800000(0000) knlGS:0000000000000000
[58034.624513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58034.625214] CR2: ffff943276c49f64 CR3: 0000000076544000 CR4: 00000000003406f0
[58034.626268] Call Trace:
[58034.626605] <IRQ>
[58034.626924] scheduler_ipi+0xa9/0x130
[58034.627400] smp_reschedule_interrupt+0x39/0xe0
[58034.627925] reschedule_interrupt+0xf/0x20
[58034.628398] </IRQ>
[58034.628659] RIP: 0010:rcu_idle_exit+0x40/0x70
[58034.629181] Code: fa 66 0f 1f 44 00 00 48 c7 c3 80 a6 01 00 65 48 03 1d ec 4b 50 4e 48 8b 03 48 85 c0 74 16 48 83 c0 01 48 89 03 4c 89 e7 57 9d <0f> 1f 44 00 00 5b 41 5c 5d c3 e8 a1 c6 ff ff 48 b8 00 00 00 0>
[58034.631577] RSP: 0018:ffffffffb2e03e48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[58034.632566] RAX: 4000000000000000 RBX: ffff94327f81a680 RCX: ffff94327f81a680
[58034.633557] RDX: 0000000000000000 RSI: ffff94327f81a680 RDI: 0000000000000202
[58034.634416] RBP: ffffffffb2e03e58 R08: 000000000001cc00 R09: 0000000000000001
[58034.635292] R10: ffffffffb2e03d98 R11: 0000000000000000 R12: 0000000000000202
[58034.636107] R13: 0000000000000000 R14: 0000000000000000 R15: 000000007e369c93
[58034.637052] do_idle+0x13f/0x280
[58034.637453] cpu_startup_entry+0x73/0x80
[58034.638067] rest_init+0xae/0xb0
[58034.638506] start_kernel+0x539/0x55a
[58034.638938] x86_64_start_reservations+0x24/0x26
[58034.639470] x86_64_start_kernel+0x74/0x77
[58034.639958] secondary_startup_64+0xa5/0xb0
[58034.640441] Modules linked in: cfg80211 veth xt_nat xt_tcpudp ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fi>

None of the server has experienced any kernel panic itself.

After analysing many of the crashes and seeing many of them handling interrupts, we decided to reboot the instances with the noapic parameter. No kernel panic has been seen since but this is just a workaround until a solution is found. AMD has been informed of the issue.

TIA,

...Louis (aka Caribou)

New value:

We have been seeing many kernel panic in QEMU instances on newly deployed servers all running on the EPYC architecture. Many of the KP occur early after the start of the QEMU process or within a few hours.

All the servers are running an up to date Bionic. After the first few panics on 4.15 kernels, we upgraded to 4.18 and still had panics.

The only thing running on the underlying server is QEMU processes.

Here is a typical backtrace of a kernel panic :

[58034.598930] BUG: unable to handle kernel paging request at ffff943276c49f64
[58034.612039] PGD 8363067 P4D 8363067 PUD 8367067 PMD 759ac063 PTE 8000000076c49163
[58034.612992] Oops: 0000 [#1] SMP NOPTI
[58034.613462] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-20-generic #21~18.04.1-Ubuntu
[58034.614685] Hardware name: Scaleway Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
[58034.616385] Code: 4b a1 93 00 41 83 a4 24 98 09 00 00 03 4c 89 e7 48 89 45 d8 c7 45 e0 00 00 00 00 e8 ef ca ff ff 48 8d 73 d0 48 83 fe d0 74 2c <0f> b6 96 64 08 00 00 48 8b 46 30 48 8d 4d d8 4c 89 e7 48 8d 5>
[58034.618537] RSP: 0018:ffff94327f803f90 EFLAGS: 00010087
[58034.619134] RAX: 000034c83ba48a83 RBX: ffff943276c49730 RCX: 00000000ffffffff
[58034.619992] RDX: 0000000000002a12 RSI: ffff943276c49700 RDI: ffff94327fe3e000
[58034.620913] RBP: ffff94327f803fb8 R08: 00003753c7b87c93 R09: 0000000000000000
[58034.621835] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94327f822c00
[58034.622723] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[58034.623539] FS: 0000000000000000(0000) GS:ffff94327f800000(0000) knlGS:0000000000000000
[58034.624513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58034.625214] CR2: ffff943276c49f64 CR3: 0000000076544000 CR4: 00000000003406f0
[58034.626268] Call Trace:
[58034.626605] <IRQ>
[58034.626924] scheduler_ipi+0xa9/0x130
[58034.627400] smp_reschedule_interrupt+0x39/0xe0
[58034.627925] reschedule_interrupt+0xf/0x20
[58034.628398] </IRQ>
[58034.628659] RIP: 0010:rcu_idle_exit+0x40/0x70
[58034.629181] Code: fa 66 0f 1f 44 00 00 48 c7 c3 80 a6 01 00 65 48 03 1d ec 4b 50 4e 48 8b 03 48 85 c0 74 16 48 83 c0 01 48 89 03 4c 89 e7 57 9d <0f> 1f 44 00 00 5b 41 5c 5d c3 e8 a1 c6 ff ff 48 b8 00 00 00 0>
[58034.631577] RSP: 0018:ffffffffb2e03e48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[58034.632566] RAX: 4000000000000000 RBX: ffff94327f81a680 RCX: ffff94327f81a680
[58034.633557] RDX: 0000000000000000 RSI: ffff94327f81a680 RDI: 0000000000000202
[58034.634416] RBP: ffffffffb2e03e58 R08: 000000000001cc00 R09: 0000000000000001
[58034.635292] R10: ffffffffb2e03d98 R11: 0000000000000000 R12: 0000000000000202
[58034.636107] R13: 0000000000000000 R14: 0000000000000000 R15: 000000007e369c93
[58034.637052] do_idle+0x13f/0x280
[58034.637453] cpu_startup_entry+0x73/0x80
[58034.638067] rest_init+0xae/0xb0
[58034.638506] start_kernel+0x539/0x55a
[58034.638938] x86_64_start_reservations+0x24/0x26
[58034.639470] x86_64_start_kernel+0x74/0x77
[58034.639958] secondary_startup_64+0xa5/0xb0
[58034.640441] Modules linked in: cfg80211 veth xt_nat xt_tcpudp ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fi>

None of the server has experienced any kernel panic itself.

After analysing many of the crashes and seeing many of them handling interrupts, we decided to reboot the instances with the noapic parameter. No kernel panic has been seen since but this is just a workaround until a solution is found. AMD has been informed of the issue.

TIA,

...Louis (aka Caribou)
2019-06-28 07:28:00 Louis Bouchard linux (Ubuntu): status Incomplete Opinion
2019-06-28 07:28:09 Louis Bouchard linux (Ubuntu): status Opinion Confirmed
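
The description notes that many of the crashes were found while handling interrupts. When comparing many such panics, it can help to reduce each oops to its faulting symbol and the list of `symbol+0xoffset/0xsize` frames. The following is only a minimal sketch of that idea, assuming the oops text has been captured as a plain string; the names `summarize_oops`, `FRAME_RE` and `RIP_RE` are hypothetical helpers written for this illustration and are not part of any kernel or Ubuntu tooling.

```python
import re

# Matches kernel symbol frames such as "scheduler_ipi+0xa9/0x130".
FRAME_RE = re.compile(r"\b([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches the instruction pointer line, e.g. "RIP: 0010:sched_ttwu_pending+0x6b/0xe0".
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(text):
    """Return (first RIP symbol or None, all frame symbols in order of appearance).

    The frame list includes the RIP symbol itself, because the RIP line uses
    the same symbol+offset/size notation as the call trace entries.
    """
    rip = RIP_RE.search(text)
    frames = [m.group(1) for m in FRAME_RE.finditer(text)]
    return (rip.group(1) if rip else None, frames)


# A few lines copied from the backtrace in the description above.
sample = (
    "[58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0 "
    "[58034.626924] scheduler_ipi+0xa9/0x130 "
    "[58034.627400] smp_reschedule_interrupt+0x39/0xe0 "
    "[58034.627925] reschedule_interrupt+0xf/0x20"
)
faulting, frames = summarize_oops(sample)
print(faulting)
print(frames)
```

Grouping crashes by the returned faulting symbol is one possible way to check whether they cluster around the same interrupt path; the sketch does not attempt to interpret registers or the `Code:` bytes.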