Ubuntu 16.04 4.8.0 kernel crashing on EC2 instances at boot

Bug #1668297 reported by James Ravn
This bug affects 4 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

After switching to the linux-hwe kernel on 16.04.2, we started observing kernel crashes at boot on some of our EC2 instances (so far, only on the newer M4 instance types). The instance becomes unresponsive when this happens. It looks like a RAPL issue: the fault is in rapl_cpu_online in the intel_rapl_perf module. We have blacklisted intel_rapl and intel_rapl_perf for now. Here is the trace:

general protection fault: 0000 [#1] SMP
Modules linked in: intel_rapl_perf(+) i2c_piix4 input_leds parport_pc serio_raw mac_hid parport sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus crct10dif_pclmul ttm crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt aesni_intel fb_sys_fops aes_x86_64 lrw glue_helper ablk_helper cryptd drm ixgbevf psmouse pata_acpi floppy fjes
CPU: 2 PID: 20 Comm: cpuhp/2 Not tainted 4.8.0-39-generic #42~16.04.1-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
task: ffff8bee465a1d80 task.stack: ffff8bee465ac000
RIP: 0010:[<ffffffffc0728793>] [<ffffffffc0728793>] rapl_cpu_online+0x63/0x71 [intel_rapl_perf]
RSP: 0000:ffff8bee465afe18 EFLAGS: 00010212
RAX: 0000000000000200 RBX: ffffffffc0728730 RCX: 0000000000000000
RDX: 0000000000000200 RSI: 0000000000000200 RDI: 0000000000000200
RBP: ffff8bee465afe30 R08: 0000000000000000 R09: 0000000000000001
R10: ffff8bee45ec2600 R11: ffff8bec41fbce00 R12: 6401b4899ff8202c
R13: 0000000000000002 R14: ffff8bee4fc0daa0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8bee4fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563b0d9f9dc8 CR3: 000000020608e000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffffffffc0728730 0000000000000002 000000000000004e ffff8bee465afe70
 ffffffff95883d86 ffff8bee4fc0daa0 ffff8bee4fc0daa0 0000000000000002
 ffffffff9663df60 ffff8bee464b85f0 ffff8bec47c19300 ffff8bee465afe90
Call Trace:
 [<ffffffffc0728730>] ? rapl_cpu_prepare+0x100/0x100 [intel_rapl_perf]
 [<ffffffff95883d86>] cpuhp_invoke_callback+0x46/0x110
 [<ffffffff958840d1>] cpuhp_thread_fun+0x41/0x100
 [<ffffffff958a7405>] smpboot_thread_fn+0x105/0x160
 [<ffffffff958a7300>] ? sort_range+0x30/0x30
 [<ffffffff958a3fa8>] kthread+0xd8/0xf0
 [<ffffffff9609ba1f>] ret_from_fork+0x1f/0x40
 [<ffffffff958a3ed0>] ? kthread_create_on_node+0x1e0/0x1e0
Code: 23 00 00 4c 8b a4 ca 10 01 00 00 48 c7 c2 80 a0 00 00 48 01 c2 e8 6e 56 50 d5 3b 05 fc 67 03 d6 7c 0e f0 4c 0f ab 2d 4d 23 00 00 <45> 89 6c 24 08 5b 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 55 48
RIP [<ffffffffc0728793>] rapl_cpu_online+0x63/0x71 [intel_rapl_perf]
 RSP <ffff8bee465afe18>
---[ end trace cd71880c1b07dfa5 ]---
BUG: unable to handle kernel paging request at 000000007957b4e8
IP: [<ffffffff958c6d7b>] __wake_up_common+0x2b/0x90
PGD 0
Oops: 0000 [#2] SMP
Modules linked in: intel_rapl_perf(+) i2c_piix4 input_leds parport_pc serio_raw mac_hid parport sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus crct10dif_pclmul ttm crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt aesni_intel fb_sys_fops aes_x86_64 lrw glue_helper ablk_helper cryptd drm ixgbevf psmouse pata_acpi floppy fjes
CPU: 2 PID: 20 Comm: cpuhp/2 Tainted: G D 4.8.0-39-generic #42~16.04.1-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
task: ffff8bee465a1d80 task.stack: ffff8bee465ac000
RIP: 0010:[<ffffffff958c6d7b>] [<ffffffff958c6d7b>] __wake_up_common+0x2b/0x90
RSP: 0000:ffff8bee465afe38 EFLAGS: 00010086
RAX: 0000000000000282 RBX: ffff8bee465aff10 RCX: 0000000000000000
RDX: 000000007957b4e8 RSI: 0000000000000003 RDI: ffff8bee465aff10
RBP: ffff8bee465afe70 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8bee45ec2600 R11: 000000000000022f R12: ffff8bee465aff18
R13: 0000000000000282 R14: 0000000000000000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8bee4fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000007957b4e8 CR3: 000000005fc06000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 00000001465a1d80 0000000000000000 ffff8bee465aff10 ffff8bee465aff08
 0000000000000282 0000000000000000 0000000000000000 ffff8bee465afe80
 ffffffff958c6e43 ffff8bee465afea8 ffffffff958c78c7 ffff8bee465a24d8
Call Trace:
 [<ffffffff958c6e43>] __wake_up_locked+0x13/0x20
 [<ffffffff958c78c7>] complete+0x37/0x50
 [<ffffffff9588052f>] mm_release+0xbf/0x140
 [<ffffffff95886e7d>] do_exit+0x14d/0xb50
 [<ffffffff9609cf97>] rewind_stack_do_exit+0x17/0x20
 [<ffffffff958a3ed0>] ? kthread_create_on_node+0x1e0/0x1e0
Code: 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 4c 8d 67 08 53 41 89 f7 48 83 ec 10 89 55 cc 48 8b 57 08 4c 89 45 d0 49 39 d4 <48> 8b 32 74 45 41 89 ce 48 8d 42 e8 4c 8d 6e e8 eb 03 49 89 d5
RIP [<ffffffff958c6d7b>] __wake_up_common+0x2b/0x90
 RSP <ffff8bee465afe38>
CR2: 000000007957b4e8
---[ end trace cd71880c1b07dfa6 ]---
Fixing recursive fault but reboot is needed!

# uname -a
Linux ip-10-50-244-48 4.8.0-39-generic #42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04
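
The workaround mentioned in the description, blacklisting intel_rapl and intel_rapl_perf, can be sketched as a modprobe.d fragment. This is a minimal sketch, not necessarily the exact setup used on the affected instances: the function name write_rapl_blacklist and the file name blacklist-rapl.conf are assumptions made for illustration.

```shell
# Write a modprobe blacklist fragment for the two RAPL modules.
# Assumption: the file name blacklist-rapl.conf is arbitrary; any *.conf
# file under /etc/modprobe.d is read by modprobe.
write_rapl_blacklist() {
    # Target directory; defaults to the system modprobe.d directory.
    conf_dir="${1:-/etc/modprobe.d}"
    # printf repeats the format once per argument, giving one line per module.
    printf 'blacklist %s\n' intel_rapl intel_rapl_perf \
        > "$conf_dir/blacklist-rapl.conf"
}
```

Run as root, this would be followed by update-initramfs -u (in case the modules are loaded from the initramfs) and a reboot. Passing a directory argument writes the fragment somewhere else, which is convenient when baking it into an image.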

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe (Ubuntu):
status: New → Confirmed
Scott Emmons (lscotte) wrote :

We are seeing this same issue in AWS with 16.04 Xenial and 4.8.0-39-generic on g2.8xlarge instances:

[ 37.191160] BUG: unable to handle kernel paging request at 0000220900000013
[ 37.192105] IP: [<ffffffffc0705793>] rapl_cpu_online+0x63/0x71 [intel_rapl_perf]
[ 37.192105] PGD 0
[ 37.192105] Oops: 0002 [#1] SMP
[ 37.192105] Modules linked in: intel_rapl_perf(+) ib_iser sunrpc rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) cirrus ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea aesni_intel aes_x86_64 sysfillrect lrw sysimgblt glue_helper fb_sys_fops ablk_helper nvidia(POE) cryptd drm psmouse pata_acpi floppy fjes
[ 37.220096] CPU: 16 PID: 104 Comm: cpuhp/16 Tainted: P OE 4.8.0-39-generic #42~16.04.1-Ubuntu
[ 37.220096] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 37.220096] task: ffff9e6f5c8aac40 task.stack: ffff9e6f5c8e4000
[ 37.220096] RIP: 0010:[<ffffffffc0705793>] [<ffffffffc0705793>] rapl_cpu_online+0x63/0x71 [intel_rapl_perf]
[ 37.220096] RSP: 0018:ffff9e6f5c8e7e18 EFLAGS: 00010202
[ 37.220096] RAX: 0000000000000200 RBX: ffffffffc0705730 RCX: 0000000000000000
[ 37.220096] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 0000000000000200
[ 37.220096] RBP: ffff9e6f5c8e7e30 R08: 0000000000000000 R09: 0000000000000001
[ 37.220096] R10: 0000000000000000 R11: 0000000000000001 R12: 000022090000000b
[ 37.220096] R13: 0000000000000010 R14: ffff9e6f6040daa0 R15: 0000000000000000
[ 37.220096] FS: 0000000000000000(0000) GS:ffff9e6f60400000(0000) knlGS:0000000000000000
[ 37.220096] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.220096] CR2: 0000220900000013 CR3: 000000078f468000 CR4: 00000000001406e0
[ 37.220096] Stack:
[ 37.220096] ffffffffc0705730 0000000000000010 000000000000004e ffff9e6f5c8e7e70
[ 37.220096] ffffffff8ea83d86 ffff9e6f6040daa0 ffff9e6f6040daa0 0000000000000010
[ 37.220096] ffffffff8f83df60 ffff9e6f5d2a40f0 ffff9e67d5019300 ffff9e6f5c8e7e90
[ 37.220096] Call Trace:
[ 37.220096] [<ffffffffc0705730>] ? rapl_cpu_prepare+0x100/0x100 [intel_rapl_perf]
[ 37.220096] [<ffffffff8ea83d86>] cpuhp_invoke_callback+0x46/0x110
[ 37.220096] [<ffffffff8ea840d1>] cpuhp_thread_fun+0x41/0x100
[ 37.220096] [<ffffffff8eaa7405>] smpboot_thread_fn+0x105/0x160
[ 37.220096] [<ffffffff8eaa7300>] ? sort_range+0x30/0x30
[ 37.220096] [<ffffffff8eaa3fa8>] kthread+0xd8/0xf0
[ 37.220096] [<ffffffff8f29ba1f>] ret_from_fork+0x1f/0x40
[ 37.220096] [<ffffffff8eaa3ed0>] ? kthread_create_on_node+0x1e0/0x1e0
[ 37.220096] Code: 23 00 00 4c 8b a4 ca 10 01 00 00 48 c7 c2 80 a0 00 00 48 01 c2 e8 6e 86 72 ce 3b 05 fc 97 25 cf 7c 0e f0 4c 0f ab 2d 4d 23 00 00 <45> 89 6c 24 08 5b 31 c0 41 5c 41 5d 5d c3 0f 1f 44 00 00 55 48
[ 37.220096] RIP [<ffffffffc0705793>] rapl_cpu_online+0x63/0x71 [intel_rapl_perf]
[ 37.220096] RSP <ffff9e6f5c8e7e18>
[ 37.220096] CR2: 0000220900000013
[ 37.220096] ---[ end trace eeeab5be1b...
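
Both traces in this report fault at the same site, rapl_cpu_online+0x63/0x71 in intel_rapl_perf. To check saved console logs from other instances for the same signature, something like the following may help. It is a rough sketch: the function name oops_fault_sites is hypothetical, and it assumes the oops format shown in this report, where the summary line reads "RIP [<address>] symbol+offset/length [module]".

```shell
# Print the faulting symbol (and module, if any) from each oops summary line.
# Reads a console log on stdin. The register-dump line "RIP: 0010:[<...>]"
# is skipped because it has a colon directly after "RIP".
oops_fault_sites() {
    sed -n 's/^.*RIP \[<[0-9a-f]*>\] \([^ ]*\)\( \[[^]]*\]\)\{0,1\}$/\1\2/p'
}

# Usage (console.log is a placeholder path):
#   oops_fault_sites < console.log | sort | uniq -c
```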


affects: linux-hwe (Ubuntu) → linux (Ubuntu)