[trusty] 3.13.0-167-generic repeatedly trips kernel paging request BUG:

Bug #1823216 reported by Liz Fong-Jones
This bug affects 1 person
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid   Undecided    Unassigned
Trusty           Invalid   Undecided    Unassigned

Bug Description

While diagnosing a kernel "BUG: Bad page map in process retriever" error (retriever is the name of our serving binary) that was crashing the serving binary, we upgraded from 3.13.0-121-generic to 3.13.0-167-generic. After the upgrade we instead started tripping a different bug, and more regularly; it hard-locks the system when we attempt to stop and restart the affected process.

We are running Trusty on AWS in i3.xlarge instances.

BUG: unable to handle kernel paging request at ffffeafffffffff0
[89149.537277] IP: [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[89149.543462] PGD 0
[89149.545919] Oops: 0002 [#1] SMP
[89149.549887] Modules linked in: 8021q garp stp mrp llc dm_crypt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel serio_raw isofs aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse nvme ena floppy
[89149.574925] CPU: 3 PID: 21681 Comm: retriever Not tainted 3.13.0-167-generic #217-Ubuntu
[89149.583168] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[89149.589649] task: ffff8807f2f7e000 ti: ffff88015b674000 task.ti: ffff88015b674000
[89149.597338] RIP: 0010:[<ffffffff8174416e>] [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[89149.605845] RSP: 0000:ffff88015b675d88 EFLAGS: 00010206
[89149.611431] RAX: 0000000000020000 RBX: 000000c0dee00020 RCX: f000c00000000f53
[89149.618873] RDX: ffff880000000000 RSI: 000000ffffffffc0 RDI: ffffeafffffffff0
[89149.626336] RBP: ffff88015b675d88 R08: 00003fffffe00000 R09: 00000000000000a9
[89149.633705] R10: 0000000000000000 R11: f000ff53f000ff53 R12: ffff880037471900
[89149.641019] R13: ffff8807f49c57b8 R14: ffff8807f2eb2680 R15: ffffeafffffffff0
[89149.648287] FS: 00007f92262dd700(0000) GS:ffff88081fc60000(0000) knlGS:0000000000000000
[89149.657764] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[89149.664294] CR2: ffffeafffffffff0 CR3: 00000007f4380000 CR4: 0000000000160670
[89149.672158] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[89149.680246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[89149.688126] Stack:
[89149.690903] ffff88015b675e20 ffffffff811831fe 8000000001e85225 ffff8807f2eb2680
[89149.700310] ffff88015b675f20 ffff88015b675e10 0000000000000007 f000c00000000f53
[89149.709739] 00000000000000a9 f000ff53f000ff53 ffff880000000000 8000000300000000
[89149.719331] Call Trace:
[89149.722581] [<ffffffff811831fe>] handle_mm_fault+0x27e/0xfb0
[89149.729984] [<ffffffff81748743>] __do_page_fault+0x183/0x570
[89149.737506] [<ffffffff810e33ea>] ? do_futex+0x10a/0x760
[89149.744529] [<ffffffff810a7a05>] ? set_next_entity+0x95/0xb0
[89149.752059] [<ffffffff810a7a7f>] ? pick_next_task_fair+0x5f/0x1b0
[89149.760000] [<ffffffff810a4cc5>] ? sched_clock_cpu+0xb5/0x100
[89149.767489] [<ffffffff81748b4a>] do_page_fault+0x1a/0x70
[89149.774460] [<ffffffff81744c68>] page_fault+0x28/0x30
[89149.781134] Code: 00 00 55 48 89 e5 f0 81 07 00 00 10 00 48 89 f7 57 9d 0f 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7
[89149.820963] RIP [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[89149.828646] RSP <ffff88015b675d88>
[89149.833460] CR2: ffffeafffffffff0
[89149.843886] ---[ end trace 5e5ae13dbb1d0f6e ]---

BUG: unable to handle kernel paging request at ffffeafffffffff0
[66384.498399] IP: [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[66384.502611] PGD 0
[66384.504586] Oops: 0002 [#1] SMP
[66384.507578] Modules linked in: 8021q garp stp mrp llc dm_crypt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel serio_raw isofs aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse nvme ena floppy
[66384.525392] CPU: 1 PID: 21144 Comm: retriever Not tainted 3.13.0-167-generic #217-Ubuntu
[66384.531418] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[66384.535812] task: ffff88074a9a9800 ti: ffff8807f411e000 task.ti: ffff8807f411e000
[66384.542092] RIP: 0010:[<ffffffff8174416e>] [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[66384.548699] RSP: 0000:ffff8807f411fd88 EFLAGS: 00010206
[66384.552560] RAX: 0000000000020000 RBX: 000000c2b0c00000 RCX: f000c00000000f53
[66384.557499] RDX: ffff880000000000 RSI: 000000ffffffffc0 RDI: ffffeafffffffff0
[66384.562419] RBP: ffff8807f411fd88 R08: 00003fffffe00000 R09: 00000000000000a9
[66384.567450] R10: 0000000000000000 R11: f000ff53f000ff53 R12: ffff8807f4af9f00
[66384.572935] R13: ffff8807f2ed9c30 R14: ffff8807f3ee2680 R15: ffffeafffffffff0
[66384.578165] FS: 00007f9886ac4700(0000) GS:ffff88081fc20000(0000) knlGS:0000000000000000
[66384.584224] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66384.588318] CR2: ffffeafffffffff0 CR3: 000000079ec8e000 CR4: 0000000000160670
[66384.593312] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[66384.598189] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[66384.603115] Stack:
[66384.604990] ffff8807f411fe20 ffffffff811831fe 00000000f411fda8 0000000000008000
[66384.611706] ffff8807f411fe70 ffffffff81622244 ffff8807f411fdf8 f000c00000000f53
[66384.618451] ffffffff000000a9 f000ff53f000ff53 ffff880000000000 8000000100000000
[66384.625091] Call Trace:
[66384.627330] [<ffffffff811831fe>] handle_mm_fault+0x27e/0xfb0
[66384.631601] [<ffffffff81622244>] ? sock_aio_read.part.5+0x104/0x120
[66384.636281] [<ffffffff81748743>] __do_page_fault+0x183/0x570
[66384.640575] [<ffffffff81622281>] ? sock_aio_read+0x21/0x30
[66384.644745] [<ffffffff811c8730>] ? do_sync_read+0x60/0xa0
[66384.648891] [<ffffffff81748b4a>] do_page_fault+0x1a/0x70
[66384.652957] [<ffffffff81744c68>] page_fault+0x28/0x30
[66384.656849] Code: 00 00 55 48 89 e5 f0 81 07 00 00 10 00 48 89 f7 57 9d 0f 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7
[66384.678274] RIP [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50
[66384.682555] RSP <ffff8807f411fd88>
[66384.685382] CR2: ffffeafffffffff0
[66384.691222] ---[ end trace 78397061d8d25a28 ]---
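
Both traces report "Oops: 0002" with the same faulting address in CR2. As a reading aid, the sketch below decodes that error code using the architectural x86 page-fault error-code bits. The function name decode_oops and the wording of the bit descriptions are illustrative choices, not anything taken from the kernel source or from this report.

```python
import re

# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
# The descriptions are paraphrased; only the low five bits are handled here.
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops(line):
    """Return (code, descriptions) for a line such as 'Oops: 0002 [#1] SMP'."""
    match = re.search(r"Oops:\s+([0-9a-fA-F]{4})", line)
    if match is None:
        raise ValueError("no Oops error code found")
    code = int(match.group(1), 16)
    descriptions = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            descriptions.append(if_set)
        elif if_clear is not None:
            descriptions.append(if_clear)
    return code, descriptions


print(decode_oops("[89149.545919] Oops: 0002 [#1] SMP"))
```

Under this decoding, 0002 is a kernel-mode write to a page that is not present. That is consistent with the fault landing inside _raw_spin_lock, which writes to the lock word whose address is in RDI (equal to CR2 in both traces).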

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1823216

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Liz Fong-Jones (lizthegrey) wrote : Re: [Bug 1823216] Missing required logs.

The apport log is attached.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: ec2-images
no longer affects: linux-signed (Ubuntu)
Kai-Heng Feng (kaihengfeng) wrote :

Is there a reproducer?

Stefan Bader (smb)
Changed in linux (Ubuntu Trusty):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Liz Fong-Jones (lizthegrey) wrote :

Unfortunately, we have no externalized minimal test case; so far we've only reproduced this by running our data storage binary under production load.

Kai-Heng Feng (kaihengfeng) wrote :

Maybe give the v4.4-based HWE kernel a try.

Liz Fong-Jones (lizthegrey) wrote :

Yup, I'm happy to try the HWE kernel. I'll do that and report back.

Liz Fong-Jones (lizthegrey) wrote :

(Our other option is going straight to 18.04, but that's a much larger change to test.)

Liz Fong-Jones (lizthegrey) wrote :

Canarying both 4.4.0-1040-aws #43-Ubuntu and 4.4.0-144-generic #170~14.04.1-Ubuntu in parallel to see whether the bug trips in either of these. Should have results by Monday: it was tripping every ~24h with 3.13.0-167-generic, so if there are no failures by Monday I'll consider it resolved with the newer kernel train.
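
A canary check like this comes down to watching the kernel log for the two BUG signatures quoted in this report. The sketch below is one hypothetical way to do that. The function name find_kernel_bugs is made up for illustration, and how log lines are collected from the canary hosts (dmesg, syslog, or something else) is left as an assumption. It is not the tooling actually used here.

```python
# Signature strings copied from the messages quoted in this bug report.
SIGNATURES = (
    "BUG: unable to handle kernel paging request",
    "BUG: Bad page map",
)


def find_kernel_bugs(log_lines):
    """Return (signature, line) pairs for every log line matching a signature."""
    hits = []
    for line in log_lines:
        for signature in SIGNATURES:
            if signature in line:
                hits.append((signature, line.rstrip("\n")))
                break  # one hit per line is enough
    return hits


sample = [
    "BUG: unable to handle kernel paging request at ffffeafffffffff0",
    "[89149.537277] IP: [<ffffffff8174416e>] _raw_spin_lock+0xe/0x50",
    "BUG: Bad page map in process retriever",
]
for signature, line in find_kernel_bugs(sample):
    print(signature, "->", line)
```

An empty result over the whole canary window would match the "no failures" criterion above. Any hit means the bug tripped on that kernel.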

Liz Fong-Jones (lizthegrey) wrote :

No faults detected over the past week. Going to declare this fixed in the 4.4 kernel series.

Liz Fong-Jones (lizthegrey) wrote :

And Trusty is now unsupported, so going to close this as obsolete.

Changed in linux (Ubuntu Trusty):
status: Confirmed → Invalid