Bug #1733662 “System hang with Linux kernel due to mainline comm...” : Bugs : linux package : Ubuntu

Revision history for this message

Rod Smith (rodsmith) wrote on 2017-11-21:

#1

dmesg output with a 4.10 kernel (8.6 KiB, text/plain)
Dependencies.txt (2.3 KiB, text/plain; charset="utf-8")
JournalErrors.txt (3.1 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.0 KiB, text/plain; charset="utf-8")

tags:

added: hwcert-server

Rod Smith (rodsmith) wrote on 2017-11-21:

#2

dmesg output with a 4.13 kernel (10.7 KiB, text/plain)

Rod Smith (rodsmith) wrote on 2017-11-21:

#3

dmesg output on another system that hangs in a similar way (14.3 KiB, text/plain)

I've discovered what may be the same bug on another system -- feebas, a Cisco UCS C220 M4 (Intel Series v3), with the same CPU type (Intel Xeon E5-2640 v3). I'm attaching dmesg output from it, but on this particular run, the computer did not hang indefinitely, although it did become unresponsive for a few seconds.

Rod Smith (rodsmith) wrote on 2017-11-21:

#4

Another dmesg output from feebas (7.2 KiB, text/plain)

Here's the dmesg output from another run on feebas. In this case, the system has become unresponsive via SSH, although the console remains active.

Rod Smith (rodsmith) wrote on 2017-12-01:

#5

dmesg output from three runs with the 4.15.0-041500rc1 kernel (6.5 KiB, application/x-tar)

I've upgraded to the latest development kernel, from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/, and re-tested. The details of the problem have changed (they were never 100% consistent), but the problem definitely still exists. I'm attaching dmesg output from three runs:

* run1.txt -- In this run, the cpu_offlining script successfully shut
  down all CPU nodes (except node 0, of course), but when bringing
  them up again, the system segfaulted after bringing up several
  nodes. Thereafter, any remotely substantive command (top or
  shutdown, for instance) hung, although bash remained responsive
  and I could take file listings with ls.
* run2.txt -- In this run, the cpu_offlining script segfaulted
  when taking CPU nodes offline. The system then became unreliable
  in the same way as with run 1.
* run3.txt -- In this run, the script seemed to complete successfully,
  but the dmesg output includes errors associated with bringing up
  several nodes. The system SEEMED TO operate normally thereafter,
  but my testing was limited.
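
For readers unfamiliar with this kind of test, the following is a minimal sketch of what a CPU offlining cycle does. It is not the cpu_offlining script itself, whose source is not shown in this report. It assumes the standard Linux sysfs interface, in which writing 0 or 1 to /sys/devices/system/cpu/cpuN/online takes a CPU down or brings it back; the function names and the `sysfs_root` parameter are hypothetical and exist only for illustration. CPUs without an `online` file (typically CPU 0 on x86) are skipped.

```python
from pathlib import Path

DEFAULT_ROOT = "/sys/devices/system/cpu"  # standard sysfs location on Linux


def hotpluggable_cpus(sysfs_root=DEFAULT_ROOT):
    """Return the sorted CPU numbers that expose an 'online' control file."""
    cpus = []
    for entry in Path(sysfs_root).glob("cpu[0-9]*"):
        if (entry / "online").is_file():
            cpus.append(int(entry.name[3:]))  # "cpu12" -> 12
    return sorted(cpus)


def set_online(cpu, online, sysfs_root=DEFAULT_ROOT):
    """Write 1 or 0 to cpuN/online and report whether the state took effect."""
    wanted = "1" if online else "0"
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    control.write_text(wanted)
    return control.read_text().strip() == wanted


def cycle_cpus(sysfs_root=DEFAULT_ROOT):
    """Offline every hotpluggable CPU, then bring each back online.

    Returns a list of (cpu, phase) tuples for transitions that did not stick.
    """
    failures = []
    cpus = hotpluggable_cpus(sysfs_root)
    for cpu in cpus:
        if not set_online(cpu, False, sysfs_root):
            failures.append((cpu, "offline"))
    for cpu in cpus:
        if not set_online(cpu, True, sysfs_root):
            failures.append((cpu, "online"))
    return failures
```

Calling `cycle_cpus()` against the real sysfs tree requires root and, on an affected kernel, may reproduce the hang described here, so it is best tried only on a test machine.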

Rod Smith (rodsmith) wrote on 2017-12-01:

#6

dmesg outputs from several kernels (100.4 KiB, application/x-tar)

Here are some more test runs on boldore, using different kernels, mostly from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D. The attachment is a tarball containing dmesg output associated with runs of the cpu_offlining script. An overview:

* 4.10.0-38-generic: No hang or misbehavior; verbose dmesg output.
* 4.11.0-041100-generic: No hang or misbehavior; verbose dmesg output.
* 4.12.0-041200-generic: No hang or misbehavior; dmesg output is even
  more verbose and includes multiple "error -22" messages.
* 4.13.0-041300-generic: Similar to the above, but dmesg errors are
  now "error -19".
* 4.13.16-041316-generic: No system hang or misbehavior; dmesg
  output has no errors and is much shorter.
* 4.14.0-041400-generic: Segfault and limited functionality
  thereafter; dmesg has multiple "error -19" messages and multiple
  general protection fault dumps.
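
As a reading aid for the "error -22" and "error -19" messages listed above: kernel functions conventionally return negated errno values, and on Linux 22 is EINVAL and 19 is ENODEV. The short sketch below looks the names up with Python's standard `errno` module; the helper name `describe_kernel_error` is made up for this example, and the message text comes from the local C library, so it can differ between platforms.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -19 to (name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


for code in (-22, -19):
    name, message = describe_kernel_error(code)
    print(f"error {code}: {name} ({message})")
```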

Joseph Salisbury (jsalisbury) on 2017-12-19

Changed in linux (Ubuntu):
importance:	Undecided → High

Joseph Salisbury (jsalisbury) wrote on 2017-12-19:

#7

When you have a chance, could you also test the current mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/

This will tell us whether we should perform a regular bisect to find the offending commit or, if the problem is fixed in mainline, a "reverse" bisect to find the commit that fixes it.

tags:

added: kernel-da-key performing-bisect

Joseph Salisbury (jsalisbury) wrote on 2017-12-19:

#8

I see you already tested 4.15-rc1, but it's worthwhile to also test -rc4.

Changed in linux (Ubuntu Artful):
status:	New → Triaged
Changed in linux (Ubuntu Bionic):
status:	New → Triaged
Changed in linux (Ubuntu Artful):
importance:	Undecided → High
assignee:	nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee:	nobody → Joseph Salisbury (jsalisbury)

Rod Smith (rodsmith) wrote on 2017-12-19:

#9

Joseph, I've just tested 4.15-rc4. The script crashed when bringing CPU 9 back up, and the system became responsive to only the simplest commands, accompanied by this output from dmesg:

[  166.722460] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  166.722540] RIP: 0010:__kmalloc_track_caller+0xc5/0x210
[  166.722578] RSP: 0000:ffffb75e8c7cbb08 EFLAGS: 00010206
[  166.722615] RAX: 0000000000000000 RBX: 43ea0882f873c0e8 RCX: 00000000000001bf
[  166.722663] RDX: 00000000000001be RSI: 0000000000000000 RDI: 0000000000021040
[  166.722711] RBP: ffffb75e8c7cbb40 R08: ffff9cc35d341eaa R09: ffff9ca3ff807c00
[  166.722757] R10: ffffb75e8c7cbd08 R11: bc159441a547de42 R12: ffff9cc35d341eaa
[  166.722805] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff9ca3ff807c00
[  166.722852] FS:  0000000000000000(0000) GS:ffff9cc3ff240000(0000) knlGS:0000000000000000
[  166.722905] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  166.722945] CR2: 0000000000000000 CR3: 0000001be7e09001 CR4: 00000000001606e0
[  166.722992] Call Trace:
[  166.723020]  ? idr_alloc_cmn+0x97/0xd0
[  166.723051]  ? kstrdup_const+0x23/0x30
[  166.723081]  kstrdup+0x31/0x60
[  166.723107]  kstrdup_const+0x23/0x30
[  166.723137]  __kernfs_new_node+0x2c/0x120
[  166.723168]  kernfs_new_node+0x28/0x50
[  166.723197]  kernfs_create_dir_ns+0x34/0x90
[  166.723229]  sysfs_create_dir_ns+0x40/0x90
[  166.723261]  kobject_add_internal+0xac/0x2b0
[  166.723294]  kobject_add+0x71/0xd0
[  166.723323]  ? device_private_init+0x23/0x70
[  166.723356]  device_add+0x12c/0x680
[  166.723385]  cpu_device_create+0xe1/0x100
[  166.723418]  ? __slab_alloc+0x20/0x40
[  166.723449]  ? _cond_resched+0x19/0x40
[  166.723481]  cacheinfo_cpu_online+0x29a/0x3f0
[  166.723515]  ? get_cpu_cacheinfo+0x50/0x50
[  166.723549]  cpuhp_invoke_callback+0x9b/0x550
[  166.723587]  ? padata_replace+0xf0/0xf0
[  166.725151]  cpuhp_thread_fun+0xc4/0x150
[  166.726682]  smpboot_thread_fn+0xec/0x160
[  166.728221]  kthread+0x11e/0x140
[  166.729701]  ? sort_range+0x30/0x30
[  166.731145]  ? kthread_create_worker_on_cpu+0x70/0x70
[  166.732551]  ret_from_fork+0x1f/0x30
[  166.733906] Code: 4d 01 e0 4d 8b 18 4d 33 99 40 01 00 00 4c 89 c3 4c 31 db 65 48 0f c7 0f 0f 94 c0 84 c0 74 ac 4d 39 d8 74 14 49 63 41 20 48 01 c3 <48> 33 1b 49 33 99 40 01 00 00 0f 18 0b 41 f7 c5 00 80 00 00 0f 
[  166.736776] RIP: __kmalloc_track_caller+0xc5/0x210 RSP: ffffb75e8c7cbb08
[  166.738188] ---[ end trace 39ce10746b0f4324 ]---

If you want direct access to the affected hardware, that can be arranged. (If you've already got access to the certification network in 1SS, the affected system on which I've been doing most of the testing is boldore.) I'm also happy to run tests using test kernels that you give me.
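
When comparing a trace like the one above against traces from other kernels, it can help to reduce each one to its list of function names. The sketch below is one way to do that, not a tool used in this report. It relies on two facts about the format: frames are printed as `symbol+0xoffset/0xsize`, and frames prefixed with `?` are ones the kernel found on the stack but could not confirm as part of the call chain, so they are dropped. The regular expression and the name `reliable_frames` are this example's own.

```python
import re

# One stack frame: "[  166.723081]  kstrdup+0x31/0x60", optionally marked "? ".
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$"
)


def reliable_frames(dmesg_text):
    """Return function names from the first call trace, skipping '?' frames."""
    frames = []
    in_trace = False
    for raw in dmesg_text.splitlines():
        line = raw.strip()
        if not in_trace:
            in_trace = line.endswith("Call Trace:")
            continue
        match = FRAME.match(line)
        if match is None:
            break  # first non-frame line (for example "Code: ...") ends the trace
        if match.group(1) is None:
            frames.append(match.group(2))
    return frames
```

Two traces whose `reliable_frames` lists are equal went through the same call path, even if the offsets or the unconfirmed `?` entries differ.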

Joseph Salisbury (jsalisbury) wrote on 2017-12-19:

#10

Thanks for testing mainline. The stack trace looks the same as prior kernels. We should perform a regular kernel bisect to identify the commit that introduced this regression.

Per comment #6, it sounds like none of the upstream kernels exhibits this bug. Is that correct?

If that is the case, it may be due to an Ubuntu SAUCE patch. Can you give an early 17.10 kernel a test:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+build/13358561

Rod Smith (rodsmith) wrote on 2017-12-19:

#11

The upstream 4.14.0 kernel DOES segfault, but none of the 4.13-series kernels does. Some of the 4.13-series kernels do have "error -19" or "error -22" messages in their dmesg output, though.

I've tried the kernel at https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+build/13358561, and it ran through the cpu_offlining script five times without error, so I think it's OK.

Joseph Salisbury (jsalisbury) wrote on 2017-12-19:

#12

Thanks for testing. So we now know that 4.13.0-16 has the bug but 4.13.0-10 does not.

Can you next try 4.13.0-14:
https://launchpad.net/ubuntu/+source/linux/4.13.0-14.15/+build/13541235

Rod Smith (rodsmith) wrote on 2017-12-20:

#13

4.13.0-14 failed when offlining CPU 9:

[  104.500965] ------------[ cut here ]------------
[  104.500968] kernel BUG at /build/linux-0p6sBa/linux-4.13.0/mm/slub.c:3878!
[  104.501256] invalid opcode: 0000 [#1] SMP
[  104.501422] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic igb crct10dif_pclmul crc32_pclmul ghash_clmulni_intel usbhid pcbc hid aesni_intel dca aes_x86_64 crypto_simd ptp glue_helper cryptd ahci pps_core i2c_algo_bit libahci megaraid_sas
[  104.503659] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-14-generic #15-Ubuntu
[  104.504019] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  104.504537] task: ffff9a9838b6ae80 task.stack: ffffb7e90c7b8000
[  104.504827] RIP: 0010:kfree+0x11c/0x160
[  104.505003] RSP: 0018:ffffb7e90c7bbd60 EFLAGS: 00010246
[  104.505311] RAX: ffffd9d77eff0020 RBX: ffff9a9800000000 RCX: 00000001802a001a
[  104.505617] RDX: 0000000000000000 RSI: ffffd9d77fe02400 RDI: 000065a740000000
[  104.505938] RBP: ffffb7e90c7bbd78 R08: ffff9a9838091ec0 R09: 00000001802a001a
[  104.506255] R10: ffffd9d77f000000 R11: 0000000000000000 R12: ffffffff87798960
[  104.506763] R13: ffffffff869dd4f0 R14: 0000000000000009 R15: 0000000000000001
[  104.507216] FS:  0000000000000000(0000) GS:ffff9a983f240000(0000) knlGS:0000000000000000
[  104.507638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.507884] CR2: 00007ffdd8f1bff8 CR3: 00000016ff209000 CR4: 00000000001406e0
[  104.508188] Call Trace:
[  104.508311]  kfree_const+0x20/0x30
[  104.508468]  kobject_put+0x91/0x1a0
[  104.508626]  device_unregister+0x28/0x60
[  104.508796]  cpu_cache_sysfs_exit+0x5a/0xc0
[  104.508971]  ? free_cache_attributes.part.7+0x110/0x110
[  104.509201]  cacheinfo_cpu_pre_down+0x48/0x50
[  104.509401]  cpuhp_invoke_callback+0x84/0x3b0
[  104.509616]  cpuhp_down_callbacks+0x42/0x80
[  104.509812]  cpuhp_thread_fun+0x88/0xe0
[  104.509997]  smpboot_thread_fn+0xec/0x160
[  104.510182]  kthread+0x125/0x140
[  104.510322]  ? sort_range+0x30/0x30
[  104.510491]  ? kthread_create_on_node+0x70/0x70
[  104.510706]  ret_from_fork+0x25/0x30
[  104.510870] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  104.511761] RIP: kfree+0x11c/0x160 RSP: ffffb7e90c7bbd60
[  104.512003] ---[ end trace 2290fcc444ad32ff ]---

Bash remained active, but I couldn't issue any significant commands.

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#14

Can you next try 4.13.0-12:
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13498518

Rod Smith (rodsmith) wrote on 2017-12-20:

#15

4.13.0-12 seems to be OK; I ran it seven or eight times without a failure.
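
Because the failure is intermittent, a handful of clean runs gives probabilistic rather than absolute confidence that a kernel is good. The sketch below works through that arithmetic under two assumptions that this report does not establish: that runs are independent, and that a buggy kernel fails each run with some fixed probability. The 0.5 rate in the example is purely illustrative, not a measured value, and both function names are hypothetical.

```python
import math


def chance_of_false_good(per_run_failure_rate, clean_runs):
    """Probability that a buggy kernel survives `clean_runs` runs in a row."""
    return (1.0 - per_run_failure_rate) ** clean_runs


def runs_needed(per_run_failure_rate, max_false_good=0.05):
    """Smallest number of clean runs that pushes the false-good chance
    down to `max_false_good` or below."""
    if not 0.0 < per_run_failure_rate < 1.0:
        raise ValueError("failure rate must be strictly between 0 and 1")
    return math.ceil(
        math.log(max_false_good) / math.log(1.0 - per_run_failure_rate)
    )


# Illustrative only: assume a buggy kernel fails half of its runs.
print(chance_of_false_good(0.5, 7))  # seven clean runs in a row
print(runs_needed(0.5))              # runs needed for a 5% false-good ceiling
```

The lower the per-run failure rate, the more clean runs are needed before marking a kernel as good in a bisect.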

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#16

There was no version 4.13.0-13, so I'll start a bisect between 4.13.0-12 and 4.13.0-14. I'll build a test kernel and post it shortly.

Changed in linux (Ubuntu Artful):
status:	Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status:	Triaged → In Progress

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#17

Hmm, now that I looked at the commits between 4.13.0-12 and 4.13.0-14, bug 1734327 looks similar. I built a test kernel already for that bug, and was wondering if you could test it.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1734327/revert-test

Can you test that kernel and report back if it has the bug or not?

Rod Smith (rodsmith) wrote on 2017-12-20:

#18

That one failed (the script stopped running after taking CPU 9 offline) with the following dmesg output:

[  119.360953] ------------[ cut here ]------------
[  119.360955] kernel BUG at /home/jsalisbury/bugs/lp1734327/ac8f82a-revert-test/ubuntu-artful/mm/slub.c:3878!
[  119.361405] invalid opcode: 0000 [#1] SMP
[  119.361586] Modules linked in: nls_iso8859_1 kvm_intel kvm irqbypass joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic pcbc igb usbhid dca aesni_intel hid aes_x86_64 crypto_simd glue_helper ptp cryptd ahci pps_core libahci i2c_algo_bit megaraid_sas
[  119.363826] CPU: 9 PID: 63 Comm: cpuhp/9 Not tainted 4.13.0-19-generic #22~lp1731031TwoReverts
[  119.364209] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  119.364687] task: ffff98cff8b49740 task.stack: ffffb3274c7b8000
[  119.364973] RIP: 0010:kfree+0x11c/0x160
[  119.365133] RSP: 0018:ffffb3274c7bbd60 EFLAGS: 00010246
[  119.365356] RAX: fffff57a3bff0020 RBX: ffff98cf00000000 RCX: 0000000000000490
[  119.365663] RDX: 0000000000000000 RSI: ffff98cfff25f4a0 RDI: 0000676f80000000
[  119.365964] RBP: ffffb3274c7bbd78 R08: 000000000001f4a0 R09: ffffffffbb5dcf6a
[  119.366262] R10: fffff57a3c000000 R11: 0000000000000000 R12: ffffffffbbf98e60
[  119.366552] R13: ffffffffbb1dd820 R14: 0000000000000009 R15: 0000000000000001
[  119.366844] FS:  0000000000000000(0000) GS:ffff98cfff240000(0000) knlGS:0000000000000000
[  119.367176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  119.367412] CR2: 000055cc84772018 CR3: 0000000e48e09000 CR4: 00000000001406e0
[  119.367706] Call Trace:
[  119.367824]  kfree_const+0x20/0x30
[  119.367975]  kobject_put+0x91/0x1a0
[  119.368134]  device_unregister+0x28/0x60
[  119.368311]  cpu_cache_sysfs_exit+0x5a/0xc0
[  119.368486]  ? free_cache_attributes.part.7+0x110/0x110
[  119.368709]  cacheinfo_cpu_pre_down+0x48/0x50
[  119.368897]  cpuhp_invoke_callback+0x84/0x3b0
[  119.369082]  cpuhp_down_callbacks+0x42/0x80
[  119.369253]  cpuhp_thread_fun+0x88/0xe0
[  119.369433]  smpboot_thread_fn+0xec/0x160
[  119.369598]  kthread+0x125/0x140
[  119.369732]  ? sort_range+0x30/0x30
[  119.369882]  ? kthread_create_on_node+0x70/0x70
[  119.370075]  ret_from_fork+0x25/0x30
[  119.370233] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 1c 
[  119.371052] RIP: kfree+0x11c/0x160 RSP: ffffb3274c7bbd60
[  119.371313] ---[ end trace edef5d0868ec0d2a ]---

The system continued to run, and I was able to issue other commands (ifconfig, efibootmgr), but I rebooted just to be safe.

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#19

I started a kernel bisect between v4.13.0-12 and v4.13.0-14. The kernel bisect will require testing of about 7-10 test kernels.

I built the first test kernel, up to the following commit:
1c8d41925cff57972056048511a451040fa3b790

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance
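
The estimate of 7-10 test kernels follows from how a bisect works: each test halves the range of candidate commits, so the number of tests grows with the base-2 logarithm of the range size. The sketch below illustrates the idea on a plain list. It assumes a linear history and a test that always gives the right answer, which real `git bisect` does not require and which an intermittent bug like this one does not guarantee. The commit names and the count of 200 are invented for the example.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns (first_bad_commit, number_of_tests_run).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at or before mid
        else:
            good = mid  # regression is after mid
    return commits[bad], tests


# Hypothetical range of 200 commits with the regression introduced at c137.
commits = [f"c{i}" for i in range(200)]
culprit, tests = bisect_first_bad(commits, lambda c: int(c[1:]) >= 137)
print(culprit, tests)
```

With an unreliable test, a single false "good" sends the search into the wrong half, which is why each test kernel is worth running several times before reporting a result.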

Rod Smith (rodsmith) wrote on 2017-12-20:

#20

There's nothing at the URL you posted, Joseph. Do I just need to give it more time to build, or is something wrong?

Joseph Salisbury (jsalisbury) wrote on 2017-12-20:

#21

Sorry, the packages should be there now. You should only need the linux-image and linux-image-extra .deb files.

Rod Smith (rodsmith) wrote on 2017-12-21:

#22

OK, I've run tests now. The system did not crash or otherwise misbehave, but the dmesg output was quite verbose, and included "error -19" messages. Here's a sample (apparently for just one CPU core; this sequence was repeated quite a few times):

[  439.341956] smpboot: Booting Node 1 Processor 31 APIC 0x1f
[  439.354783] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  439.354795] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  439.354814] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  439.354836] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[  439.354849] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  439.354853] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  439.354859] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  439.354866] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  439.354870] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  439.354876] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  439.354882] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  439.354886] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  439.354892] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  439.354898] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  439.354902] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  439.354909] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  439.354915] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  439.354919] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  439.354925] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  439.354931] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  439.354936] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  439.354942] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  439.354948] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  439.354953] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  439.354960] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  439.354965] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[  439.354978] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[  439.354991] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[  439.355003] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[  439.355016] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[  439.355029] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  439.355033] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  439.355039] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  439.355046] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  439.355049] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  439.355055] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  439.355062] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  439.355067] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  439.355073] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  439.355079] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  439.355084] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  439.355090] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  439.355095] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  439.355101] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  439.355107] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  439.355112] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  439.355117] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  439.355123] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  439.355355] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[  439.355601] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  439.355629] EDAC sbridge: Some needed devices are missing
[  439.382001] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  439.398059] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  439.398115] EDAC sbridge: Couldn't find mci handler
[  439.399135] EDAC sbridge: Couldn't find mci handler
[  439.399887] EDAC sbridge: Failed to register device with error -19.

Joseph Salisbury (jsalisbury) wrote on 2017-12-21:

#23

I built the next test kernel, up to the following commit:
8d9d2235a82ea41e65eff607005ea4f334e2e503

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Rod Smith (rodsmith) wrote on 2017-12-21:

#24

That one completed one run of the test OK, but then crashed on the second one, when bringing CPU 15 back online, with the following dmesg output:

[  160.596312] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[  160.596537] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  160.596679] EDAC sbridge: Some needed devices are missing
[  160.627089] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  160.651100] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  160.651271] EDAC sbridge: Couldn't find mci handler
[  160.651422] EDAC sbridge: Couldn't find mci handler
[  160.651572] EDAC sbridge: Failed to register device with error -19.
[  161.099074] BUG: unable to handle kernel paging request at 0000000180040100
[  161.099512] IP: __kmalloc_node+0x135/0x2a0
[  161.099704] PGD 1ff1f01067 
[  161.099705] P4D 1ff1f01067 
[  161.099871] PUD 0

[  161.100373] Oops: 0000 [#2] SMP
[  161.100548] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp intel_cstate kvm_intel kvm irqbypass intel_rapl_perf joydev input_leds ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler mei_me mei shpchp lpc_ich acpi_pad mac_hid acpi_power_meter ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas fnic crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc igb hid_generic drm_kms_helper aesni_intel dca syscopyarea i2c_algo_bit sysfillrect aes_x86_64 sysimgblt usbhid libfcoe crypto_simd fb_sys_fops ahci ptp glue_helper hid mxm_wmi libfc cryptd libahci
[  161.102507]  pps_core drm enic scsi_transport_fc megaraid_sas wmi
[  161.102856] CPU: 2 PID: 3686 Comm: python3 Tainted: G      D         4.13.0-13-generic #14~lp1733662Commit8d9d2235a82ea41
[  161.103230] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  161.103624] task: ffff8f3de5989740 task.stack: ffffa3a7ce288000
[  161.104024] RIP: 0010:__kmalloc_node+0x135/0x2a0
[  161.104431] RSP: 0018:ffffa3a7ce28bc30 EFLAGS: 00010246
[  161.104846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000f95
[  161.105274] RDX: 0000000000000f94 RSI: 0000000000000000 RDI: 000000000001f3e0
[  161.105705] RBP: ffffa3a7ce28bc70 R08: ffff8f3dffc9f3e0 R09: ffff8f3dff807c00
[  161.106148] R10: ffffffffbb017760 R11: ffff8f5df8fa21f2 R12: 00000000014080c0
[  161.106599] R13: 0000000000000008 R14: 0000000180040100 R15: ffff8f3dff807c00
[  161.107057] FS:  00007f7849b98700(0000) GS:ffff8f3dffc80000(0000) knlGS:0000000000000000
[  161.107530] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.108014] CR2: 0000000180040100 CR3: 0000001ff6e6e000 CR4: 00000000001406e0
[  161.108509] Call Trace:
[  161.109012]  ? alloc_cpumask_var_node+0x1f/0x30
[  161.109523]  ? on_each_cpu_cond+0x160/0x160
[  161.110036]  alloc_cpumask_var_node+0x1f/0x30
[  161.110558]  zalloc_cpumask_var_node+0xf/0x20
[  161.111084]  smpcfd_prepare_cpu+0x64/0xc0
[  161.111615]  cpuhp_invoke_callback+0x84/0x3b0
[  161.112151]  cpuhp_up_callbacks+0x36/0xc0
[  161.112690]  _cpu_up+0x87/0xd0
[  161.113235]  do_cpu_up+0x8b/0xb0
[  161.113785]  cpu_up+0x13/0x20
[  161.114342]  cpu_subsys_online+0x3d/0x90
[  161.114881]  device_online+0x4a/0x90
[  161.115422]  online_store+0x89/0xa0
[  161.115951]  dev_attr_store+0x18/0x30
[  161.116472]  sysfs_kf_write+0x37/0x40
[  161.116994]  kernfs_fop_write+0x11c/0x1a0
[  161.117510]  __vfs_write+0x18/0x40
[  161.118029]  vfs_write+0xb1/0x1a0
[  161.118544]  SyS_write+0x55/0xc0
[  161.119062]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  161.119581] RIP: 0033:0x7f78497784a0
[  161.120081] RSP: 002b:00007fff6e69ed48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  161.120602] RAX: ffffffffffffffda RBX: 0000000001ea8410 RCX: 00007f78497784a0
[  161.121129] RDX: 0000000000000002 RSI: 0000000001fbe400 RDI: 0000000000000003
[  161.121666] RBP: 0000000000a3e020 R08: 0000000000000000 R09: 0000000000000001
[  161.122202] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[  161.122720] R13: 0000000000501520 R14: 00007fff6e69f1b0 R15: 00007f7848690240
[  161.123226] Code: 89 cf 4c 89 4d c0 e8 0b 7f 01 00 49 89 c7 4c 8b 4d c0 4d 85 ff 0f 85 47 ff ff ff 45 31 f6 eb 3c 49 63 47 20 49 8b 3f 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 20 ff 
[  161.124251] RIP: __kmalloc_node+0x135/0x2a0 RSP: ffffa3a7ce28bc30
[  161.124738] CR2: 0000000180040100
[  161.125220] ---[ end trace 1246d63efc5b2bf0 ]---

Rather than hang, as has happened before, the script crashed ("Killed" was displayed and I was dropped back to a bash prompt). The system behaved unreliably and I was forced to reboot it via its BMC.
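The call trace above enters the kernel through a write to a sysfs `online` attribute (`online_store`, then `cpu_up`). The test script itself is not shown in this thread, so the following is only a minimal sketch of that kind of offline/online cycle, assuming the standard `/sys/devices/system/cpu/cpuN/online` control files. The function names are hypothetical, and the `root` parameter exists only so the logic can be tried against a scratch directory.

```python
from pathlib import Path

# Standard sysfs location of the per-CPU control files on Linux.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def hotpluggable_cpus(root=SYSFS_CPU):
    """Return the CPU numbers that expose an 'online' control file.

    CPU 0 often has no such file, which is why it is never taken offline.
    """
    cpus = []
    for entry in root.glob("cpu[0-9]*"):
        if (entry / "online").exists():
            cpus.append(int(entry.name[3:]))
    return sorted(cpus)


def set_online(cpu, online, root=SYSFS_CPU):
    """Write 1 or 0 to cpuN/online and confirm the state by reading it back."""
    wanted = "1" if online else "0"
    control = root / f"cpu{cpu}" / "online"
    control.write_text(wanted)
    return control.read_text().strip() == wanted


def cycle_cpus(root=SYSFS_CPU):
    """Take every hot-pluggable CPU offline, then bring each one back online.

    Returns a list of (cpu, direction) pairs for transitions that did not stick.
    """
    cpus = hotpluggable_cpus(root)
    failures = []
    for cpu in cpus:
        if not set_online(cpu, False, root):
            failures.append((cpu, "offline"))
    for cpu in cpus:
        if not set_online(cpu, True, root):
            failures.append((cpu, "online"))
    return failures
```

Calling `cycle_cpus()` against the real sysfs tree requires root. On an affected kernel it may hang or be killed partway through the online phase, as described above, so it is best run on a machine that can be reset out of band.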

Joseph Salisbury (jsalisbury) wrote on 2017-12-21:

#25

I built the next test kernel, up to the following commit:
83d4a97746e5fac08e2a1498c3649586bab953a3

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Rod Smith (rodsmith) wrote on 2017-12-21:

#26

The build from http://kernel.ubuntu.com/~jsalisbury/lp1733662/ successfully completed about six runs of the test script, albeit with the verbose dmesg output that includes the "EDAC sbridge: Failed to register device with error -19" messages.
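For readers unfamiliar with the numeric code: kernel functions conventionally return negated errno values, and errno 19 on Linux is ENODEV. A small sketch for decoding such codes follows. The helper name is hypothetical, and nothing in this thread establishes why the driver returns that code here.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -19 to (errno name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    print(describe_kernel_error(-19))
```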

Joseph Salisbury (jsalisbury) wrote on 2017-12-21:

#27

Thanks for testing. I'll mark that kernel as good. I think it's safe to ignore the "error -19" messages during the bisect. We just need to tell the bisect whether the kernel exhibits the original bug or not.
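The good/bad rule described here can be sketched as a filter over saved dmesg output. This is only an illustration, not a tool used in this bug: the marker strings are copied from the oops output quoted in this thread, the function name is hypothetical, and a run that hangs without logging an oops would still have to be marked bad by hand.

```python
import re

# Crash markers copied from the oops output quoted in this thread.
CRASH_MARKERS = (
    re.compile(r"kernel BUG at "),
    re.compile(r"BUG: unable to handle kernel paging request"),
    re.compile(r"Oops: "),
    re.compile(r"invalid opcode: "),
)


def bisect_verdict(dmesg_text):
    """Return 'bad' if the log contains a crash marker, else 'good'.

    EDAC lines, including the "error -19" message, are skipped because they
    appear on kernels that pass the test as well as on kernels that fail it.
    """
    for line in dmesg_text.splitlines():
        if "EDAC" in line:
            continue
        if any(marker.search(line) for marker in CRASH_MARKERS):
            return "bad"
    return "good"
```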

I built the next test kernel, up to the following commit:
97327adfdaf5d72053b1ce8d0847e93706c10dc6

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2017-12-22:

#28

That one hung much like the others, with the system responding only to very basic commands (mostly bash internals), although the dmesg output continued further after the kernel bug message. Here's the dmesg output:

[  107.652875] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[  107.652995] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  107.653010] EDAC sbridge: Some needed devices are missing
[  107.675559] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  107.703606] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  107.703639] EDAC sbridge: Couldn't find mci handler
[  107.704195] EDAC sbridge: Couldn't find mci handler
[  107.704618] EDAC sbridge: Failed to register device with error -19.
[  108.163612] smpboot: Booting Node 1 Processor 8 APIC 0x10
[  108.189804] intel_rapl: Found RAPL domain package
[  108.189810] intel_rapl: Found RAPL domain dram
[  108.189812] intel_rapl: DRAM domain energy unit 15300pj
[  108.190389] ------------[ cut here ]------------
[  108.190390] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  108.191016] invalid opcode: 0000 [#1] SMP
[  108.191511] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm input_leds irqbypass joydev mei_me intel_cstate ipmi_si intel_rapl_perf shpchp acpi_power_meter ipmi_devintf ipmi_msghandler mei lpc_ich mac_hid acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul mgag200 ghash_clmulni_intel ttm pcbc fnic hid_generic drm_kms_helper igb syscopyarea aesni_intel usbhid dca sysfillrect i2c_algo_bit sysimgblt aes_x86_64 ptp fb_sys_fops crypto_simd mxm_wmi hid libfcoe glue_helper ahci cryptd libfc drm
[  108.195174]  libahci pps_core enic scsi_transport_fc megaraid_sas wmi
[  108.195756] CPU: 8 PID: 302 Comm: kworker/8:3 Not tainted 4.13.0-13-generic #14~lp1733662Commit97327adfdaf5d
[  108.196353] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  108.196971] Workqueue: events cpuset_hotplug_workfn
[  108.197583] task: ffff8e3432fcae80 task.stack: ffffb5fb4e104000
[  108.198236] RIP: 0010:kfree+0x11c/0x160
[  108.198861] RSP: 0000:ffffb5fb4e107cc8 EFLAGS: 00010246
[  108.199485] RAX: fffffb0ffeff0020 RBX: ffff8e3400000000 RCX: 000000018020001d
[  108.200121] RDX: 0000000000000000 RSI: fffffb0fffd33600 RDI: 0000720b40000000
[  108.200764] RBP: ffffb5fb4e107ce0 R08: ffff8e3434cd8c00 R09: 000000018020001d
[  108.201405] R10: fffffb0fff000000 R11: 0000000000000000 R12: ffff8e343254f058
[  108.202053] R13: ffffffff876ce3d3 R14: ffff8e34382b6d10 R15: 0000000000000000
[  108.202703] FS:  0000000000000000(0000) GS:ffff8e343f200000(0000) knlGS:0000000000000000
[  108.203367] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  108.204031] CR2: 0000000000000000 CR3: 00000026c7609000 CR4: 00000000001406e0
[  108.204702] Call Trace:
[  108.205377]  sd_free_ctl_entry+0x63/0x70
[  108.206054]  sd_free_ctl_entry+0x53/0x70
[  108.206727]  unregister_sched_domain_sysctl+0x36/0x40
[  108.207396]  partition_sched_domains+0x34/0x2f0
[  108.208070]  rebuild_sched_domains_locked+0x5a/0x80
[  108.208771]  rebuild_sched_domains+0x1a/0x30
[  108.209442]  cpuset_hotplug_workfn+0x1b1/0xd30
[  108.210119]  ? mutex_lock+0x12/0x40
[  108.210790]  process_one_work+0x1e7/0x410
[  108.211470]  worker_thread+0x4a/0x410
[  108.212146]  kthread+0x125/0x140
[  108.212814]  ? process_one_work+0x410/0x410
[  108.213482]  ? kthread_create_on_node+0x70/0x70
[  108.214159]  ret_from_fork+0x25/0x30
[  108.214830] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  108.216269] RIP: kfree+0x11c/0x160 RSP: ffffb5fb4e107cc8
[  108.217042] ---[ end trace 8c27258fb7e406c8 ]---
[  108.225116] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  108.225880] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  108.226442] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
[  108.227006] EDAC sbridge: Seeking for: PCI ID 8086:2f60
[  108.227623] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  108.228288] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  108.228809] EDAC sbridge: Seeking for: PCI ID 8086:2fa8
[  108.229313] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  108.229794] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  108.230253] EDAC sbridge: Seeking for: PCI ID 8086:2f71
[  108.230692] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  108.231110] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  108.231552] EDAC sbridge: Seeking for: PCI ID 8086:2faa
[  108.232040] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  108.232421] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  108.232770] EDAC sbridge: Seeking for: PCI ID 8086:2fab
[  108.233103] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  108.233415] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  108.233723] EDAC sbridge: Seeking for: PCI ID 8086:2fac
[  108.234021] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  108.234314] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  108.234597] EDAC sbridge: Seeking for: PCI ID 8086:2fad
[  108.234872] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  108.235140] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  108.235397] EDAC sbridge: Seeking for: PCI ID 8086:2f68
[  108.235697] EDAC sbridge: Seeking for: PCI ID 8086:2f79
[  108.236003] EDAC sbridge: Seeking for: PCI ID 8086:2f6a
[  108.236293] EDAC sbridge: Seeking for: PCI ID 8086:2f6b
[  108.236529] EDAC sbridge: Seeking for: PCI ID 8086:2f6c
[  108.236752] EDAC sbridge: Seeking for: PCI ID 8086:2f6d
[  108.236971] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  108.237183] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  108.237390] EDAC sbridge: Seeking for: PCI ID 8086:2ffc
[  108.237589] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  108.237798] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  108.237987] EDAC sbridge: Seeking for: PCI ID 8086:2ffd
[  108.238172] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  108.238349] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  108.238520] EDAC sbridge: Seeking for: PCI ID 8086:2fbd
[  108.238689] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  108.238852] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  108.239011] EDAC sbridge: Seeking for: PCI ID 8086:2fbf
[  108.239164] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  108.239309] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  108.239439] EDAC sbridge: Seeking for: PCI ID 8086:2fb9
[  108.239583] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  108.239690] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  108.239799] EDAC sbridge: Seeking for: PCI ID 8086:2fbb
[  108.239998] EDAC MC0: Giving out device to module sb_edac.c controller Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0 (INTERRUPT)
[  108.240338] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  108.240473] EDAC sbridge: Some needed devices are missing
[  108.267599] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  108.303631] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  108.304152] EDAC sbridge: Couldn't find mci handler
[  108.304369] EDAC sbridge: Couldn't find mci handler
[  108.304577] EDAC sbridge: Failed to register device with error -19.
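When comparing logs from different test kernels, it can help to reduce each one to a short signature: the source location named in any "kernel BUG at" line and the kernel symbol in each RIP line. The following is a rough sketch under that assumption. The regular expressions were written against the log format shown in this thread only, and the function name is hypothetical.

```python
import re

# "kernel BUG at <path>:<line>!" as printed in the logs in this thread.
BUG_AT = re.compile(r"kernel BUG at (\S+):(\d+)!")
# "RIP: 0010:kfree+0x11c/0x160" or "RIP: kfree+0x11c/0x160"; user-space RIP
# lines such as "RIP: 0033:0x7f..." carry no symbol and are not matched.
RIP = re.compile(r"RIP: (?:[0-9a-f]{4}:)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def oops_signature(dmesg_text):
    """Return (sorted BUG sites as 'file:line', sorted faulting kernel symbols)."""
    bug_sites = set()
    rip_symbols = set()
    for line in dmesg_text.splitlines():
        bug = BUG_AT.search(line)
        if bug:
            path, lineno = bug.groups()
            bug_sites.add(f"{path.rsplit('/', 1)[-1]}:{lineno}")
        rip = RIP.search(line)
        if rip:
            rip_symbols.add(rip.group(1))
    return sorted(bug_sites), sorted(rip_symbols)
```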

Joseph Salisbury (jsalisbury) wrote on 2017-12-22:

#29

I built the next test kernel, up to the following commit:
646779c79c8ab1382f81d79d346937c51746b07e

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2017-12-22:

#30

That one ran our test script half a dozen times without failure, albeit with the "Error -19" messages in the dmesg output.

Note that I'm about to EOD, so I probably won't get to the next one until next year. Have a good holiday, Joseph!

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#31

I hope you had a good holiday, Rod. I started up the bisect again.

I built the next test kernel, up to the following commit:
9ebf47f152918cce0caaa9c2767656635fbf59e4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2018-01-04:

#32

Thanks, Joseph. My break was good; I hope yours was, too!

That latest version you posted completed half a dozen runs of the test script without incident, aside from the "error -19" messages.

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#33

The bisect should only require testing about 2 or 3 more kernels.
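As a rough illustration of where an estimate like this comes from: each good/bad result halves the set of candidate commits, so isolating one commit among n candidates takes about ceil(log2(n)) tests in the worst case. The sketch below assumes a simple linear history, and the function name is hypothetical. The number of commits actually remaining at this point is not stated in the thread.

```python
def bisect_tests_remaining(candidates):
    """Worst-case number of good/bad tests needed to isolate one commit.

    Equivalent to ceil(log2(candidates)), computed with integer arithmetic.
    """
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    return (candidates - 1).bit_length()


if __name__ == "__main__":
    for n in (3, 4, 5, 8):
        print(n, "candidates ->", bisect_tests_remaining(n), "tests")
```

Under this model, "2 or 3 more kernels" would correspond to somewhere between 3 and 8 candidate commits left.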

I built the next test kernel, up to the following commit:
aa0998e265482fd260b188dea8c03e7dd7c83c72

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2018-01-04:

#34

Joseph, that one also completed six runs with no problems except the "error -19" messages.

Joseph Salisbury (jsalisbury) wrote on 2018-01-04:

#35

I built the next test kernel, up to the following commit:
e6108d5475696d0deaf37b59ff704aead9c5a8a7

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2018-01-05:

#36

That one completed its first run, but then crashed when bringing CPU 14 back online, with the following dmesg output:

[  163.176945] ------------[ cut here ]------------
[  163.176949] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  163.178043] invalid opcode: 0000 [#1] SMP
[  163.178995] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate joydev input_leds shpchp ipmi_ssif intel_rapl_perf acpi_power_meter lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas mgag200 ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel syscopyarea pcbc sysfillrect fnic aesni_intel hid_generic sysimgblt igb fb_sys_fops aes_x86_64 dca usbhid crypto_simd i2c_algo_bit glue_helper libfcoe hid ahci ptp libfc mxm_wmi cryptd libahci
[  163.186785]  drm pps_core enic scsi_transport_fc megaraid_sas wmi
[  163.188025] CPU: 14 PID: 93 Comm: cpuhp/14 Not tainted 4.13.0-13-generic #14~lp1733662Commite6108d5475696
[  163.189294] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  163.190606] task: ffff8dbaf809c5c0 task.stack: ffffae2acc8a8000
[  163.191926] RIP: 0010:kfree+0x11c/0x160
[  163.193255] RSP: 0000:ffffae2acc8abb80 EFLAGS: 00010246
[  163.194600] RAX: fffff9cb3bff0020 RBX: ffff8dba00000000 RCX: ffffae2acc8abb60
[  163.195954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000728480000000
[  163.197311] RBP: ffffae2acc8abb98 R08: ffffae2acc8abaec R09: 0000000000000002
[  163.198703] R10: fffff9cb3c000000 R11: 0000000000000000 R12: ffff8d9aff94beb0
[  163.200096] R13: ffffffffa6f2034b R14: ffff8dbaf27e4318 R15: ffff8dbaf27e4200
[  163.201497] FS:  0000000000000000(0000) GS:ffff8dbaff380000(0000) knlGS:0000000000000000
[  163.202919] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  163.204351] CR2: 0000000000000000 CR3: 000000101aa09000 CR4: 00000000001406e0
[  163.205802] Call Trace:
[  163.207253]  acpi_ns_get_node_unlocked+0xac/0xd8
[  163.208704]  ? kernfs_add_one+0xe4/0x130
[  163.210183]  ? down_timeout+0x37/0x60
[  163.211644]  ? acpi_os_wait_semaphore+0x4c/0x70
[  163.213098]  acpi_ns_get_node+0x41/0x58
[  163.214550]  ? acpi_ns_get_node+0x41/0x58
[  163.216016]  acpi_get_handle+0x95/0xbe
[  163.217486]  acpi_has_method+0x25/0x40
[  163.218932]  acpi_processor_get_performance_info+0x57/0x580
[  163.220391]  ? wrmsrl_on_cpu+0x57/0x70
[  163.221870]  acpi_processor_register_performance+0x5e/0xd0
[  163.223354]  __intel_pstate_cpu_init.part.16+0xed/0x2e0
[  163.224835]  ? intel_pstate_init_cpu+0xc9/0x2d0
[  163.226323]  intel_pstate_cpu_init+0x24/0x40
[  163.227819]  cpufreq_online+0xd8/0x750
[  163.229301]  ? cpufreq_online+0x750/0x750
[  163.230781]  cpuhp_cpufreq_online+0xe/0x20
[  163.232262]  cpuhp_invoke_callback+0x84/0x3b0
[  163.233758]  cpuhp_up_callbacks+0x36/0xc0
[  163.235254]  cpuhp_thread_fun+0xd4/0xe0
[  163.236731]  smpboot_thread_fn+0xec/0x160
[  163.238210]  kthread+0x125/0x140
[  163.239693]  ? sort_range+0x30/0x30
[  163.241165]  ? kthread_create_on_node+0x70/0x70
[  163.242629]  ret_from_fork+0x25/0x30
[  163.244061] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  163.247030] RIP: kfree+0x11c/0x160 RSP: ffffae2acc8abb80
[  163.248463] ---[ end trace e22fa4721cb983b5 ]---
[  168.454846] ------------[ cut here ]------------
[  168.456219] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  168.457561] invalid opcode: 0000 [#2] SMP
[  168.458849] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate joydev input_leds shpchp ipmi_ssif intel_rapl_perf acpi_power_meter lpc_ich ipmi_si ipmi_devintf ipmi_msghandler acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas mgag200 ttm drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel syscopyarea pcbc sysfillrect fnic aesni_intel hid_generic sysimgblt igb fb_sys_fops aes_x86_64 dca usbhid crypto_simd i2c_algo_bit glue_helper libfcoe hid ahci ptp libfc mxm_wmi cryptd libahci
[  168.468659]  drm pps_core enic scsi_transport_fc megaraid_sas wmi
[  168.470126] CPU: 0 PID: 2683 Comm: irqbalance Tainted: G      D         4.13.0-13-generic #14~lp1733662Commite6108d5475696
[  168.471648] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  168.473183] task: ffff8dbae2bf9740 task.stack: ffffae2acf51c000
[  168.474734] RIP: 0010:kfree+0x11c/0x160
[  168.476246] RSP: 0018:ffffae2acf51fa08 EFLAGS: 00010246
[  168.477765] RAX: fffff9cb3bff0020 RBX: ffff8dba00000000 RCX: 0000000000000000
[  168.479292] RDX: 0000000000000000 RSI: ffff8dbae313ed10 RDI: 0000728480000000
[  168.480797] RBP: ffffae2acf51fa20 R08: ffff8dbae2a5bac8 R09: 0000000180220021
[  168.482306] R10: fffff9cb3c000000 R11: 0000000000000001 R12: ffff8dbaf2f60960
[  168.483831] R13: ffffffffa6bdd4e0 R14: ffff8dbae33fbcd8 R15: ffff8dbae33fae00
[  168.485365] FS:  00007f342d25a740(0000) GS:ffff8d9affc00000(0000) knlGS:0000000000000000
[  168.486926] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.488478] CR2: 0000560651c9f3a8 CR3: 0000003ff4879000 CR4: 00000000001406f0
[  168.490066] Call Trace:
[  168.491641]  kfree_const+0x20/0x30
[  168.493227]  kernfs_put+0x71/0x180
[  168.494793]  kernfs_dop_release+0x12/0x20
[  168.496367]  __dentry_kill+0xe5/0x150
[  168.497925]  shrink_dentry_list+0x11f/0x2e0
[  168.499478]  d_invalidate+0x67/0x110
[  168.501018]  lookup_fast+0x2b9/0x310
[  168.502552]  ? dput.part.23+0x2d/0x1e0
[  168.504096]  walk_component+0x49/0x340
[  168.505624]  ? kernfs_iop_permission+0x4f/0x60
[  168.507170]  link_path_walk+0x1bc/0x590
[  168.508703]  ? path_init+0x177/0x2f0
[  168.510248]  path_lookupat+0x56/0x1f0
[  168.511794]  filename_lookup+0xb6/0x190
[  168.513341]  ? sprintf+0x51/0x70
[  168.514885]  ? __check_object_size+0xaf/0x1b0
[  168.516429]  ? strncpy_from_user+0x4d/0x170
[  168.517968]  user_path_at_empty+0x36/0x40
[  168.519514]  ? user_path_at_empty+0x36/0x40
[  168.521020]  vfs_statx+0x76/0xe0
[  168.522481]  SYSC_newstat+0x3d/0x70
[  168.523922]  ? ____fput+0xe/0x10
[  168.525346]  ? task_work_run+0x7b/0x90
[  168.526777]  ? exit_to_usermode_loop+0x9b/0xd0
[  168.528186]  SyS_newstat+0xe/0x10
[  168.529565]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  168.530924] RIP: 0033:0x7f342c34abb5
[  168.532229] RSP: 002b:00007ffcd3f64668 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  168.533535] RAX: ffffffffffffffda RBX: 0000000000b95fa0 RCX: 00007f342c34abb5
[  168.534805] RDX: 00007ffcd3f646c0 RSI: 00007ffcd3f646c0 RDI: 00007ffcd3f65f50
[  168.536043] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000038
[  168.537240] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  168.538390] R13: 00007ffcd3f64f6b R14: 0000000000b95fa0 R15: 0000000000b96250
[  168.539524] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  168.541855] RIP: kfree+0x11c/0x160 RSP: ffffae2acf51fa08
[  168.543000] ---[ end trace e22fa4721cb983b6 ]---

The system is semi-responsive; bash continues to run, but most external commands seem to hang. Thus, I've rebooted via the BMC.

Joseph Salisbury (jsalisbury) wrote on 2018-01-05:

#37

I built the next test kernel, up to the following commit:
ac2fc5adab0f4b83f01214af61c8478c6ef186f9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2018-01-05:

#38

That one completed two runs, but on the second run, dmesg included the following message at one point:

[  240.841694] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  240.842765] invalid opcode: 0000 [#1] SMP
[  240.843718] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf ipmi_ssif joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter lpc_ich shpchp acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc fnic mgag200 ttm hid_generic drm_kms_helper syscopyarea igb sysfillrect aesni_intel sysimgblt usbhid libfcoe fb_sys_fops aes_x86_64 dca hid crypto_simd i2c_algo_bit mxm_wmi glue_helper ptp cryptd ahci libfc libahci
[  240.851457]  drm pps_core megaraid_sas scsi_transport_fc enic wmi
[  240.852693] CPU: 8 PID: 2724 Comm: irqbalance Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  240.853965] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  240.855281] task: ffff9b62a76645c0 task.stack: ffffb973cf6fc000
[  240.856603] RIP: 0010:kfree+0x11c/0x160
[  240.857937] RSP: 0018:ffffb973cf6ffa08 EFLAGS: 00010246
[  240.859280] RAX: fffff8803cff0020 RBX: ffff9b6200000000 RCX: 0000000000000000
[  240.860632] RDX: 0000000000000000 RSI: ffff9b62b0eb5348 RDI: 000064dcc0000000
[  240.861995] RBP: ffffb973cf6ffa20 R08: ffff9b62b22f70f0 R09: 0000000180220021
[  240.863367] R10: fffff8803d000000 R11: 0000000000000001 R12: ffff9b62b1648780
[  240.864756] R13: ffffffffb65dd4e0 R14: ffff9b62a872f0d8 R15: ffff9b62a872fac0
[  240.866145] FS:  00007ff8c4d06740(0000) GS:ffff9b62bf200000(0000) knlGS:0000000000000000
[  240.867562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  240.868986] CR2: 00007fff9ef860f8 CR3: 0000003fe7876000 CR4: 00000000001406e0
[  240.870438] Call Trace:
[  240.871882]  kfree_const+0x20/0x30
[  240.873328]  kernfs_put+0x71/0x180
[  240.874778]  kernfs_dop_release+0x12/0x20
[  240.876218]  __dentry_kill+0xe5/0x150
[  240.877644]  shrink_dentry_list+0x11f/0x2e0
[  240.879078]  d_invalidate+0x67/0x110
[  240.880526]  lookup_fast+0x2b9/0x310
[  240.881968]  ? dput.part.23+0x2d/0x1e0
[  240.883393]  walk_component+0x49/0x340
[  240.884811]  ? kernfs_iop_permission+0x4f/0x60
[  240.886253]  link_path_walk+0x1bc/0x590
[  240.887690]  ? path_init+0x177/0x2f0
[  240.889105]  path_lookupat+0x56/0x1f0
[  240.890529]  filename_lookup+0xb6/0x190
[  240.891964]  ? sprintf+0x51/0x70
[  240.893387]  ? __check_object_size+0xaf/0x1b0
[  240.894822]  ? strncpy_from_user+0x4d/0x170
[  240.896240]  user_path_at_empty+0x36/0x40
[  240.897673]  ? user_path_at_empty+0x36/0x40
[  240.899101]  vfs_statx+0x76/0xe0
[  240.900517]  SYSC_newstat+0x3d/0x70
[  240.901934]  ? ____fput+0xe/0x10
[  240.903365]  ? task_work_run+0x7b/0x90
[  240.904783]  ? exit_to_usermode_loop+0x9b/0xd0
[  240.906181]  SyS_newstat+0xe/0x10
[  240.907559]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  240.908900] RIP: 0033:0x7ff8c3df6bb5
[  240.910196] RSP: 002b:00007ffe6cf8a928 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  240.911496] RAX: ffffffffffffffda RBX: 0000000000fe9a40 RCX: 00007ff8c3df6bb5
[  240.912763] RDX: 00007ffe6cf8a980 RSI: 00007ffe6cf8a980 RDI: 00007ffe6cf8c210
[  240.913985] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000039
[  240.915181] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  240.916320] R13: 00007ffe6cf8b22b R14: 0000000000fe9a40 R15: 0000000000fe92f0
[  240.917447] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  240.919769] RIP: kfree+0x11c/0x160 RSP: ffffb973cf6ffa08
[  240.920909] ---[ end trace 67fe147f4dd931eb ]---

A third run produced a hang when offlining CPU 8, with the following dmesg output:

[  352.776303] EDAC MC1: Giving out device to module sb_edac.c controller Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0 (INTERRUPT)
[  352.776572] EDAC sbridge: Some needed devices are missing
[  352.801614] EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
[  352.825588] EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
[  352.826090] EDAC sbridge: Couldn't find mci handler
[  352.826457] EDAC sbridge: Couldn't find mci handler
[  352.826826] EDAC sbridge: Failed to register device with error -19.
[  353.286163] BUG: unable to handle kernel paging request at 0000317865646e69
[  353.286790] IP: __kmalloc_node+0x135/0x2a0
[  353.287303] PGD 0 
[  353.287304] P4D 0

[  353.288695] Oops: 0000 [#2] SMP
[  353.289158] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf ipmi_ssif joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter lpc_ich shpchp acpi_pad mac_hid mei_me mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc fnic mgag200 ttm hid_generic drm_kms_helper syscopyarea igb sysfillrect aesni_intel sysimgblt usbhid libfcoe fb_sys_fops aes_x86_64 dca hid crypto_simd i2c_algo_bit mxm_wmi glue_helper ptp cryptd ahci libfc libahci
[  353.294318]  drm pps_core megaraid_sas scsi_transport_fc enic wmi
[  353.295246] CPU: 8 PID: 56 Comm: cpuhp/8 Tainted: G      D         4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  353.296231] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  353.297274] task: ffff9b62b8fc0000 task.stack: ffffb973cc780000
[  353.298341] RIP: 0010:__kmalloc_node+0x135/0x2a0
[  353.299416] RSP: 0018:ffffb973cc783bb0 EFLAGS: 00010246
[  353.300511] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000008a2
[  353.301652] RDX: 00000000000008a1 RSI: 0000000000000000 RDI: 000000000001f3e0
[  353.302793] RBP: ffffb973cc783bf0 R08: ffff9b62bf21f3e0 R09: ffff9b42bf807c00
[  353.303960] R10: 000000000000024c R11: 0000000000020dd1 R12: 00000000014080c0
[  353.305155] R13: 0000000000000008 R14: 0000317865646e69 R15: ffff9b42bf807c00
[  353.306379] FS:  0000000000000000(0000) GS:ffff9b62bf200000(0000) knlGS:0000000000000000
[  353.307637] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  353.308901] CR2: 0000317865646e69 CR3: 0000002343409000 CR4: 00000000001406e0
[  353.310205] Call Trace:
[  353.311531]  ? alloc_cpumask_var_node+0x1f/0x30
[  353.312881]  alloc_cpumask_var_node+0x1f/0x30
[  353.314245]  zalloc_cpumask_var+0x14/0x20
[  353.315616]  cpudl_init+0x6a/0xe0
[  353.316992]  init_rootdomain+0x7a/0xd0
[  353.318393]  build_sched_domains+0x26a/0xdd0
[  353.319817]  ? call_rcu_sched+0x17/0x20
[  353.321249]  ? cpu_attach_domain+0x1af/0x6a0
[  353.322698]  ? kfree+0x14a/0x160
[  353.324146]  partition_sched_domains+0x1c6/0x2f0
[  353.325623]  ? sched_cpu_activate+0xd0/0xd0
[  353.327122]  cpuset_update_active_cpus+0x17/0x40
[  353.328583]  sched_cpu_deactivate+0x94/0xd0
[  353.330052]  ? call_rcu_bh+0x20/0x20
[  353.331495]  ? call_rcu_bh+0x20/0x20
[  353.332894]  ? trace_raw_output_rcu_utilization+0x50/0x50
[  353.334320]  ? pick_next_task_fair+0x48e/0x560
[  353.335736]  cpuhp_invoke_callback+0x84/0x3b0
[  353.337164]  cpuhp_down_callbacks+0x42/0x80
[  353.338579]  cpuhp_thread_fun+0x88/0xe0
[  353.339971]  smpboot_thread_fn+0xec/0x160
[  353.341346]  kthread+0x125/0x140
[  353.342723]  ? sort_range+0x30/0x30
[  353.344106]  ? kthread_create_on_node+0x70/0x70
[  353.345521]  ret_from_fork+0x25/0x30
[  353.346928] Code: 89 cf 4c 89 4d c0 e8 0b 7f 01 00 49 89 c7 4c 8b 4d c0 4d 85 ff 0f 85 47 ff ff ff 45 31 f6 eb 3c 49 63 47 20 49 8b 3f 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 20 ff 
[  353.349833] RIP: __kmalloc_node+0x135/0x2a0 RSP: ffffb973cc783bb0
[  353.351218] CR2: 0000317865646e69
[  353.352559] ---[ end trace 67fe147f4dd931ec ]---
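
A side note on the faulting address in this oops: 0000317865646e69 consists entirely of printable ASCII bytes. Read as a little-endian 64-bit value (the x86-64 byte order), it spells a short string. The sketch below decodes it; the helper name decode_address_as_ascii is made up for illustration. One possible reading, not confirmed anywhere in this report, is that the allocator followed a free-list pointer that had been overwritten with string data, which would be consistent with a use-after-free. The decoded text resembles the index0/index1 directory names under each CPU's sysfs cache directory, but that link is a guess.

```python
def decode_address_as_ascii(value, width=8):
    """Interpret an integer as little-endian bytes and return them as text.

    Trailing NUL bytes are dropped; non-ASCII bytes are shown as U+FFFD.
    """
    raw = value.to_bytes(width, "little")
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


# Faulting address reported in CR2 and in the "unable to handle kernel
# paging request" line of the oops above.
fault_address = 0x0000317865646E69
print(decode_address_as_ascii(fault_address))  # prints "index1"
```

The same address shows up in R14 of this oops, so the bad value was apparently being used as a pointer rather than computed from one.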

Although the test script hung, I was able to continue using my other terminal normally, run other programs, log out, log back in, etc. An attempt to reboot ("sudo shutdown -h now") did not succeed; the system hung with "[ OK ] Stopped target Multi-User System" on the console. After forcing a restart via the BMC, I ran the test script again, which completed one run but then hung on the second run, with limited functionality thereafter. The dmesg output on the second run included the following:

[  103.752641] ------------[ cut here ]------------
[  103.752643] kernel BUG at /home/jsalisbury/bugs/lp1733662/ubuntu-artful/mm/slub.c:3878!
[  103.753548] invalid opcode: 0000 [#1] SMP
[  103.754440] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp joydev input_leds intel_cstate ipmi_si intel_rapl_perf mei_me ipmi_devintf ipmi_msghandler kvm_intel kvm irqbypass mei mac_hid shpchp acpi_power_meter lpc_ich acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas crct10dif_pclmul mgag200 crc32_pclmul igb ttm hid_generic ghash_clmulni_intel drm_kms_helper fnic pcbc usbhid dca syscopyarea aesni_intel sysfillrect i2c_algo_bit sysimgblt fb_sys_fops hid libfcoe aes_x86_64 ahci ptp crypto_simd libfc glue_helper mxm_wmi cryptd drm
[  103.762134]  libahci pps_core enic scsi_transport_fc megaraid_sas wmi
[  103.763369] CPU: 0 PID: 3649 Comm: python3 Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  103.764641] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  103.765948] task: ffff8e90a5999740 task.stack: ffff9dbb4e320000
[  103.767263] RIP: 0010:kfree+0x11c/0x160
[  103.768601] RSP: 0018:ffff9dbb4e323cb0 EFLAGS: 00010246
[  103.769941] RAX: fffffa5b3cff0020 RBX: ffff8eb000000000 RCX: 0000000000000000
[  103.771301] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000718ec0000000
[  103.772663] RBP: ffff9dbb4e323cc8 R08: dead000000000100 R09: ffffffff985ed7a8
[  103.774049] R10: fffffa5b3d000000 R11: 0000000000000000 R12: 0000000000000028
[  103.775426] R13: ffffffff97eead09 R14: 000000000000000a R15: ffffffff977143f0
[  103.776809] FS:  00007f1e1c29f700(0000) GS:ffff8e90bfc00000(0000) knlGS:0000000000000000
[  103.778214] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.779645] CR2: 000055be9d7243a8 CR3: 0000003ff74a3000 CR4: 00000000001406f0
[  103.781094] Call Trace:
[  103.782527]  free_cpumask_var+0x9/0x10
[  103.783961]  smpcfd_dead_cpu+0x24/0x40
[  103.785415]  cpuhp_invoke_callback+0x84/0x3b0
[  103.786859]  ? flow_cache_lookup+0x4c0/0x4c0
[  103.788303]  cpuhp_down_callbacks+0x42/0x80
[  103.789745]  _cpu_down+0xc2/0x100
[  103.791191]  do_cpu_down+0x33/0x50
[  103.792624]  cpu_down+0x10/0x20
[  103.794056]  cpu_subsys_offline+0x14/0x20
[  103.795492]  device_offline+0x73/0xc0
[  103.796926]  online_store+0x4c/0xa0
[  103.798351]  dev_attr_store+0x18/0x30
[  103.799779]  sysfs_kf_write+0x37/0x40
[  103.801201]  kernfs_fop_write+0x11c/0x1a0
[  103.802634]  __vfs_write+0x18/0x40
[  103.804065]  vfs_write+0xb1/0x1a0
[  103.805485]  SyS_write+0x55/0xc0
[  103.806888]  entry_SYSCALL_64_fastpath+0x1e/0xa9
[  103.808310] RIP: 0033:0x7f1e1be7f4a0
[  103.809730] RSP: 002b:00007ffc4ead2768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  103.811181] RAX: ffffffffffffffda RBX: 0000000001d8b410 RCX: 00007f1e1be7f4a0
[  103.812648] RDX: 0000000000000002 RSI: 0000000001ea1060 RDI: 0000000000000003
[  103.814122] RBP: 0000000000a3e020 R08: 0000000000000000 R09: 0000000000000001
[  103.815600] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[  103.817048] R13: 0000000000501520 R14: 00007ffc4ead2bd0 R15: 00007f1e1ad98240
[  103.818475] Code: 08 49 83 c4 18 48 89 da 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 e9 0e ff ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 20 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 04 41 8b 72 6c 4c 89 d7 e8 2c 
[  103.821390] RIP: kfree+0x11c/0x160 RSP: ffff9dbb4e323cb0
[  103.822826] ---[ end trace 7c1d545f713a5ad1 ]---
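
The trace above enters the kernel through a write() from python3 that reaches online_store and cpu_down, which matches toggling a CPU through its sysfs online file. For anyone who wants to reproduce the offline/online cycle, here is a minimal sketch. It is not the cpu_offlining script used in these tests; the function names are hypothetical, and only the /sys/devices/system/cpu/cpuN/online interface comes from the kernel. It needs root, and on an affected kernel it may hang the machine.

```python
from pathlib import Path

# Standard sysfs location of the per-CPU directories.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def hotpluggable_cpus(root=SYSFS_CPU_ROOT):
    """Return the sorted CPU numbers that expose an 'online' control file.

    CPUs without that file (often CPU 0) cannot be taken offline.
    """
    cpus = []
    for entry in root.glob("cpu[0-9]*"):
        if (entry / "online").exists():
            cpus.append(int(entry.name[3:]))
    return sorted(cpus)


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write '1' or '0' to the CPU's online file."""
    (root / f"cpu{cpu}" / "online").write_text("1" if online else "0")


def cycle_cpus(root=SYSFS_CPU_ROOT):
    """Take every hot-pluggable CPU offline, then bring each back online."""
    cpus = hotpluggable_cpus(root)
    for cpu in cpus:
        set_cpu_online(cpu, False, root)
    for cpu in cpus:
        set_cpu_online(cpu, True, root)
    return cpus

# Example (as root, on a machine you can afford to hang):
#     for run in range(8):
#         cycle_cpus()
```

The root parameter exists only so the functions can be pointed at a scratch directory instead of the real sysfs tree when trying them out.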

Joseph Salisbury (jsalisbury) wrote on 2018-01-08:

#39

The bisect reported the following as the first bad commit:
commit ac2fc5adab0f4b83f01214af61c8478c6ef186f9
Author: Vikas Shivappa <email address hidden>
Date: Tue Aug 15 18:00:43 2017 -0700
x86/intel_rdt/cqm: Improve limbo list processing

I built a test kernel with a revert of ac2fc5adab0.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Rod Smith (rodsmith) wrote on 2018-01-08:

#40

I'm afraid that one fails, too, on the second run when bringing CPU 10 back online. Here's the dmesg output:

[  154.987312] smpboot: Booting Node 1 Processor 10 APIC 0x14
[  154.992953] BUG: unable to handle kernel paging request at 0000317865646e69
[  154.993932] IP: __kmalloc_track_caller+0x97/0x1f0
[  154.994847] PGD 0 
[  154.994848] P4D 0

[  154.997397] Oops: 0000 [#1] SMP
[  154.998250] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds ipmi_ssif irqbypass mac_hid ipmi_si shpchp intel_cstate intel_rapl_perf acpi_power_meter ipmi_devintf acpi_pad mei_me lpc_ich ipmi_msghandler mei ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas mgag200 ttm fnic hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_kms_helper pcbc usbhid syscopyarea igb sysfillrect libfcoe aesni_intel sysimgblt dca fb_sys_fops i2c_algo_bit aes_x86_64 hid crypto_simd glue_helper libfc ptp mxm_wmi ahci drm cryptd
[  155.005714]  libahci pps_core scsi_transport_fc enic megaraid_sas wmi
[  155.006913] CPU: 10 PID: 69 Comm: cpuhp/10 Not tainted 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4
[  155.008154] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  155.009427] task: ffff91c7b8785d00 task.stack: ffffa8760c7e8000
[  155.010718] RIP: 0010:__kmalloc_track_caller+0x97/0x1f0
[  155.012014] RSP: 0000:ffffa8760c7ebc48 EFLAGS: 00010206
[  155.013308] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000014b9
[  155.014618] RDX: 00000000000014b8 RSI: 0000000000000000 RDI: 000000000001f3e0
[  155.015946] RBP: ffffa8760c7ebc80 R08: ffff91c7bf29f3e0 R09: ffff91a7bf807c00
[  155.017284] R10: ffffa8760c7ebce0 R11: 0000000000000006 R12: 0000317865646e69
[  155.018620] R13: 00000000014000c0 R14: 0000000000000007 R15: ffff91a7bf807c00
[  155.019965] FS:  0000000000000000(0000) GS:ffff91c7bf280000(0000) knlGS:0000000000000000
[  155.021329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  155.022710] CR2: 0000317865646e69 CR3: 0000000ec6c09000 CR4: 00000000001406e0
[  155.024101] Call Trace:
[  155.025490]  ? kvasprintf_const+0x45/0xa0
[  155.026906]  kvasprintf+0x66/0xd0
[  155.028304]  kvasprintf_const+0x45/0xa0
[  155.029703]  kobject_set_name_vargs+0x23/0x90
[  155.031101]  cpu_device_create+0xa4/0x100
[  155.032485]  ? smp_call_function_single+0xb9/0xe0
[  155.033891]  cacheinfo_cpu_online+0x2ac/0x400
[  155.035295]  ? get_cpu_cacheinfo+0x50/0x50
[  155.036709]  cpuhp_invoke_callback+0x84/0x3b0
[  155.038101]  cpuhp_up_callbacks+0x36/0xc0
[  155.039513]  cpuhp_thread_fun+0xd4/0xe0
[  155.040923]  smpboot_thread_fn+0xec/0x160
[  155.042319]  kthread+0x125/0x140
[  155.043706]  ? sort_range+0x30/0x30
[  155.045107]  ? kthread_create_on_node+0x70/0x70
[  155.046515]  ret_from_fork+0x25/0x30
[  155.047906] Code: 08 65 4c 03 05 ab e5 7d 5b 49 83 78 10 00 4d 8b 20 0f 84 ef 00 00 00 4d 85 e4 0f 84 e6 00 00 00 49 63 41 20 49 8b 39 48 8d 4a 01 <49> 8b 1c 04 4c 89 e0 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63 
[  155.050922] RIP: __kmalloc_track_caller+0x97/0x1f0 RSP: ffffa8760c7ebc48
[  155.052426] CR2: 0000317865646e69
[  155.053914] ---[ end trace f7bb4aa3c197a453 ]---

To be sure, here's the kernel version information:

$ uname -a
Linux oil-boldore 4.13.0-13-generic #14~lp1733662Commitac2fc5adab0f4 SMP Fri Jan 5 15:31:13 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote on 2018-01-08:

#41

The uname looks like you may still be running the kernel from comment #37. The test kernel with the revert should have a name like:

linux-image-4.13.0-21-generic_4.13.0-21.24~lp1733662Revert_amd64

The string "Revert" should be in the uname output.
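
To avoid testing the wrong kernel when several are installed, a test script could check for the marker before it starts. This is only a sketch: kernel_has_marker and running_kernel_description are hypothetical helper names, and the marker string is whatever suffix the test build carries ("Revert" in this case).

```python
import platform


def kernel_has_marker(uname_text, marker):
    """Return True if the marker string appears in the given uname text."""
    return marker in uname_text


def running_kernel_description():
    """Build a string roughly equivalent to `uname -a` for the running kernel."""
    u = platform.uname()
    return f"{u.system} {u.node} {u.release} {u.version} {u.machine}"


# Example: stop early if the booted kernel is not the revert build.
#     if not kernel_has_marker(running_kernel_description(), "Revert"):
#         raise SystemExit("Not running the revert test kernel")
```

On these Ubuntu test builds the suffix sits in the version field (the "#24~lp1733662Revert" part), which is why the sketch joins the release and version fields instead of checking only the release.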

Rod Smith (rodsmith) wrote on 2018-01-08:

#42

You're right. (I've got too many kernels installed on that system!) When I tested again, it got through eight runs without problems, beyond the "error -19" message. Here's the uname information, just to be sure:

$ uname -a
Linux oil-boldore 4.13.0-21-generic #24~lp1733662Revert SMP Mon Jan 8 15:35:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Joseph Salisbury (jsalisbury) wrote on 2018-01-10:

#43

Thanks for the update. I'll ping the author of mainline commit 24247aeeabe99eab to get some feedback.

Before I do that, can you confirm the bug still exists with the latest mainline kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc7/

Rod Smith (rodsmith) wrote on 2018-01-10:

#44

Yes, it still exists. To confirm the kernel version:

$ uname -a
Linux oil-boldore 4.15.0-041500rc7-generic #201801072330 SMP Sun Jan 7 23:31:29 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The system hung bringing CPU 11 back online, with the following dmesg output:

[  101.179624] smpboot: Booting Node 1 Processor 11 APIC 0x16
[  101.727507] general protection fault: 0000 [#1] SMP PTI
[  101.727812] Modules linked in: nls_iso8859_1 intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds irqbypass ipmi_ssif intel_cstate intel_rapl_perf ipmi_si acpi_power_meter mei_me shpchp ipmi_devintf ipmi_msghandler mei lpc_ich mac_hid acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul mgag200 crc32_pclmul ttm ghash_clmulni_intel pcbc usbhid ses igb drm_kms_helper enclosure dca syscopyarea sysfillrect aesni_intel scsi_transport_sas hid fnic aes_x86_64 sysimgblt ptp crypto_simd libfcoe fb_sys_fops glue_helper ahci pps_core mxm_wmi
[  101.730450]  cryptd libfc libahci i2c_algo_bit drm scsi_transport_fc enic megaraid_sas wmi
[  101.730883] CPU: 6 PID: 3205 Comm: python3 Not tainted 4.15.0-041500rc7-generic #201801072330
[  101.731319] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  101.731773] RIP: 0010:__kmalloc_node+0x16a/0x2c0
[  101.732224] RSP: 0018:ffffa7d0cf86bbe0 EFLAGS: 00010206
[  101.732682] RAX: 0000000000000000 RBX: 3b37355eb8b32f18 RCX: 0000000000000349
[  101.733146] RDX: 0000000000000348 RSI: 0000000000000000 RDI: 0000000000027040
[  101.733609] RBP: ffffa7d0cf86bc20 R08: ffff94818ede9cdc R09: ffff9461bf807c00
[  101.734075] R10: ffffffffaaa16cc0 R11: c4c8a1df366db3c4 R12: 00000000014080c0
[  101.734547] R13: 0000000000000008 R14: ffff94818ede9cdc R15: ffff9461bf807c00
[  101.735023] FS:  00007f8b0a2c2700(0000) GS:ffff9461bfd80000(0000) knlGS:0000000000000000
[  101.735510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.735997] CR2: 000056075d0c11a8 CR3: 0000001fe0b32003 CR4: 00000000001606e0
[  101.736491] Call Trace:
[  101.736988]  ? alloc_cpumask_var_node+0x1f/0x30
[  101.737488]  ? on_each_cpu_cond+0x140/0x140
[  101.737986]  alloc_cpumask_var_node+0x1f/0x30
[  101.738489]  zalloc_cpumask_var_node+0xf/0x20
[  101.738988]  smpcfd_prepare_cpu+0x46/0xc0
[  101.739493]  cpuhp_invoke_callback+0x9b/0x550
[  101.740012]  ? init_idle+0x179/0x190
[  101.740515]  _cpu_up+0xb1/0x180
[  101.741017]  do_cpu_up+0x8b/0xb0
[  101.741515]  cpu_up+0x13/0x20
[  101.742012]  cpu_subsys_online+0x3d/0x90
[  101.742510]  device_online+0x4a/0x90
[  101.743010]  online_store+0x89/0xa0
[  101.743506]  dev_attr_store+0x18/0x30
[  101.744003]  sysfs_kf_write+0x37/0x40
[  101.744501]  kernfs_fop_write+0x11c/0x1a0
[  101.744998]  __vfs_write+0x37/0x170
[  101.745494]  ? common_file_perm+0x50/0x140
[  101.745994]  ? apparmor_file_permission+0x1a/0x20
[  101.746495]  ? security_file_permission+0x3b/0xc0
[  101.746993]  ? _cond_resched+0x19/0x40
[  101.747490]  vfs_write+0xb1/0x1a0
[  101.747988]  SyS_write+0x55/0xc0
[  101.748481]  entry_SYSCALL_64_fastpath+0x1e/0x81
[  101.748977] RIP: 0033:0x7f8b09ea24a0
[  101.749474] RSP: 002b:00007ffd5e22ba78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  101.749996] RAX: ffffffffffffffda RBX: 0000000001671410 RCX: 00007f8b09ea24a0
[  101.750490] RDX: 0000000000000002 RSI: 0000000001789410 RDI: 0000000000000003
[  101.750971] RBP: 0000000000a3e020 R08: 0000000000000000 R09: 0000000000000001
[  101.751436] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000003
[  101.751885] R13: 0000000000501520 R14: 00007ffd5e22bee0 R15: 00007f8b08dbb240
[  101.752318] Code: 8b 18 4d 33 9f 40 01 00 00 4c 89 c3 4c 31 db 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 0e ff ff ff 4d 39 d8 74 14 49 63 47 20 48 01 c3 <48> 33 1b 49 33 9f 40 01 00 00 0f 18 0b 41 f7 c4 00 80 00 00 0f 
[  101.753201] RIP: __kmalloc_node+0x16a/0x2c0 RSP: ffffa7d0cf86bbe0
[  101.753642] ---[ end trace ef4dea51947a8216 ]---

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-01-11:

#45

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe (Ubuntu Artful):
status:	New → Confirmed
Changed in linux-hwe (Ubuntu):
status:	New → Confirmed

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-12: [REGRESSION][v4.14.y][v4.15] x86/intel_rdt/cqm: Improve limbo list processing

#47

Hi Vikas,

A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
Author: Vikas Shivappa <email address hidden>
Date: Tue Aug 15 18:00:43 2017 -0700

x86/intel_rdt/cqm: Improve limbo list processing

The regression was introduced as of v4.14-rc1 and still exists with
current mainline. The trace with v4.15-rc7 is in comment #44[1].

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?

Thanks,

Joe
[0] http://pad.lv/1733662
[1]
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/44

summary:

- System hang with Linux kernel 4.13, not with 4.10
+ System hang with Linux kernel due to mainline commit 24247aeeabe

Revision history for this message

tglx (tglx) wrote on 2018-01-14:

#48

On Fri, 12 Jan 2018, Joseph Salisbury wrote:

> Hi Vikas,
>
> A kernel bug report was opened against Ubuntu [0]. After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
>
> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> Author: Vikas Shivappa <email address hidden>
> Date: Tue Aug 15 18:00:43 2017 -0700
>
> x86/intel_rdt/cqm: Improve limbo list processing
>
>
> The regression was introduced as of v4.14-rc1 and still exists with
> current mainline. The trace with v4.15-rc7 is in comment #44[1].
>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?

That stinks like a use after free. Can you run with KASAN enabled?

Thanks,

tglx

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-15:

#49

Hi Rod,

I built an Artful test kernel with KASAN enabled.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test this kernel as requested by upstream?

Revision history for this message

tglx (tglx) wrote on 2018-01-16:

#50

Vikas, Fenghua can you please look at that ASAP?

On Sun, 14 Jan 2018, Thomas Gleixner wrote:

> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>
> > Hi Vikas,
> >
> > A kernel bug report was opened against Ubuntu [0]. After a kernel
> > bisect, it was found that reverting the following commit resolved this bug:
> >
> > commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> > Author: Vikas Shivappa <email address hidden>
> > Date: Tue Aug 15 18:00:43 2017 -0700
> >
> > x86/intel_rdt/cqm: Improve limbo list processing
> >
> >
> > The regression was introduced as of v4.14-rc1 and still exists with
> > current mainline. The trace with v4.15-rc7 is in comment #44[1].
> >
> > I was hoping to get your feedback, since you are the patch author. Do
> > you think gathering any additional data will help diagnose this issue,
> > or would it be best to submit a revert request?
>
> That stinks like a use after free. Can you run with KASAN enabled?
>
> Thanks,
>
> tglx

Revision history for this message

Rod Smith (rodsmith) wrote on 2018-01-16:

#51

Joseph,

The first run of your latest kernel completed; however, I noticed the following in the dmesg output:

[  426.281083] ==================================================================
[  426.286615] BUG: KASAN: use-after-free in find_first_bit+0x1f/0x80
[  426.291841] Read of size 8 at addr ffff883ff7c1e780 by task cpuhp/31/195

[  426.302209] CPU: 31 PID: 195 Comm: cpuhp/31 Not tainted 4.13.0-25-generic #29~lp1733662KASANenabled
[  426.302213] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.10c.0.032320160820 03/23/2016
[  426.302215] Call Trace:
[  426.302233]  dump_stack+0xb8/0x12d
[  426.302241]  ? dma_virt_map_sg+0xd3/0xd3
[  426.302252]  ? show_regs_print_info+0x41/0x41
[  426.302263]  print_address_description+0x6f/0x280
[  426.302269]  kasan_report+0x27a/0x370
[  426.302276]  ? find_first_bit+0x1f/0x80
[  426.302288]  __asan_load8+0x54/0x90
[  426.302295]  find_first_bit+0x1f/0x80
[  426.302306]  has_busy_rmid+0x47/0x70
[  426.302314]  intel_rdt_offline_cpu+0x4b4/0x510
[  426.302321]  ? clear_closid_rmid.isra.4+0x70/0x70
[  426.302333]  ? sysfs_remove_group+0x7a/0xc0
[  426.302339]  ? clear_closid_rmid.isra.4+0x70/0x70
[  426.302351]  cpuhp_invoke_callback+0x15f/0x7e0
[  426.302360]  ? cpuhp_kick_ap_work+0x2d0/0x2d0
[  426.302372]  ? __schedule+0x4f1/0xeb0
[  426.302377]  ? cpuhp_kick_ap_work+0x2d0/0x2d0
[  426.302385]  ? firmware_map_remove+0x1b1/0x1b1
[  426.302395]  ? migrate_swap_stop+0x2f0/0x2f0
[  426.302402]  ? firmware_map_remove+0x1b1/0x1b1
[  426.302407]  ? migrate_swap_stop+0x2f0/0x2f0
[  426.302414]  ? schedule+0xd8/0x2a0
[  426.302421]  ? __schedule+0xeb0/0xeb0
[  426.302427]  ? default_wake_function+0x2f/0x40
[  426.302439]  ? __wake_up_common+0xa1/0xc0
[  426.302446]  cpuhp_down_callbacks+0x52/0xa0
[  426.302453]  cpuhp_thread_fun+0x117/0x1a0
[  426.302459]  ? cpu_up+0x20/0x20
[  426.302468]  smpboot_thread_fn+0x20e/0x2f0
[  426.302474]  ? sort_range+0x30/0x30
[  426.302482]  kthread+0x1b7/0x1e0
[  426.302488]  ? sort_range+0x30/0x30
[  426.302493]  ? kthread_create_on_node+0xc0/0xc0
[  426.302500]  ret_from_fork+0x1f/0x30

[  426.307683] Allocated by task 56:
[  426.312817]  save_stack_trace+0x1b/0x20
[  426.312824]  save_stack+0x43/0xd0
[  426.312829]  kasan_kmalloc+0xad/0xe0
[  426.312834]  __kmalloc+0x105/0x230
[  426.312840]  intel_rdt_online_cpu+0x5a8/0x830
[  426.312846]  cpuhp_invoke_callback+0x15f/0x7e0
[  426.312850]  cpuhp_thread_fun+0x8b/0x1a0
[  426.312856]  smpboot_thread_fn+0x20e/0x2f0
[  426.312861]  kthread+0x1b7/0x1e0
[  426.312866]  ret_from_fork+0x1f/0x30

[  426.317887] Freed by task 195:
[  426.322879]  save_stack_trace+0x1b/0x20
[  426.322887]  save_stack+0x43/0xd0
[  426.322891]  kasan_slab_free+0x72/0xc0
[  426.322896]  kfree+0x94/0x1a0
[  426.322902]  intel_rdt_offline_cpu+0x17d/0x510
[  426.322908]  cpuhp_invoke_callback+0x15f/0x7e0
[  426.322912]  cpuhp_down_callbacks+0x52/0xa0
[  426.322917]  cpuhp_thread_fun+0x117/0x1a0
[  426.322925]  smpboot_thread_fn+0x20e/0x2f0
[  426.322929]  kthread+0x1b7/0x1e0
[  426.322935]  ret_from_fork+0x1f/0x30

[  426.327837] The buggy address belongs to the object at ffff883ff7c1e780
                which belongs to the cache kmalloc-8 of size 8
[  426.338289] The buggy address is located 0 bytes inside of
                8-byte region [ffff883ff7c1e780, ffff883ff7c1e788)
[  426.348805] The buggy address belongs to the page:
[  426.354223] page:ffffea00ffdf0780 count:1 mapcount:0 mapping:          (null) index:0x0
[  426.359838] flags: 0x57ffffc0000100(slab)
[  426.365373] raw: 0057ffffc0000100 0000000000000000 0000000000000000 0000000100aa00aa
[  426.371135] raw: dead000000000100 dead000000000200 ffff8817f500fb80 0000000000000000
[  426.377004] page dumped because: kasan: bad access detected

[  426.388626] Memory state around the buggy address:
[  426.394498]  ffff883ff7c1e680: fc fc 00 fc fc fb fc fc 00 fc fc fb fc fc 00 fc
[  426.400634]  ffff883ff7c1e700: fc 00 fc fc fb fc fc 00 fc fc fb fc fc fb fc fc
[  426.406721] >ffff883ff7c1e780: fb fc fc fb fc fc fb fc fc 00 fc fc fb fc fc fb
[  426.412737]                    ^
[  426.418698]  ffff883ff7c1e800: fc fc fb fc fc fb fc fc fb fc fc fb fc fc fb fc
[  426.424961]  ffff883ff7c1e880: fc 00 fc fc fb fc fc fb fc fc fb fc fc fb fc fc
[  426.431154] ==================================================================
[  426.437413] Disabling lock debugging due to kernel taint
[  426.472795] IRQ 8: no longer affine to CPU31
[  426.472806] IRQ 9: no longer affine to CPU31
[  426.472827] IRQ 40: no longer affine to CPU31
[  426.473962] smpboot: CPU 31 is now offline

I ran it several more times without any obvious errors; however, I might have missed something. (The dmesg output is quite verbose and scrolls by quickly!)
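
Since the dmesg output scrolls by quickly, one rough way to avoid missing a report is to filter the kernel log for KASAN headlines after each run. The sketch below is only an illustration: the function names `kasan_headlines` and `kasan_count` are hypothetical, and it assumes the log text is piped in on stdin (for example from `dmesg`) and that reports start with the `BUG: KASAN: ` prefix seen above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any existing tool) for spotting KASAN
# reports in kernel log text supplied on stdin.

# Print each "BUG: KASAN: ..." headline with the leading timestamp removed.
kasan_headlines() {
    sed -n 's/^\[[^]]*\] *\(BUG: KASAN: .*\)$/\1/p'
}

# Print the number of KASAN headlines (prints 0 when there are none).
kasan_count() {
    grep -c 'BUG: KASAN: ' || true
}
```

On a test machine this could be used as `dmesg | kasan_headlines` after each pass of the CPU offlining script; an empty result only means no headline matched, not that the run was clean.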

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-16:

#52

On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> Vikas on vacation until end of the month. Fenghua will look into this
> issue.
>
> On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> <mailto:<email address hidden>>> wrote:
>
>>
>> Vikas, Fenghua can you please look at that ASAP?
>>
>> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
>>
>>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
>>>
>>>> Hi Vikas,
>>>>
>>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>>>> bisect, it was found that reverting the following commit resolved
>>>> this bug:
>>>>
>>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
>>>> Author: Vikas Shivappa <<email address hidden>
>>>> <mailto:<email address hidden>>>
>>>> Date: Tue Aug 15 18:00:43 2017 -0700
>>>>
>>>> x86/intel_rdt/cqm: Improve limbo list processing
>>>>
>>>>
>>>> The regression was introduced as of v4.14-rc1 and still exists with
>>>> current mainline. The trace with v4.15-rc7 is in comment #44[1].
>>>>
>>>> I was hoping to get your feedback, since you are the patch author. Do
>>>> you think gathering any additional data will help diagnose this issue,
>>>> or would it be best to submit a revert request?
>>>
>>> That stinks like a use after free. Can you run with KASAN enabled?
>>>
>>> Thanks,
>>>
>>> tglx

Here is some data with KASAN enabled:
https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51

Are there any specific logs you would like to see, or specific actions
executed?

Thanks,

Joe

Revision history for this message

tglx (tglx) wrote on 2018-01-16:

#53

On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > Vikas on vacation until end of the month. Fenghua will look into this
> > issue.
> >
> > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > <mailto:<email address hidden>>> wrote:
> >
> >>
> >> Vikas, Fenghua can you please look at that ASAP?
> >>
> >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> >>
> >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> >>>
> >>>> Hi Vikas,
> >>>>
> >>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
> >>>> bisect, it was found that reverting the following commit resolved
> >>>> this bug:
> >>>>
> >>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> >>>> Author: Vikas Shivappa <<email address hidden>
> >>>> <mailto:<email address hidden>>>
> >>>> Date: Tue Aug 15 18:00:43 2017 -0700
> >>>>
> >>>> x86/intel_rdt/cqm: Improve limbo list processing
> >>>>
> >>>>
> >>>> The regression was introduced as of v4.14-rc1 and still exists with
> >>>> current mainline. The trace with v4.15-rc7 is in comment #44[1].
> >>>>
> >>>> I was hoping to get your feedback, since you are the patch author. Do
> >>>> you think gathering any additional data will help diagnose this issue,
> >>>> or would it be best to submit a revert request?
> >>>
> >>> That stinks like a use after free. Can you run with KASAN enabled?
> >>>
> >>> Thanks,
> >>>
> >>> tglx
>
>
> Here is some data with KASAN enabled:
> https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51
>
> Are there any specific logs you would like to see, or specific actions
> executed?

No, the KASAN output is pretty clear where the issue is.

Thanks,

tglx

Revision history for this message

Fenghua Yu (fyu) wrote on 2018-01-16:

#54

> From: Thomas Gleixner [mailto:<email address hidden>]
> On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> > On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > > Vikas on vacation until end of the month. Fenghua will look into
> > > this issue.
> > >
> > > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner <<email address hidden>
> > > <mailto:<email address hidden>>> wrote:
> > >
> > >>
> > >> Vikas, Fenghua can you please look at that ASAP?
> > >>
> > >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> > >>
> > >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> > >>>
> > >>>> Hi Vikas,
> > >>>>
> > >>>> A kernel bug report was opened against Ubuntu [0]. After a
> > >>>> kernel bisect, it was found that reverting the following commit
> > >>>> resolved this bug:
> > >>>>
> > >>>> commit 24247aeeabe99eab13b798ccccc2dec066dd6f07
> > >>>> Author: Vikas Shivappa <<email address hidden>
> > >>>> <mailto:<email address hidden>>>
> > >>>> Date: Tue Aug 15 18:00:43 2017 -0700
> > >>>>
> > >>>> x86/intel_rdt/cqm: Improve limbo list processing
> > >>>>
> > >>>>
> > >>>> The regression was introduced as of v4.14-rc1 and still exists
> > >>>> with current mainline. The trace with v4.15-rc7 is in comment #44[1].
> > >>>>
> > >>>> I was hoping to get your feedback, since you are the patch
> > >>>> author. Do you think gathering any additional data will help
> > >>>> diagnose this issue, or would it be best to submit a revert request?
> > >>>
> > >>> That stinks like a use after free. Can you run with KASAN enabled?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> tglx
> >
> >
> > Here is some data with KASAN enabled:
> > https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1733662/comments/51
> >
> > Are there any specific logs you would like to see, or specific actions
> > executed?
>
> No, the KASAN output is pretty clear where the issue is.
>
> Thanks,
>
> tglx

Is this a Haswell-specific issue?

I have been running the following test in an endless loop without issue on Broadwell with 4.15.0-rc6 and rdt mounted:
for ((;;)) do
        for ((i=1;i<88;i++)) do
                echo 0 >/sys/devices/system/cpu/cpu$i/online
        done
        echo "online cpus:"
        grep processor /proc/cpuinfo |wc
        for ((i=1;i<88;i++)) do
                echo 1 >/sys/devices/system/cpu/cpu$i/online
        done
        echo "online cpus:"
        grep processor /proc/cpuinfo|wc
done

I'm looking for a Haswell system to reproduce the issue.

Thanks.

-Fenghua
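
For anyone wanting to rerun a loop like the one above on a machine with a different CPU count, a parameterized single pass might look like the following. This is only a sketch under stated assumptions: `set_cpus`, `hotplug_pass` and the `CPU_SYSFS_ROOT` override are hypothetical names introduced here (the override exists so the logic can be dry-run against a scratch directory), CPU 0 is left alone as in the original loop, and on a real system the writes need root.

```shell
#!/bin/sh
# Hypothetical sketch: one offline/online pass over CPUs 1..N-1.
# CPU_SYSFS_ROOT is an illustrative override for dry runs; by default the
# real sysfs CPU directory is used.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# set_cpus STATE NCPUS: write STATE (0 = offline, 1 = online) to each
# cpuN/online file for N in 1..NCPUS-1, skipping files that do not exist
# or are not writable.
set_cpus() {
    state=$1
    ncpus=$2
    i=1
    while [ "$i" -lt "$ncpus" ]; do
        f="$CPU_SYSFS_ROOT/cpu$i/online"
        if [ -w "$f" ]; then
            echo "$state" > "$f" || echo "cpu$i: write failed" >&2
        fi
        i=$((i + 1))
    done
}

# hotplug_pass NCPUS: take all secondary CPUs offline, then bring them back.
hotplug_pass() {
    set_cpus 0 "$1"
    set_cpus 1 "$1"
}
```

Repeating the test would then be `while :; do hotplug_pass 88; done`, with 88 replaced by the CPU count of the machine under test.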

Revision history for this message

tglx (tglx) wrote on 2018-01-16:

#55

On Tue, 16 Jan 2018, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:<email address hidden>]
> Is this a Haswell specific issue?
>
> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
> for ((;;)) do
>         for ((i=1;i<88;i++)) do
>                 echo 0 >/sys/devices/system/cpu/cpu$i/online
>         done
>         echo "online cpus:"
>         grep processor /proc/cpuinfo |wc
>         for ((i=1;i<88;i++)) do
>                 echo 1 >/sys/devices/system/cpu/cpu$i/online
>         done
>         echo "online cpus:"
>         grep processor /proc/cpuinfo|wc
> done
>
> I'm finding a Haswell to reproduce the issue.

Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.

You simply do not run into it because on your machine

is_llc_occupancy_enabled() is false...

Thanks,

tglx

8<--------------------

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 88dcf8479013..99442370de40 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 		 */
 		if (static_branch_unlikely(&rdt_mon_enable_key))
 			rmdir_mondata_subdir_allrdtgrp(r, d->id);
-		kfree(d->ctrl_val);
-		kfree(d->rmid_busy_llc);
-		kfree(d->mbm_total);
-		kfree(d->mbm_local);
 		list_del(&d->list);
 		if (is_mbm_enabled())
 			cancel_delayed_work(&d->mbm_over);
@@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 			cancel_delayed_work(&d->cqm_limbo);
 		}
 
+		kfree(d->ctrl_val);
+		kfree(d->rmid_busy_llc);
+		kfree(d->mbm_total);
+		kfree(d->mbm_local);
 		kfree(d);
 		return;
 	}
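
The remark that the bug is only hit when is_llc_occupancy_enabled() is true can be checked very roughly from user space before running the hotplug test. The sketch below rests on an assumption: that the kernel lists LLC occupancy monitoring support as the `cqm_occup_llc` flag in /proc/cpuinfo. The flag is only a proxy for the in-kernel check, and `has_llc_occupancy` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "yes" if the first "flags" line of a
# cpuinfo-style file contains the cqm_occup_llc flag, otherwise "no".
# Assumption: cqm_occup_llc is the flag name the kernel uses for LLC
# occupancy monitoring; this is a proxy for is_llc_occupancy_enabled(),
# not the same check.
has_llc_occupancy() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Split the flags line into one token per line and look for an exact
    # match, so that cqm_llc alone does not count.
    if grep '^flags' "$cpuinfo" | head -n 1 | tr ' \t' '\n\n' \
            | grep -qx 'cqm_occup_llc'; then
        echo yes
    else
        echo no
    fi
}
```

Calling `has_llc_occupancy` with no argument reads /proc/cpuinfo; a "no" would be consistent with a machine on which the use-after-free path is never reached.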

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-17:

#56

On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>>         for ((i=1;i<88;i++)) do
>>                 echo 0 >/sys/devices/system/cpu/cpu$i/online
>>         done
>>         echo "online cpus:"
>>         grep processor /proc/cpuinfo |wc
>>         for ((i=1;i<88;i++)) do
>>                 echo 1 >/sys/devices/system/cpu/cpu$i/online
>>         done
>>         echo "online cpus:"
>>         grep processor /proc/cpuinfo|wc
>> done
>>
>> I'm finding a Haswell to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_occupancy_enabled() is false...
>
> Thanks,
>
> tglx
>
> 8<--------------------
>
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 88dcf8479013..99442370de40 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  		 */
>  		if (static_branch_unlikely(&rdt_mon_enable_key))
>  			rmdir_mondata_subdir_allrdtgrp(r, d->id);
> -		kfree(d->ctrl_val);
> -		kfree(d->rmid_busy_llc);
> -		kfree(d->mbm_total);
> -		kfree(d->mbm_local);
>  		list_del(&d->list);
>  		if (is_mbm_enabled())
>  			cancel_delayed_work(&d->mbm_over);
> @@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  			cancel_delayed_work(&d->cqm_limbo);
>  		}
>  
> +		kfree(d->ctrl_val);
> +		kfree(d->rmid_busy_llc);
> +		kfree(d->mbm_total);
> +		kfree(d->mbm_local);
>  		kfree(d);
>  		return;
>  	}

Thanks, Thomas. I'll build some test kernels and have your patch tested
out.

Thanks,

Joe

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-17:

#57

I built Artful and mainline test kernels with the patch from tglx. The test kernels can be downloaded from:

Artful: http://kernel.ubuntu.com/~jsalisbury/lp1733662/artful
mainline: http://kernel.ubuntu.com/~jsalisbury/lp1733662/mainline

Can you test these kernels out and see if they resolve the bug?

Revision history for this message

Rod Smith (rodsmith) wrote on 2018-01-17:

#58

That seems to have fixed it! I've run the test script six or seven times on both kernels, with nary a hiccup (aside from the "error -19" messages with the 4.13 kernel). Below is the reported kernel information from both your builds, just to be sure I booted the correct kernels.

$ uname -a
Linux oil-boldore 4.13.0-25-generic #29~lp1733662PatchFromUpstream SMP Wed Jan 17 20:13:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ uname -a
Linux oil-boldore 4.15.0-041500rc8-generic #201801172011 SMP Wed Jan 17 20:13:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-17:

#59

On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Yu, Fenghua wrote:
>>> From: Thomas Gleixner [mailto:<email address hidden>]
>> Is this a Haswell specific issue?
>>
>> I run the following test forever without issue on Broadwell and 4.15.0-rc6 with rdt mounted:
>> for ((;;)) do
>>         for ((i=1;i<88;i++)) do
>>                 echo 0 >/sys/devices/system/cpu/cpu$i/online
>>         done
>>         echo "online cpus:"
>>         grep processor /proc/cpuinfo |wc
>>         for ((i=1;i<88;i++)) do
>>                 echo 1 >/sys/devices/system/cpu/cpu$i/online
>>         done
>>         echo "online cpus:"
>>         grep processor /proc/cpuinfo|wc
>> done
>>
>> I'm finding a Haswell to reproduce the issue.
> Come on. This is crystal clear from the KASAN trace. And the fix is simple enough.
>
> You simply do not run into it because on your machine
>
> is_llc_occupancy_enabled() is false...
>
> Thanks,
>
> tglx
>
> 8<--------------------
>
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 88dcf8479013..99442370de40 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -525,10 +525,6 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  		 */
>  		if (static_branch_unlikely(&rdt_mon_enable_key))
>  			rmdir_mondata_subdir_allrdtgrp(r, d->id);
> -		kfree(d->ctrl_val);
> -		kfree(d->rmid_busy_llc);
> -		kfree(d->mbm_total);
> -		kfree(d->mbm_local);
>  		list_del(&d->list);
>  		if (is_mbm_enabled())
>  			cancel_delayed_work(&d->mbm_over);
> @@ -545,6 +541,10 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
>  			cancel_delayed_work(&d->cqm_limbo);
>  		}
>  
> +		kfree(d->ctrl_val);
> +		kfree(d->rmid_busy_llc);
> +		kfree(d->mbm_total);
> +		kfree(d->mbm_local);
>  		kfree(d);
>  		return;
>  	}

Hi Thomas,

Testing of your patch shows that your patch resolves the bug. Thanks
for the assistance! Is this something you could submit to mainline?

Thanks,

Joe

Revision history for this message

tglx (tglx) wrote on 2018-01-17:

#60

On Wed, 17 Jan 2018, Joseph Salisbury wrote:
> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>
> Testing of your patch shows that your patch resolves the bug. Thanks
> for the assistance! Is this something you could submit to mainline?

Already there :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d47924417319e3b6a728c0b690f183e75bc2a702

Tagged for stable.

Thanks,

tglx

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-17:

#61

On 01/17/2018 05:55 PM, Thomas Gleixner wrote:
> On Wed, 17 Jan 2018, Joseph Salisbury wrote:
>> On 01/16/2018 01:59 PM, Thomas Gleixner wrote:
>>
>> Testing of your patch shows that your patch resolves the bug. Thanks
>> for the assistance! Is this something you could submit to mainline?
> Already there :)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d47924417319e3b6a728c0b690f183e75bc2a702
>
> Tagged for stable.
>
> Thanks,
>
> tglx

Thanks so much!

Joseph Salisbury (jsalisbury) on 2018-01-18

no longer affects:	linux-hwe (Ubuntu)
no longer affects:	linux-hwe (Ubuntu Artful)
no longer affects:	linux-hwe (Ubuntu Bionic)

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-18:

#62

I built one last Artful test kernel with the patch tglx submitted to mainline. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1733662

Can you test this kernel and confirm it resolves the bug?

Revision history for this message

Rod Smith (rodsmith) wrote on 2018-01-18:

#63

I ran it half a dozen times with your latest kernel and it seemed fine, aside from the usual "error -19" messages. To be sure it's the right one, here's the kernel version information:

ubuntu@oil-boldore:~$ uname -a
Linux oil-boldore 4.13.0-25-generic #29~lp1733662PatchInMainline SMP Thu Jan 18 15:58:13 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
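
For readers who want to repeat this kind of run, the CPU offline/online cycle can be sketched roughly as follows. This is modelled on the script fragment quoted earlier in this bug and on the standard sysfs "online" control file; it is not the test script actually used here. The function name cycle_cpus and the CPU_SYSFS_ROOT override are hypothetical, added so the loop can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Offline and then re-online every hot-pluggable CPU below a sysfs root.
# CPU_SYSFS_ROOT is a hypothetical override for dry runs; by default the
# real sysfs tree is used, which requires root.
cycle_cpus() {
    root=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}
    for state in 0 1; do
        for ctl in "$root"/cpu[0-9]*/online; do
            # cpu0 often has no "online" file; skip anything not writable.
            [ -w "$ctl" ] || continue
            echo "$state" > "$ctl"
        done
    done
}
```

On a real system, run cycle_cpus as root and compare the output of grep -c processor /proc/cpuinfo before and after, much as the quoted script does. On a kernel without the fix this may hang the machine, so use a test system.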

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-01-19:

#64

SRU request submitted for Artful and Bionic.

https://lists.ubuntu.com/archives/kernel-team/2018-January/089403.html

description:

updated

Seth Forshee (sforshee) on 2018-01-19

Changed in linux (Ubuntu Bionic):
status:	In Progress → Fix Committed

Khaled El Mously (kmously) on 2018-02-04

Changed in linux (Ubuntu Artful):
status:	In Progress → Fix Committed

Revision history for this message

Per Allansson (per-allansson) wrote on 2018-03-05:

#65

I have similar issues on 16.04.4 with the latest HWE kernel, and when double-checking against the source code I can see that this fix is now AWOL from:

linux-image-4.13.0-36-generic 4.13.0-36.40~16.04.1
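
One rough way to double-check a source tree for the fix, as a sketch only: the patch quoted earlier in this bug moves the kfree() calls in domain_remove_cpu() so that they come after cancel_delayed_work(&d->cqm_limbo). The helper below compares the line numbers of those two statements in a given file. The name rdt_fix_present is hypothetical, and looking only at the first occurrence of each line is a heuristic assumption, not something taken from the kernel or from Ubuntu tooling.

```shell
#!/bin/sh
# Heuristic check: does the first kfree(d->ctrl_val) in the given file come
# after the first cancel_delayed_work(&d->cqm_limbo)? That ordering is what
# the patch quoted earlier in this bug establishes in domain_remove_cpu().
# Returns 0 if the ordering looks patched, 1 if it looks unpatched,
# 2 if either line cannot be found.
rdt_fix_present() {
    src=$1
    kfree_line=$(grep -n -F 'kfree(d->ctrl_val);' "$src" | head -n 1 | cut -d: -f1)
    cancel_line=$(grep -n -F 'cancel_delayed_work(&d->cqm_limbo);' "$src" | head -n 1 | cut -d: -f1)
    if [ -z "$kfree_line" ] || [ -z "$cancel_line" ]; then
        return 2
    fi
    [ "$kfree_line" -gt "$cancel_line" ]
}
```

Run it from the top of an unpacked kernel source tree, for example: rdt_fix_present arch/x86/kernel/cpu/intel_rdt.c && echo "fix present". The path is the one named in the patch. An exit status of 2 means the check could not find both lines and says nothing either way.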

Joseph Salisbury (jsalisbury) on 2018-03-07

Changed in linux (Ubuntu Artful):
status:	Fix Committed → In Progress

Joseph Salisbury (jsalisbury) on 2018-03-08

Changed in linux (Ubuntu Artful):
status:	In Progress → Fix Committed

Revision history for this message

Stefan Bader (smb) wrote on 2018-03-19:

#66

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: verification-needed-artful

Revision history for this message

Rod Smith (rodsmith) wrote on 2018-03-20:

#67

I've tested kernel 4.13.0-38-generic #43-Ubuntu from artful-proposed and the problem does not occur with that kernel.

tags:

added: verification-done-artful
removed: verification-needed-artful

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-04-03:

#68

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

* linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

* Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

* [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

* fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

* headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

* CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

* i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

* hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

* EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

* [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

* DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

* [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

* ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

* support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

* lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

* Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

* ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net: hns: add ACPI mode support for ethtool -p

* CVE-2017-17807
    - KEYS: add missing permission check for request_key() destination

* [Artful SRU] Fix capsule update regression (LP: #1746019)
    - efi/capsule-loader: Reinstate virtual capsule mapping

* [Artful/Bionic] [Config] enable EDAC_GHES for ARM64 (LP: #1747746)
    - Ubuntu: [Config] enable EDAC_GHES for ARM64

* linux-tools: perf incorrectly linking libbfd (LP: #1748922)
    - SAUCE: tools -- add ability to disable libbfd
    - [Packaging] correct disablement of libbfd

* Cherry pick c96f5471ce7d for delayacct fix (LP: #1747769)
    - delayacct: Account blkio completion on the correct task

* Error in CPU frequency reporting when nominal and min pstates are same
    (cpufreq) (LP: #1746174)
    - cpufreq: powernv: Dont assume distinct pstate values for nominal and pmin

* retpoline abi files are empty on i386 (LP: #1751021)
    - [Packaging] retpoline-extract -- instantiate retpoline files for i386
    - [Packaging] final-checks -- sanity checking ABI contents
    - [Packaging] final-checks -- check for empty retpoline files

* [P9,Power NV][WSP][Ubuntu 1804] : "Kernel access of bad area " when grouping
    different pmu events using perf fuzzer . (perf:) (LP: #1746225)
    - powerpc/perf: Fix oops when grouping different pmu events

* bnx2x_attn_int_deasserted3:4323 MC assert! (LP: #1715519) //
    CVE-2018-1000026
    - net: create skb_gso_validate_mac_len()
    - bnx2x: disable GSO where gso_size is too big for hardware

* Ubuntu16.04.03: ISAv3 initialize MMU registers before setting partition
    table (LP: #1736145)
    - powerpc/64s: Initialize ISAv3 MMU registers before setting partition table

* powerpc/powernv: Flush console before platform error reboot (LP: #1735159)
    - powerpc/powernv: Flush console before platform error reboot

* Touchpad stops working after a few seconds in Lenovo ideapad 320
    (LP: #1732056)
    - pinctrl/amd: fix masking of GPIO interrupts

* [Artful][Wyse 3040] System hang when trying to enable an offlined CPU core
    (LP: #1736393)
    - SAUCE: drm/i915:Don't set chip specific data
    - SAUCE: drm/i915: make previous commit affects Wyse 3040 only

* ppc64el: Do not call ibm,os-term on panic (LP: #1736954)
    - powerpc: Do not call ppc_md.panic in fadump panic notifier

* Artful update to 4.13.16 stable release (LP: #1744213)
    - tcp_nv: fix division by zero in tcpnv_acked()
    - net: vrf: correct FRA_L3MDEV encode type
    - tcp: do not mangle skb->cb[] in tcp_make_synack()
    - net: systemport: Correct IPG length settings
    - netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
    - l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6
    - bonding: discard lowest hash bit for 802.3ad layer3+4
    - net: cdc_ether: fix divide by 0 on bad descriptors
    - net: qmi_wwan: fix divide by 0 on bad descriptors
    - qmi_wwan: Add missing skb_reset_mac_header-call
    - net: usb: asix: fill null-ptr-deref in asix_suspend
    - tcp: gso: avoid refcount_t warning from tcp_gso_segment()
    - tcp: fix tcp_fastretrans_alert warning
    - vlan: fix a use-after-free in vlan_device_event()
    - net/mlx5: Cancel health poll before sending panic teardown command
    - net/mlx5e: Set page to null in case dma mapping fails
    - af_netlink: ensure that NLMSG_DONE never fails in dumps
    - vxlan: fix the issue that neigh proxy blocks all icmpv6 packets
    - net: cdc_ncm: GetNtbFormat endian fix
    - fealnx: Fix building error on MIPS
    - net/sctp: Always set scope_id in sctp_inet6_skb_msgname
    - ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
    - serial: omap: Fix EFR write on RTS deassertion
    - serial: 8250_fintek: Fix finding base_port with activated SuperIO
    - tpm-dev-common: Reject too short writes
    - rcu: Fix up pending cbs check in rcu_prepare_for_idle
    - ocfs2: fix cluster hang after a node dies
    - ocfs2: should wait dio before inode lock in ocfs2_setattr()
    - ipmi: fix unsigned long underflow
    - mm/page_alloc.c: broken deferred calculation
    - mm/page_ext.c: check if page_ext is not prepared
    - x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
    - coda: fix 'kernel memory exposure attempt' in fsync
    - Linux 4.13.16

* Artful update to 4.13.15 stable release (LP: #1744212)
    - media: imon: Fix null-ptr-deref in imon_probe
    - media: dib0700: fix invalid dvb_detach argument
    - crypto: dh - Fix double free of ctx->p
    - crypto: dh - Don't permit 'p' to be 0
    - crypto: dh - Don't permit 'key' or 'g' size longer than 'p'
    - USB: early: Use new USB product ID and strings for DbC device
    - USB: usbfs: compute urb->actual_length for isochronous
    - USB: Add delay-init quirk for Corsair K70 LUX keyboards
    - usb: gadget: f_fs: Fix use-after-free in ffs_free_inst
    - USB: serial: metro-usb: stop I/O after failed open
    - USB: serial: Change DbC debug device binding ID
    - USB: serial: qcserial: add pid/vid for Sierra Wireless EM7355 fw update
    - USB: serial: garmin_gps: fix I/O after failed probe and remove
    - USB: serial: garmin_gps: fix memory leak on probe errors
    - x86/MCE/AMD: Always give panic severity for UC errors in kernel context
    - platform/x86: peaq-wmi: Add DMI check before binding to the WMI interface
    - platform/x86: peaq_wmi: Fix missing terminating entry for peaq_dmi_table
    - HID: cp2112: add HIDRAW dependency
    - HID: wacom: generic: Recognize WACOM_HID_WD_PEN as a type of pen collection
    - staging: wilc1000: Fix bssid buffer offset in Txq
    - staging: ccree: fix 64 bit scatter/gather DMA ops
    - staging: greybus: spilib: fix use-after-free after deregistration
    - staging: vboxvideo: Fix reporting invalid suggested-offset-properties
    - staging: rtl8188eu: Revert 4 commits breaking ARP
    - Linux 4.13.15

* time drifting on linux-hwe kernels (LP: #1744988)
    - x86/tsc: Future-proof native_calibrate_tsc()
    - x86/tsc: Fix erroneous TSC rate on Skylake Xeon
    - x86/tsc: Print tsc_khz, when it differs from cpu_khz

* Please backport vmd suspend/resume patches to 16.04 hwe (LP: #1745508)
    - PCI: vmd: Free up IRQs on suspend path

* CVE-2017-17448
    - netfilter: nfnetlink_cthelper: Add missing permission checks

* Dell XPS 13 9360 bluetooth (Atheros) won't connect after resume
    (LP: #1744712)
    - Bluetooth: btusb: Restore QCA Rome suspend/resume fix with a "rewritten"
      version

* [SRU] TrackPoint: middle button doesn't work on TrackPoint-compatible
    device. (LP: #1746002)
    - Input: trackpoint - force 3 buttons if 0 button is reported

* TB16 dock ethernet corrupts data with hw checksum silently failing
    (LP: #1729674)
    - r8152: disable RX aggregation on Dell TB16 dock

* [Artful] Realtek ALC225: 2 secs noise when a headset plugged in
    (LP: #1744058)
    - Revert "UBUNTU: SAUCE: ALSA: hda/realtek - Add support headset mode for DELL
      WYSE"
    - SAUCE: ALSA: hda/realtek - Add support headset mode for DELL WYSE
    - ALSA: hda/realtek - update ALC225 depop optimize

* [A] skb leak in vhost_net / tun / tap (LP: #1738975)
    - vhost: fix skb leak in handle_rx()
    - tap: free skb if flags error
    - tun: free skb in early errors

* Commit d9018976cdb6 missing in Kernels <4.14.x preventing lasting fix of
    Intel SPI bug on certain serial flash (LP: #1742696)
    - mfd: lpc_ich: Do not touch SPI-NOR write protection bit on Haswell/Broadwell
    - spi-nor: intel-spi: Fix broken software sequencing codes

* CVE-2018-5332
    - RDS: Heap OOB write in rds_message_alloc_sgs()

* [A] KVM Windows BSOD on 4.13.x (LP: #1738972)
    - KVM: x86: fix APIC page invalidation

* elantech touchpad of Lenovo L480/580 failed to detect hw_version
    (LP: #1733605)
    - Input: elantech - add new icbody type 15

* [SRU] External HDMI monitor failed to show screen on Lenovo X1 series
    (LP: #1738523)
    - SAUCE: drm/i915: Disable writing of TMDS_OE on Lenovo ThinkPad X1 series

* ubuntu/xr-usb-serial didn't get built in zesty and artful (LP: #1733281)
    - SAUCE: make sure ubuntu/xr-usb-serial builds for x86

* Disabling zfs does not always disable module checks for the zfs modules
    (LP: #1737176)
    - [Packaging] disable zfs module checks when zfs is disabled

* CVE-2017-17806
    - crypto: hmac - require that the underlying hash algorithm is unkeyed

* CVE-2017-17805
    - crypto: salsa20 - fix blkcipher_walk API usage

* CVE-2017-16994
    - mm/pagewalk.c: report holes in hugetlb ranges

* CVE-2017-17450
    - netfilter: xt_osf: Add missing permission checks

* apparmor profile load in stacked policy container fails (LP: #1746463)
    - SAUCE: apparmor: fix display of .ns_name for containers

* CVE-2017-15129
    - net: Fix double free and memory corruption in get_net_ns_by_id()

* CVE-2018-5344
    - loop: fix concurrent lo_open/lo_release

* CVE-2017-1000407
    - KVM: VMX: remove I/O port 0x80 bypass on Intel hosts

* CVE-2017-0861
    - ALSA: pcm: prevent UAF in snd_pcm_info

* perf stat segfaults on uncore events w/o -a (LP: #1745246)
    - perf xyarray: Save max_x, max_y
    - perf evsel: Fix buffer overflow while freeing events

* Support cppc-cpufreq driver on ThunderX2 systems (LP: #1745007)
    - mailbox: PCC: Move the MAX_PCC_SUBSPACES definition to header file
    - ACPI / CPPC: Make CPPC ACPI driver aware of PCC subspace IDs
    - ACPI / CPPC: Fix KASAN global out of bounds warning
    - ACPI: CPPC: remove initial assignment of pcc_ss_data

* P-state not working in kernel 4.13 (LP: #1743269)
    - x86 / CPU: Avoid unnecessary IPIs in arch_freq_get_on_cpu()
    - x86 / CPU: Always show current CPU frequency in /proc/cpuinfo

* Regression: KVM no longer supports Intel CPUs without Virtual NMI
    (LP: #1741655)
    - kvm: vmx: Reinstate support for CPUs without virtual NMI

* System hang with Linux kernel due to mainline commit 24247aeeabe
    (LP: #1733662)
    - x86/intel_rdt/cqm: Prevent use after free

* $(LOCAL_ENV_CC) and $(LOCAL_ENV_DISTCC_HOSTS) should be properly quoted
    (LP: #1744077)
    - [Debian] pass LOCAL_ENV_CC and LOCAL_ENV_DISTCC_HOSTS properly

* the wifi driver is always hard blocked on a lenovo laptop (LP: #1743672)
    - ACPI: EC: Fix possible issues related to EC initialization order

* text VTs are unavailable on desktop after upgrade to Ubuntu 17.10
    (LP: #1724911)
    - drm/i915/fbdev: Always forward hotplug events

* Samsung SSD 960 EVO 500GB refused to change power state (LP: #1705748)
    - nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A

* [0cf3:e010] QCA6174A XR failed to pair with bt 4.0 device  (LP: #1741166)
    - Bluetooth: btusb: Add support for 0cf3:e010

* CVE-2017-17741
    - KVM: Fix stack-out-of-bounds read in write_mmio

* CVE-2018-5333
    - RDS: null pointer dereference in rds_atomic_free_op

* [800 G3 SFF] [800 G3 DM]External microphone of headset(3-ring) is working,
    2-ring mic not working, both not shown in sound settings  (LP: #1740974)
    - ALSA: hda - Add MIC_NO_PRESENCE fixup for 2 HP machines

* Two front mics can't work on a lenovo machine (LP: #1740973)
    - ALSA: hda - change the location for one mic on a Lenovo machine

* No external microphone be detected via headset jack on a dell machine
    (LP: #1740972)
    - ALSA: hda - fix headset mic detection issue on a Dell machine

*  Can't detect external headset via line-out jack on some Dell machines
    (LP: #1740971)
    - ALSA: hda/realtek - Fix Dell AIO LineOut issue

* Support realtek new codec alc257 in the alsa hda driver  (LP: #1738911)
    - ALSA: hda/realtek - New codec support for ALC257

* Add support for 16g huge pages on Ubuntu 16.04.2 PowerNV (LP: #1706247)
    - powerpc/mm/hugetlb: Allow runtime allocation of 16G.
    - powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel
      command line
    - mm/hugetlb: Allow arch to override and call the weak function

* the kernel is blackholing IPv6 packets to linkdown nexthops (LP: #1738219)
    - ipv6: Do not consider linkdown nexthops during multipath

* e1000e in 4.4.0-97-generic breaks 82574L under heavy load. (LP: #1730550)
    - e1000e: Avoid receiver overrun interrupt bursts
    - e1000e: Separate signaling for link check/link up

* Ubuntu 17.10: Include patch "crypto: vmx - Use skcipher for ctr fallback"
    (LP: #1732978)
    - crypto: vmx - Use skcipher for ctr fallback

* QCA Rome bluetooth can not wakeup after USB runtime suspended.
    (LP: #1737890)
    - Bluetooth: btusb: driver to enable the usb-wakeup feature

* /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

* Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for
    22s! [systemd:1]" (LP: #1730717)
    - SAUCE: exec: fix lockup because retry loop may never exit

* Request to backport cxlflash patches to 16.04 HWE Kernel (LP: #1730515)
    - scsi: cxlflash: Use derived maximum write same length
    - scsi: cxlflash: Allow cards without WWPN VPD to configure
    - scsi: cxlflash: Derive pid through accessors

* vagrant artful64 box filesystem too small (LP: #1726818)
    - block: factor out __blkdev_issue_zero_pages()
    - block: cope with WRITE ZEROES failing in blkdev_issue_zeroout()

* Artful update to 4.13.14 stable release (LP: #1744121)
    - ppp: fix race in ppp device destruction
    - gso: fix payload length when gso_size is zero
    - ipv4: Fix traffic triggered IPsec connections.
    - ipv6: Fix traffic triggered IPsec connections.
    - netlink: do not set cb_running if dump's start() errs
    - net: call cgroup_sk_alloc() earlier in sk_clone_lock()
    - macsec: fix memory leaks when skb_to_sgvec fails
    - l2tp: check ps->sock before running pppol2tp_session_ioctl()
    - netlink: fix netlink_ack() extack race
    - sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
    - tcp/dccp: fix ireq->opt races
    - packet: avoid panic in packet_getsockopt()
    - geneve: Fix function matching VNI and tunnel ID on big-endian
    - net: bridge: fix returning of vlan range op errors
    - soreuseport: fix initialization race
    - ipv6: flowlabel: do not leave opt->tot_len with garbage
    - sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
    - tcp/dccp: fix lockdep splat in inet_csk_route_req()
    - tcp/dccp: fix other lockdep splats accessing ireq_opt
    - net: dsa: check master device before put
    - net/unix: don't show information about sockets from other namespaces
    - tap: double-free in error path in tap_open()
    - net/mlx5: Fix health work queue spin lock to IRQ safe
    - net/mlx5e: Properly deal with encap flows add/del under neigh update
    - ipip: only increase err_count for some certain type icmp in ipip_err
    - ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
    - ip6_gre: update dst pmtu if dev mtu has been updated by toobig in
      __gre6_xmit
    - tcp: refresh tp timestamp before tcp_mtu_probe()
    - tap: reference to KVA of an unloaded module causes kernel panic
    - sctp: reset owner sk for data chunks on out queues when migrating a sock
    - net_sched: avoid matching qdisc with zero handle
    - l2tp: hold tunnel in pppol2tp_connect()
    - ipv6: addrconf: increment ifp refcount before ipv6_del_addr()
    - tcp: fix tcp_mtu_probe() vs highest_sack
    - mac80211: accept key reinstall without changing anything
    - mac80211: use constant time comparison with keys
    - mac80211: don't compare TKIP TX MIC key in reinstall prevention
    - usb: usbtest: fix NULL pointer dereference
    - Input: ims-psu - check if CDC union descriptor is sane
    - EDAC, sb_edac: Don't create a second memory controller if HA1 is not present
    - dmaengine: dmatest: warn user when dma test times out
    - Linux 4.13.14

-- Stefan Bader <stefan.bader@canonical.com>  Wed, 14 Mar 2018 11:38:23 +0100

Changed in linux (Ubuntu Artful):
status:	Fix Committed → Fix Released

Po-Hsu Lin (cypressyew) on 2019-10-03

Changed in linux (Ubuntu):
status:	Fix Committed → Fix Released

Jeff Lane  (bladernr) on 2020-03-03

tags:

removed: hwcert-server

Ubuntu
linux package

System hang with Linux kernel due to mainline commit 24247aeeabe

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	High	Unassigned
Artful	Fix Released	High	Unassigned
Bionic	Fix Committed	High	Unassigned
