Execute NX-protected page - 4.4.0-78-generic - kernel panic

Bug #1691741 reported by Jordi de Wal
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

After upgrading from 4.4.0-77 to 4.4.0-78 I started getting kernel panics.

The crashes do not happen immediately; they have generally occurred after a couple of minutes, sometimes longer.

After enabling the linux-crashdump tooling, I managed to extract this dmesg:

[ 995.103846] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 995.104141] BUG: unable to handle kernel paging request at ffff88042a284000
[ 995.104407] IP: [<ffff88042a284000>] 0xffff88042a284000
[ 995.104594] PGD 43f20b067 PUD 43f20e067 PMD 42a3da063 PTE 800000042a284163
[ 995.104946] Oops: 0011 [#1] SMP
[ 995.105143] Modules linked in: zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) ppdev input_leds shpchp serio_raw i2c_piix4 mac_hid parport_pc parport 8250_fintek autofs4 ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm psmouse floppy pata_acpi
[ 995.107081] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P O 4.4.0-78-generic #99-Ubuntu
[ 995.107299] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
[ 995.107573] task: ffff88042a278000 ti: ffff88042a280000 task.ti: ffff88042a280000
[ 995.108070] RIP: 0010:[<ffff88042a284000>] [<ffff88042a284000>] 0xffff88042a284000
[ 995.108637] RSP: 0018:ffff88042a283ed0 EFLAGS: 00010082
[ 995.109116] RAX: 0000000000000001 RBX: 000000e797438af0 RCX: 0000000000000000
[ 995.109638] RDX: 0000000000000001 RSI: 0000000000000083 RDI: 0000000000000083
[ 995.110143] RBP: ffffffff81f38d40 R08: 000000000000000a R09: 0000000000000000
[ 995.110665] R10: 000000010002a665 R11: 0000000000004c00 R12: ffff88042a283ed0
[ 995.111182] R13: ffffffff810ff75e R14: 0000000000000000 R15: ffff88042a280000
[ 995.111733] FS: 0000000000000000(0000) GS:ffff88043fc80000(0000) knlGS:0000000000000000
[ 995.112486] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 995.112978] CR2: ffff88042a284000 CR3: 000000043d246000 CR4: 00000000000006e0
[ 995.113497] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 995.114085] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 995.114612] Stack:
[ 995.114965] ffff88042a283f28 ffffffff810c4736 ffff88042a280000 ffff88042a284000
[ 995.116204] ee041b0196f77cc4 a1abbcd2b8b123ce 0000000000000000 0000000000000000
[ 995.117389] 0000000000000000 0000000000000000 0000000000000000 ffff88042a283f48
[ 995.118425] Call Trace:
[ 995.118811] [<ffffffff810c4736>] ? cpu_startup_entry+0x176/0x350
[ 995.119293] [<ffffffff810517c4>] ? start_secondary+0x154/0x190
[ 995.119775] Code: ff ff ff 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 02 02 00 00 00 00 00 00 58 3f 28 2a 04 88 ff ff 18 00 00 00 00 00 00 00 <c0> 8c 27 2a 04 88 ff ff 00 00 00 00 00 00 00 00 02 00 00 00 00
[ 995.125554] RIP [<ffff88042a284000>] 0xffff88042a284000
[ 995.126088] RSP <ffff88042a283ed0>
[ 995.126453] CR2: ffff88042a284000

I've upgraded other machines as well, and only this particular VM shows this behaviour.

I have a crash dump, but I haven't looked into the contents yet. Getting the dmesg was already a pain in the behind.
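(For reference, extracting that dmesg from the dump boils down to roughly the following. This is only a sketch: it assumes the linux-crashdump and crash packages plus the matching vmlinux debug symbols (linux-image-4.4.0-78-generic-dbgsym) are installed, and the dump path shown is illustrative.)

$ sudo apt-get install linux-crashdump crash
$ kdump-config show        # verify kdump is armed before the next panic
# after the panic, open the partial dump against the debug vmlinux:
$ sudo crash /usr/lib/debug/boot/vmlinux-4.4.0-78-generic /var/crash/201705181342/dump.201705181342
crash> log > dmesg.txt     # writes the kernel ring buffer (the output quoted above) to a file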

The VM this happens on is:
- a KVM guest
- x86_64, 4 cores
- 16 GB RAM

lsb_release:
Distributor ID: Ubuntu
Description: Ubuntu 16.04.2 LTS
Release: 16.04
Codename: xenial

lspci says:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: VMware SVGA II Adapter
00:03.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
00:0a.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:0b.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:12.0 Ethernet controller: Red Hat, Inc Virtio network device
00:1e.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
00:1f.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

Let me know if there are other helpful details I can provide. If I find out more, I'll update this ticket.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1691741

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jordi de Wal (jdwal) wrote :

Due to the nature of the data on the machine and the fact that I don't know what apport will send, I'm unable to execute the apport-collect.

# Some data from /usr/bin/crash:
      KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-78-generic
    DUMPFILE: dump.201705181342 [PARTIAL DUMP]
        CPUS: 4
        DATE: Thu May 18 13:42:12 2017
      UPTIME: 00:16:34
LOAD AVERAGE: 0.21, 0.05, 0.01
       TASKS: 547
    NODENAME: <my_server>
     RELEASE: 4.4.0-78-generic
     VERSION: #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017
     MACHINE: x86_64 (2199 Mhz)
      MEMORY: 16 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff88042a284000"
         PID: 0
     COMMAND: "swapper/1"
        TASK: ffff88042a278000 (1 of 4) [THREAD_INFO: ffff88042a280000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

# crash> bt
PID: 0 TASK: ffff88042a278000 CPU: 1 COMMAND: "swapper/1"
 #0 [ffff88042a283b78] machine_kexec at ffffffff8105c0db
 #1 [ffff88042a283bd8] crash_kexec at ffffffff8110e572
 #2 [ffff88042a283ca8] oops_end at ffffffff81031c49
 #3 [ffff88042a283cd0] no_context at ffffffff8106ad35
 #4 [ffff88042a283d30] __bad_area_nosemaphore at ffffffff8106b000
 #5 [ffff88042a283d78] bad_area_nosemaphore at ffffffff8106b183
 #6 [ffff88042a283d88] __do_page_fault at ffffffff8106b447
 #7 [ffff88042a283de0] trace_do_page_fault at ffffffff8106b7f7
 #8 [ffff88042a283e10] do_async_page_fault at ffffffff81063ef9
 #9 [ffff88042a283e20] async_page_fault at ffffffff81842be8
#10 [ffff88042a283e38] tick_nohz_idle_exit at ffffffff810ff75e
#11 [ffff88042a283ed8] cpu_startup_entry at ffffffff810c4736
#12 [ffff88042a283f30] start_secondary at ffffffff810517c4

# crash> bt -f
# #9 [ffff88042a283e20] async_page_fault at ffffffff81842be8
    ffff88042a283e28: ffff88042a280000 0000000000000000
    ffff88042a283e38: ffffffff810ff75e
#10 [ffff88042a283e38] tick_nohz_idle_exit at ffffffff810ff75e
    ffff88042a283e40: ffff88042a283ed0 ffffffff81f38d40
    ffff88042a283e50: 000000e797438af0 0000000000004c00
    ffff88042a283e60: 000000010002a665 0000000000000000
    ffff88042a283e70: 000000000000000a 0000000000000001
    ffff88042a283e80: 0000000000000000 0000000000000001
    ffff88042a283e90: 0000000000000083 0000000000000083
    ffff88042a283ea0: ffffffffffffffff ffff88042a284000
    ffff88042a283eb0: 0000000000000010 0000000000010082
    ffff88042a283ec0: ffff88042a283ed0 0000000000000018
    ffff88042a283ed0: ffff88042a283f28 ffffffff810c4736
#11 [ffff88042a283ed8] cpu_startup_entry at ffffffff810c4736
    ffff88042a283ee0: ffff88042a280000 ffff88042a284000
    ffff88042a283ef0: ee041b0196f77cc4 a1abbcd2b8b123ce
    ffff88042a283f00: 0000000000000000 0000000000000000
    ffff88042a283f10: 0000000000000000 0000000000000000
    ffff88042a283f20: 0000000000000000 ffff88042a283f48
    ffff88042a283f30: ffffffff810517c4
#12 [ffff88042a283f30] start_secondary at ffffffff810517c4

# crash> dis tick_nohz_idle_exit
0xffffffff810ff74f <tick_nohz_idle_exit+127>: mov %r12,0xa8(%rbx)
0xffffffff810ff756 <tick_nohz_idle_exit+134>: mov %r12,%rsi
0xffffffff810ff759 <tick_nohz_idle_exit+137>: callq 0xffffffff810ff170 <tick_nohz_restart>
0xf...


Walter (wdoekes)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

Does the panic stop happening if you boot back into -77?

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream stable kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.4 stable kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.68

Can you also test the latest 3.19 upstream stable kernel? This will tell us if the fix in 4.1 was also sent for inclusion in the stable releases. It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/
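(For reference, installing one of these mainline builds comes down to fetching the .deb and installing it; a sketch for the v4.4.68 amd64 generic flavour:)

$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.68/linux-image-4.4.68-040468-generic_4.4.68-040468.201705140831_amd64.deb
$ sudo dpkg -i linux-image-4.4.68-040468-generic_4.4.68-040468.201705140831_amd64.deb
$ sudo reboot    # pick the 4.4.68 entry in GRUB if it is not the default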

tags: added: kernel-da-key needs-bisect xenial
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Walter (wdoekes) wrote :

> Does the panic stop happening if you boot back into -77?

Yes. It has now been running fine for more than 3 hours on -77.

Observe that we have upgraded more than one machine to -78, and only one machine has trouble.

> Would it be possible for you to test the latest upstream stable kernel?

Ah, you have debianized builds for upstream kernels. I'm sure we can try that tomorrow.

Thanks for your questions/feedback.

Walter (wdoekes) wrote :

Well. That did not go so well: there were no ZFS modules in the builds.

We've tried all of:

* http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.19.8-vivid/linux-image-3.19.8-031908-generic_3.19.8-031908.201505110938_amd64.deb

* http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.68/linux-image-4.4.68-040468-generic_4.4.68-040468.201705140831_amd64.deb

* 4.4.0-78-generic with ZFS modules disabled.

But because there was no ZFS, our setup wasn't the same.

We tried:
- 15 minutes of file copying/removing with sync => no panic

Then we tried the -78 kernel with ZFS to reproduce the previous day's panic:
- file copying on the ZFS disk => no panic
- 15 minutes of stress(1) testing with `stress -m 1 -c 1 -i 1 -d 1` => no panic
- repeat of yesterday's tasks => *instant* *panic*

The tasks were in this case (a rough shell approximation follows the list):
- removing the mysql data dir
- aborting the rm, because mysqld (maria) was still running
- stopping mysqld, did not work
- resuming removal of data dir
- mkdir of data dir (took rather long)
- kill -9 on mysqld
- kernel panic within 2s
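(Roughly, as shell; this is a sketch only, assuming the default /var/lib/mysql data directory, not the literal commands used:)

$ rm -rf /var/lib/mysql &          # start removing the data dir
$ kill %1                          # abort the rm; mysqld (MariaDB) was still running
$ service mysql stop               # did not work
$ rm -rf /var/lib/mysql            # resume removal of the data dir
$ mkdir -p /var/lib/mysql          # took rather long
$ kill -9 $(pidof mysqld)          # kernel panic within ~2s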

Previously the panic would occur when we were transferring a complete mysql dataset of 100G-500G over tcp to xbstream (receive).
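(On the receiving side that transfer was of roughly this shape; the port and target directory are illustrative:)

$ nc -l 9999 | xbstream -x -C /var/lib/mysql   # receive a Percona xbstream backup stream over TCP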

----

We cannot test any further on this system without working ZFS support, since without ZFS the environment is different enough that the panic does not happen.

Is there any way you could provide the ZFS modules for the stock kernel?

Jordi de Wal (jdwal)
tags: added: kernel-unable-to-test-upstream
Walter (wdoekes)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Walter (wdoekes) wrote :

Crash info appears the same:

      KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-78-generic
    DUMPFILE: dump.201705191443 [PARTIAL DUMP]
        CPUS: 4
        DATE: Fri May 19 14:43:31 2017
      UPTIME: 00:43:02
LOAD AVERAGE: 2.53, 1.81, 2.56
       TASKS: 544
    NODENAME: <hidden>
     RELEASE: 4.4.0-78-generic
     VERSION: #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017
     MACHINE: x86_64 (2199 Mhz)
      MEMORY: 16 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff88042d684000"
         PID: 0
     COMMAND: "swapper/1"
        TASK: ffff88042d678000 (1 of 4) [THREAD_INFO: ffff88042d680000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

bt:
 #9 [ffff88042d683e20] async_page_fault at ffffffff81842be8
#10 [ffff88042d683e38] tick_nohz_idle_exit at ffffffff810ff75e
#11 [ffff88042d683ed8] cpu_startup_entry at ffffffff810c4736
#12 [ffff88042d683f30] start_secondary at ffffffff810517c4

bt -f:
 #9 [ffff88042d683e20] async_page_fault at ffffffff81842be8
    ffff88042d683e28: ffff88042d680000 0000000000000000
    ffff88042d683e38: ffffffff810ff75e
#10 [ffff88042d683e38] tick_nohz_idle_exit at ffffffff810ff75e
    ffff88042d683e40: ffff88042d683ed0 ffffffff81f38d40
    ffff88042d683e50: 00000259403d7830 0000000000002800
    ffff88042d683e60: 000000010008b4dc 0000000000000000
    ffff88042d683e70: 0000000000000014 0000000000000001
    ffff88042d683e80: 0000000000000000 0000000000000001
    ffff88042d683e90: 0000000000000083 0000000000000083
    ffff88042d683ea0: ffffffffffffffff ffff88042d684000
    ffff88042d683eb0: 0000000000000010 0000000000010082
    ffff88042d683ec0: ffff88042d683ed0 0000000000000018
    ffff88042d683ed0: ffff88042d683f28 ffffffff810c4736
#11 [ffff88042d683ed8] cpu_startup_entry at ffffffff810c4736
    ffff88042d683ee0: ffff88042d680000 ffff88042d684000
    ffff88042d683ef0: 7072184c56e60054 ada92d752b7bcf68
    ffff88042d683f00: 0000000000000000 0000000000000000
    ffff88042d683f10: 0000000000000000 0000000000000000
    ffff88042d683f20: 0000000000000000 ffff88042d683f48
    ffff88042d683f30: ffffffff810517c4

Walter (wdoekes) wrote :

I could be on a wild goose chase here.

But one of the prominent changes between -77 and -78 is that the backing_dev_info member of the request queue turned into a pointer.

Part of the "bdi" changes in this bit:
++ * UbuntuKVM guest crashed while running I/O stress test with Ubuntu kernel
++ 4.4.0-47-generic (LP: #1659111)

I believe from here: https://patchwork.kernel.org/patch/9547199/

For example this:

--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2462,7 +2462,7 @@ static int drbd_congested(void *congested_data, int bdi_bits)

  if (get_ldev(device)) {
   q = bdev_get_queue(device->ldev->backing_bdev);
- r = bdi_congested(&q->backing_dev_info, bdi_bits);
+ r = bdi_congested(q->backing_dev_info, bdi_bits);
   put_ldev(device);
   if (r)
    reason = 'b';

But if I check the Ubuntu-specific code, that change does not appear to have been made everywhere:

$ zcat linux_4.4.0-78.99.diff.gz | grep bdi_congested.*backing_dev -B2 | tail -n8

    struct request_queue *q = bdev_get_queue(rdev->bdev);

- ret |= bdi_congested(&q->backing_dev_info, bits);
+ ret |= bdi_congested(q->backing_dev_info, bits);
--
+ struct request_queue *q = bdev_get_queue(rs->dev[p].dev->bdev);
+
+ r |= bdi_congested(&q->backing_dev_info, bdi_bits);

Extracting only the "new" files, I find it here indeed:

$ mkdir foo; cd foo
$ zcat ../linux_4.4.0-78.99.diff.gz | patch -tp1
$ find . -name '*.c' | xargs grep -B2 bdi_congested.*backing_dev_info
./ubuntu/dm-raid4-5/dm-raid4-5.c- struct request_queue *q = bdev_get_queue(rs->dev[p].dev->bdev);
./ubuntu/dm-raid4-5/dm-raid4-5.c-
./ubuntu/dm-raid4-5/dm-raid4-5.c: r |= bdi_congested(&q->backing_dev_info, bdi_bits);

This is just one example, of course; I believe there could be more "bdi" changes like this one that haven't been applied.

Correct me if I'm wrong though. I've never done any kernel dev, so I could be way off base here.

Walter (wdoekes) wrote :

So, I tried to revert the backing_dev_info changes and dpkg-built an updated kernel, but that didn't work out, either because it isn't the cause or because I reverted too much or too little.
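(The rebuild itself followed the usual Ubuntu kernel rebuild steps, roughly as below; a sketch assuming deb-src entries are enabled, with the revert patch being the one attached further down:)

$ apt-get source linux-image-$(uname -r)
$ cd linux-4.4.0/
$ patch -p1 < ../revert-backing_dev_info-changes-of-77-78-DID-NOT-WORK.patch   # apply the revert
$ fakeroot debian/rules clean
$ fakeroot debian/rules binary-headers binary-generic
$ sudo dpkg -i ../linux-image-4.4.0-78-generic_*_amd64.deb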

After several hours of uptime under normal load, removing the mysql data dir quickly caused a panic again. Unfortunately I did not get a dmesg or dump, but I'd guess the same problem still existed.

Attaching for reference:
revert-backing_dev_info-changes-of-77-78-DID-NOT-WORK.patch
^-- did not help

Walter (wdoekes) wrote :

Okay. On a different machine with different load, we've now got the same problem:

      KERNEL: /usr/lib/debug/boot/vmlinux-4.4.0-78-generic
    DUMPFILE: dump.201705300948 [PARTIAL DUMP]
        CPUS: 4
        DATE: Tue May 30 09:48:05 2017
      UPTIME: 01:47:38
LOAD AVERAGE: 0.05, 0.06, 0.04
       TASKS: 292
    NODENAME: ossoio-docker1-tcn
     RELEASE: 4.4.0-78-generic
     VERSION: #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017
     MACHINE: x86_64 (2199 Mhz)
      MEMORY: 4 GB
       PANIC: "BUG: unable to handle kernel paging request at ffff88013a404000"
         PID: 0
     COMMAND: "swapper/3"
        TASK: ffff88013abf1980 (1 of 4) [THREAD_INFO: ffff88013a400000]
         CPU: 3
       STATE: TASK_RUNNING (PANIC)

 #9 [ffff88013a403e20] async_page_fault at ffffffff81842be8
#10 [ffff88013a403e38] tick_nohz_idle_exit at ffffffff810ff75e
#11 [ffff88013a403ed8] cpu_startup_entry at ffffffff810c4736
#12 [ffff88013a403f30] start_secondary at ffffffff810517c4

Differences:

- this machine does not use zfs, the other one does
- this machine runs docker instances, the other one mainly mysqld
- this machine has x2apic enabled according to dmesg (no idea what that is)

Similarities:

- both are KVM guests (same KVM cluster, different nodes)
- the two KVM nodes have the same hardware, same kernel and same KVM host software
- 4 cpus, 2200MHz, flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm
- 512MB swap

Modules loaded:

- both: 8250_fintek autofs4 drm drm_kms_helper fb_sys_fops floppy i2c_piix4 input_leds mac_hid parport parport_pc pata_acpi ppdev psmouse serio_raw shpchp syscopyarea sysfillrect sysimgblt ttm

- this: aufs bridge br_netfilter ip6table_filter ip6_tables iptable_filter iptable_nat ip_tables ipt_MASQUERADE llc nf_conntrack nf_conntrack_ipv4 nf_conntrack_netlink nf_defrag_ipv4 nf_nat nf_nat_ipv4 nf_nat_masquerade_ipv4 nfnetlink stp veth xfrm_algo xfrm_user x_tables xt_addrtype xt_conntrack xt_nat xt_tcpudp

- other: spl(O) zavl(PO) zcommon(PO) zfs(PO) znvpair(PO) zunicode(PO)

We should be able to run this node on the vanilla kernel and see how that goes. Will report back in a bit.

Walter (wdoekes) wrote :

Okay. So we've been running 4.4.70 now for 22 hours and counting on docker1-tcn.

/var/crash# uptime
 09:45:30 up 22:25, 1 user, load average: 0,01, 0,02, 0,00

/var/crash# uname -a
Linux ossoio-docker1-tcn 4.4.70-040470-generic #201705251131 SMP Thu May 25 15:34:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

/var/crash# ls -l
total 36
drwxr-xr-x 2 root root 4096 mei 29 18:19 201705291818
drwxr-xr-x 2 root root 4096 mei 29 18:37 201705291836
drwxr-xr-x 2 root root 4096 mei 29 23:53 201705292352
drwxr-xr-x 2 root root 4096 mei 30 08:00 201705300759
drwxr-xr-x 2 root root 4096 mei 30 09:49 201705300948
drwxr-xr-x 2 root root 4096 mei 30 10:58 201705301057
-rw-r--r-- 1 root root 0 mei 30 11:31 201705301119-START-vanilla-kernel-with-overlay2-instead-of-aufs
-rw-r--r-- 1 root root 0 mei 31 09:16 201705310915-22hrs-uptime-still-no-crash

Previously we had crashes at least every 8 hours. This looks promising.

But, for this machine to work, I had to switch the dockerd storage from aufs to overlay2, because the vanilla kernel doesn't include aufs. This change isn't a problem in itself, but it makes the result less reliable: is the problem with the Ubuntu-specific modules, or with something that has been fixed between kernel 4.4.61 and 4.4.70?
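(The switch amounted to something like the following; /etc/docker/daemon.json is the standard location, and note that existing aufs container data is not migrated automatically:)

$ sudo tee /etc/docker/daemon.json <<'EOF'
{
    "storage-driver": "overlay2"
}
EOF
$ sudo systemctl restart docker
$ docker info | grep 'Storage Driver'    # should now report overlay2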

My best guess is still something with a (filesystem module?) change that went into 4.4.0-78 that did not get (fully) applied to the Ubuntu-specific modules.

Walter (wdoekes) wrote :

The docker1-tcn machine with vanilla kernel 4.4.70 is still up.

And on another machine (abmfn-staging) with 4.4.0-78 I've seen the crash. According to dmesg output on the console, btrfs and other fs modules were loaded, probably due to an update-grub call.

After restart, I ensured that those modules were not loaded. No crashes have happened there since.
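(One way to make sure of that is to disable such modules outright via modprobe.d; a sketch, using btrfs as the example module:)

$ echo 'install btrfs /bin/false' | sudo tee /etc/modprobe.d/disable-btrfs.conf
$ sudo update-initramfs -u
$ lsmod | grep btrfs    # should come back empty after the next reboot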

In all cases, the machines were running on a Proxmox cluster, with "Linux 4.4.24-1-pve" (pve-kernel-4.4.24-1-pve 4.4.24-72) as the host OS kernel and pve-qemu-kvm 2.7.0-8 as the KVM version.

Walter (wdoekes) wrote :

Update: we've been running 4.4.0-79-generic on the original problem machine for 24 hours now (normal workload) without kernel panics thus far. *Crosses fingers*

Walter (wdoekes) wrote :

Uptime is now 13 days. I think we can close this one as "fixed by 4.4.0-79-generic".

Jordi de Wal (jdwal) wrote :

Still no crashes seen with 4.4.0-79-generic (and newer versions). I would consider this fixed, then.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released