kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!

Bug #1644056 reported by Subrata Modak
This bug affects 8 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Medium      Unassigned

Bug Description

Hi,

While running IO on the following kernel/Ubuntu version:

$ lsb_release -rd
Description: Ubuntu 14.04.5 LTS
Release: 14.04

$ uname -a
Linux 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue
Ubuntu 14.04.5 LTS \n \l

[1133672.985186] /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/pgtable-generic.c:33: bad pmd ffff881fd6790240(800000004b8008e7)
[1135572.440941] huge_memory: mapcount 0 page_mapcount 1
[1135572.441607] ------------[ cut here ]------------
[1135572.442059] kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!
[1135572.442571] invalid opcode: 0000 [#1] SMP
[1135572.443028] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds sb_edac irqbypass crct10dif_pclmul crc32_pclmul aesni_intel edac_core aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd dm_multipath lpc_ich ipmi_ssif ipmi_devintf shpchp 8250_fintek mac_hid acpi_power_meter ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ses enclosure raid1 ast ttm ixgbe hid_generic igb vxlan drm_kms_helper syscopyarea ip6_udp_tunnel usbhid dca sysfillrect udp_tunnel sysimgblt hid mxm_wmi fb_sys_fops mpt3sas ptp drm ahci raid_class pps_core libahci i2c_algo_bit mdio scsi_transport_sas fjes wmi
[1135572.448909] CPU: 15 PID: 2018 Comm: sh Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
[1135572.450082] Hardware name: Quanta Computer Inc. X-100.Column.01/S2PC-MB(Dual 1G LOM), BIOS S2P_3B04.HGT02 09/21/2016
[1135572.451346] task: ffff8814923ae040 ti: ffff882eba658000 task.ti: ffff882eba658000
[1135572.452494] RIP: 0010:[<ffffffff811e7c71>] [<ffffffff811e7c71>] __split_huge_page+0x691/0x6d0
[1135572.453580] RSP: 0018:ffff882eba65b7e0 EFLAGS: 00010292
[1135572.454589] RAX: 0000000000000027 RBX: ffffea00012e0000 RCX: 0000000000000000
[1135572.455742] RDX: 0000000000000001 RSI: ffff883fff3cdc78 RDI: ffff883fff3cdc78
[1135572.457040] RBP: ffff882eba65b860 R08: 0000000000000000 R09: ffff881fe93eaf00
[1135572.458271] R10: 00000000000003ff R11: 0000000000000ac1 R12: 0000000000000000
[1135572.459539] R13: ffff882eba65ba10 R14: ffffea00012e0000 R15: ffffea00012e0000
[1135572.460746] FS: 00007fcf6b305740(0000) GS:ffff883fff3c0000(0000) knlGS:0000000000000000
[1135572.461972] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1135572.463237] CR2: 0000558db9da0bb8 CR3: 00000021667eb000 CR4: 00000000003406e0
[1135572.464515] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1135572.465831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1135572.467177] Stack:
[1135572.468502] 0000000000000000 ffff882eba65b840 ffffffff811bdfe9 ffff883fed2e51d0
[1135572.469769] 00000000811c67c8 ffff883fee9bb760 00000007f43c9000 ffff882eba65ba10
[1135572.471040] 0000000000007b6d ffffffff81c72fa0 ffff883ff17340f4 ffffea00012e0000
[1135572.472277] Call Trace:
[1135572.473574] [<ffffffff811bdfe9>] ? rmap_walk+0x239/0x2d0
[1135572.475079] [<ffffffff811e7d17>] split_huge_page_to_list+0x67/0xd0
[1135572.476473] [<ffffffff811c4357>] add_to_swap+0x57/0x70
[1135572.477852] [<ffffffff811967ec>] shrink_page_list+0x62c/0x770
[1135572.479246] [<ffffffff81196f49>] shrink_inactive_list+0x1e9/0x500
[1135572.480688] [<ffffffff81197b9e>] shrink_lruvec+0x58e/0x730
[1135572.482077] [<ffffffff81093ba0>] ? __queue_work+0x130/0x350
[1135572.483615] [<ffffffff81093ba0>] ? __queue_work+0x130/0x350
[1135572.485078] [<ffffffff81197e1c>] shrink_zone+0xdc/0x2c0
[1135572.486661] [<ffffffff81198374>] do_try_to_free_pages+0x164/0x440
[1135572.488354] [<ffffffff8119397d>] ? throttle_direct_reclaim+0x8d/0x230
[1135572.490068] [<ffffffff81198705>] try_to_free_pages+0xb5/0x170
[1135572.491535] [<ffffffff8118b7c7>] __alloc_pages_nodemask+0x597/0xac0
[1135572.493287] [<ffffffff8118bead>] alloc_kmem_pages_node+0x4d/0xd0
[1135572.495083] [<ffffffff8107af35>] copy_process+0x185/0x1c70
[1135572.496792] [<ffffffff81115902>] ? from_kgid_munged+0x12/0x20
[1135572.498404] [<ffffffff8120296d>] ? cp_new_stat+0x13d/0x160
[1135572.500116] [<ffffffff8107cbba>] _do_fork+0x8a/0x310
[1135572.501866] [<ffffffff8107cee9>] SyS_clone+0x19/0x20
[1135572.503672] [<ffffffff817f6f36>] entry_SYSCALL_64_fastpath+0x16/0x75
[1135572.505395] Code: 16 d0 fe ff e9 37 fa ff ff 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 41 8b 57 18 8b 75 a4 48 c7 c7 48 f0 ac 81 31 c0 83 c2 01 e8 01 8d f9 ff <0f> 0b 48 8b 55 c0 4c 89 ee 4c 89 f7 89 4d a8 e8 5b d0 fe ff 8b
[1135572.509080] RIP [<ffffffff811e7c71>] __split_huge_page+0x691/0x6d0
[1135572.510823] RSP <ffff882eba65b7e0>
[1135572.517043] ---[ end trace 6c09902f497f90d9 ]---

I would like to know whether this is a known issue with this kernel or a new issue being reported.
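
For readers trying to interpret the "bad pmd" line in the log above: the number in parentheses appears to be the raw page-table entry value. The following is a minimal sketch, not part of the original report, that decodes the low flag bits of such a value. It assumes the architectural x86-64 bit layout (bit 0 present, bit 1 read/write, bit 2 user, bit 3 PWT, bit 4 PCD, bit 5 accessed, bit 6 dirty, bit 7 page size, bit 8 global). The function name `decode_pmd_flags` is made up for illustration.

```shell
#!/bin/sh
# decode_pmd_flags (hypothetical helper): print the names of the low flag
# bits that are set in an x86-64 page-table entry value.
# Argument: the value as hex digits without a 0x prefix (at least 3 digits).
decode_pmd_flags() {
    val=$1
    # Keep only the last three hex digits (bits 0-11). This also avoids
    # 64-bit overflow in shell arithmetic when bit 63 is set.
    prefix=${val%???}
    low_hex=${val#"$prefix"}
    low=$((0x$low_hex))
    out=""
    bit=0
    # Names follow the architectural order of bits 0 through 8.
    for name in present rw user pwt pcd accessed dirty pse global; do
        if [ $(( ($low >> $bit) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
        bit=$((bit + 1))
    done
    echo "${out# }"
}

decode_pmd_flags 800000004b8008e7   # value from the original report
decode_pmd_flags 000000019b0009e2   # value from a later comment
```

Under these assumptions the first value decodes with the present bit set, while the second has the page-size bit set but the present bit clear. When the present bit is clear the hardware ignores the remaining bits, so their meaning is up to the kernel.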

Subrata Modak (tosubrata) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1644056

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.9 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc7

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
penalvch (penalvch)
tags: added: needs-apport-collect needs-upstream-testing
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Tomas Hamal (hamal-tom) wrote :

For us it happens on Ubuntu 16.04.2 LTS:
$ cat /etc/issue
Ubuntu 16.04.2 LTS \n \l

Apr 29 18:00:13 production1 kernel: [7352459.993408] huge_memory: mapcount 0 page_mapcount 1
Apr 29 18:00:13 production1 kernel: [7352459.993469] ------------[ cut here ]------------
Apr 29 18:00:13 production1 kernel: [7352459.993490] kernel BUG at /build/linux-W6HB68/linux-4.4.0/mm/huge_memory.c:1931!
Apr 29 18:00:13 production1 kernel: [7352459.993520] invalid opcode: 0000 [#1] SMP
Apr 29 18:00:13 production1 kernel: [7352459.993540] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink bluetooth binfmt_misc ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass joydev input_leds sb_edac edac_core ipmi_ssif lpc_ich mei_me mei ioatdma shpchp ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter acpi_pad mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic ast ttm drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect raid10 crc32_pclmul raid1 ghash_clmulni_intel aesni_intel aes_x86_64 igb lrw gf128mul sysimgblt glue_helper ablk_helper fb_sys_fops cryptd usbhid dca mpt3sas ahci ptp drm hid raid_class libahci pps_core i2c_algo_bit scsi_transport_sas wmi fjes
Apr 29 18:00:13 production1 kernel: [7352460.001736] CPU: 16 PID: 227 Comm: kswapd1 Not tainted 4.4.0-62-generic #83-Ubuntu
Apr 29 18:00:13 production1 kernel: [7352460.002906] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 2.0a 09/16/2016
Apr 29 18:00:13 production1 kernel: [7352460.004084] task: ffff883f65a5c600 ti: ffff883f6227c000 task.ti: ffff883f6227c000
Apr 29 18:00:13 production1 kernel: [7352460.005276] RIP: 0010:[<ffffffff811f96ff>] [<ffffffff811f96ff>] split_huge_page_to_list+0x74f/0x7a0
Apr 29 18:00:13 production1 kernel: [7352460.006454] RSP: 0018:ffff883f6227f9c0 EFLAGS: 00010282
Apr 29 18:00:13 production1 kernel: [7352460.007627] RAX: 0000000000000027 RBX: ffffea013f3e0000 RCX: 0000000000000000
Apr 29 18:00:13 production1 kernel: [7352460.008771] RDX: 0000000000000000 RSI: ffff887f7f18dd58 RDI: ffff887f7f18dd58
Apr 29 18:00:13 production1 kernel: [7352460.009966] RBP: ffff883f6227fa50 R08: 0000000000000000 R09: 0000000000012f00
Apr 29 18:00:13 production1 kernel: [7352460.011154] R10: 000000000000001c R11: 0000000000012f00 R12: 0000000000000000
Apr 29 18:00:13 production1 kernel: [7352460.012287] R13: ffff883f6227fc00 R14: ffffea013f3e0000 R15: ffff883f6227fb28
Apr 29 18:00:13 production1 kernel: [7352460.013435] FS: 0000000000000000(0000) GS:ffff887f7f180000(0000) knlGS:0000000000000000
Apr 29 18:00:13 production1 kernel: [7352460.014563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 29 18:00:13 production1 kernel: [7352460.015676] CR2: 00000000b5ace0...

Changed in linux (Ubuntu):
status: Expired → Confirmed
Tomas Hamal (hamal-tom) wrote :

$ uname -a
Linux production1 4.4.0-75-generic #96-Ubuntu SMP Thu Apr 20 09:56:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04

Ruslan Lutsenko (ruslan-lutcenko) wrote :

It seems we are hitting a similar bug:
[Thu Jul 20 10:38:14 2017] /home/kernel/COD/linux/mm/pgtable-generic.c:33: bad pmd ffff97c4e4ab6800(00000006dca009e2)
[Thu Jul 20 10:50:17 2017] BUG: Bad rss-counter state mm:ffff97cb4d11a6c0 idx:1 val:512
[Thu Jul 20 10:50:17 2017] BUG: non-zero nr_ptes on freeing mm: 1
[Wed Jul 26 09:21:00 2017] /home/kernel/COD/linux/mm/pgtable-generic.c:33: bad pmd ffff97e4b2a7f008(000000237ba009e2)
[Wed Jul 26 09:25:51 2017] BUG: Bad rss-counter state mm:ffff97cb537e1f00 idx:1 val:512
[Wed Jul 26 09:25:51 2017] BUG: non-zero nr_ptes on freeing mm: 1

I've tried the default LTS kernel and a recent release; both versions show the same issue:
uname -a
4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

uname -a
4.10.0-041000-lowlatency #201702191831 SMP PREEMPT Sun Feb 19 23:36:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -rd
Description: Ubuntu 16.04.2 LTS
Release: 16.04

These servers are Elasticsearch data nodes doing heavy I/O, with 264G of RAM mostly used by the file cache holding Lucene data.

I've tried upgrading the kernel and tried different versions of OpenJDK and Oracle JDK. The issue remains: the Elasticsearch process keeps crashing at random on different servers with similar hardware configurations.

I will attach the apport crash report (without the base64-encoded core dump).

Guillaume (xiu42) wrote :

I have the same issue:
[Thu Jul 27 05:37:43 2017] /build/linux-5EyXrQ/linux-4.4.0/mm/pgtable-generic.c:33: bad pmd ffff880ff7409df8(000000019b0009e2)
[Thu Jul 27 06:10:03 2017] BUG: Bad rss-counter state mm:ffff8804f6f26400 idx:1 val:512
[Thu Jul 27 06:10:03 2017] BUG: non-zero nr_ptes on freeing mm: 1

Would it be fixed by this patch: https://lkml.org/lkml/2017/4/10/152 , which has been withdrawn in favor of https://lkml.org/lkml/2017/3/2/347 ?

Guillaume (xiu42) wrote :

$ uname -a
Linux batch2 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -rd
Description: Ubuntu 16.04.1 LTS
Release: 16.04

asm64 (asm64) wrote :

Ubuntu Server

# cat /etc/os-release | grep "VERSION="
VERSION="16.04.3 LTS (Xenial Xerus)"
# uname -r
4.4.0-91-generic

Kernel log is attached. A patch was released in April: https://lkml.org/lkml/2017/4/10/152

kokushibyou (kokushibyou) wrote :

I am seeing this problem on heavily utilized HP G9 servers, specifically when the RAID battery fails. The battery fails, the chassis prints a similar error in dmesg, and the heavily utilized process (in this case Java/Elasticsearch) stops responding. A reboot recovers the host, and the HP utilities then report that the RAID battery is bad. We swap the battery, reboot the host, and it recovers.

We've had this occur 6 different times on 6 different boxes. The HP G9s are known to have troublesome RAID batteries; I just didn't expect the kernel to have such an issue with it.

[Tue Oct 17 01:06:15 2017] /build/linux-EO9xOi/linux-4.4.0/mm/pgtable-generic.c:33: bad pmd ffff88156c4320b8(000000212e6009e2)

uname -a
Linux xx 4.4.0-59-generic #80-Ubuntu SMP Fri Jan 6 17:47:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/issue
Ubuntu 16.04.1 LTS \n \l

Daniel Axtens (daxtens) wrote :

Hi all,

I think there are two issues at play here, one is the bad pmd one, and one is the original "huge_memory: mapcount 0 page_mapcount 1".

Perhaps we could break the bad pmd issue out into a different LP bug?
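
To help sort reports into the two groups described above, here is a rough sketch (an editorial illustration, not something posted in the thread) that labels kernel log lines by the message strings quoted in this bug. The function names `classify_line` and `summarise` and the label names are assumptions chosen for this example. The labels only reflect which message matched and say nothing about root cause.

```shell
#!/bin/sh
# classify_line (hypothetical helper): map one kernel log line to a label
# based on the message strings quoted in this bug report.
classify_line() {
    case "$1" in
        *"huge_memory: mapcount"*|*"kernel BUG at"*"mm/huge_memory.c"*)
            echo thp-mapcount ;;
        *"bad pmd"*)
            echo bad-pmd ;;
        *"Bad rss-counter state"*)
            echo rss-counter ;;
        *"non-zero nr_ptes"*)
            echo nr-ptes ;;
        *)
            echo other ;;
    esac
}

# summarise (hypothetical helper): read log lines on stdin and print a
# count for each label, ignoring lines that match none of the messages.
summarise() {
    while IFS= read -r line; do
        classify_line "$line"
    done | grep -v '^other$' | sort | uniq -c
}
```

Typical use would be `dmesg | summarise`, or `summarise < /var/log/kern.log` on systems that keep that file. A host that only ever shows `bad-pmd`, `rss-counter` and `nr-ptes` lines may belong in the separate bug proposed here.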

People with the original bug - was anyone able to verify if this happened on a more recent kernel? My understanding of mm/huge_memory.c is that it was significantly refactored after 4.4, so I would be interested to hear if that makes the issue go away.

I think disabling transparent huge pages on boot should also make the issue go away if anyone is able to try that?

Regards,
Daniel
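
For anyone wanting to try the suggestion above about disabling transparent huge pages, the following is a minimal sketch for checking the current setting. It assumes the standard sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content marks the active mode in square brackets, for example `[always] madvise never`. The helper name `thp_active_mode` is hypothetical.

```shell
#!/bin/sh
# thp_active_mode (hypothetical helper): extract the bracketed mode from a
# string such as "[always] madvise never".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

THP_FILE=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP_FILE" ]; then
    echo "current THP mode: $(thp_active_mode "$(cat "$THP_FILE")")"
else
    echo "THP sysfs file not readable; the kernel may be built without THP"
fi

# To disable THP until the next reboot (needs root):
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
# To disable THP from boot, add this to the kernel command line:
#   transparent_hugepage=never
```

The runtime switch is handy for a quick check, but it only affects huge pages allocated afterwards. The boot parameter is probably the cleaner way to test whether the problem goes away.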

linas (linasvepstas) wrote :

FYI, similar messages on a 4.8.0 kernel (not Ubuntu):

Nov 29 13:34:42 fanny kernel: [2752823.883284] BUG: Bad rss-counter state mm:ffff8d1c56233c00 idx:1 val:5
Nov 29 13:34:42 fanny kernel: [2752823.883292] BUG: non-zero nr_ptes on freeing mm: 2

$ cat /proc/version
Linux version 4.8.0-0.bpo.2-amd64 (<email address hidden>) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Debian 4.8.15-2~bpo8+2 (2017-01-17)

$ cat /proc/meminfo
MemTotal: 264141684 kB

$ cat /proc/cpuinfo
processor : 23
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD Opteron(tm) Processor 6344
stepping : 0
microcode : 0x600081c
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 5200.02

Brad Figg (brad-figg)
tags: added: cscc