[Dell PowerEdge R720] Machine Check Exception

Bug #1315736 reported by Sami Pietila
This bug affects 7 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

A Dell PowerEdge R720 running Ubuntu 14.04 shows MCE errors in dmesg. Dell support instructed me to run DSET and the BIOS hardware diagnostics. Neither tool showed any errors. Dell support said that a hardware error would have shown up in the Dell logs, and that the probable cause of the dmesg entries is a bug in the Ubuntu kernel's MCE reporting.

So, is the following dmesg output caused by a kernel bug in Ubuntu 14.04 Server?

[11562.171040] Please check user daemon is running.
[94953.306404] sbridge: HANDLING MCE MEMORY ERROR
[94953.306415] CPU 1: Machine Check Exception: 0 Bank 9: 8c00004b000800c0
[94953.306416] TSC 0 ADDR 2dfa0e1000 MISC 90000800080168c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
[94953.306422] sbridge: HANDLING MCE MEMORY ERROR
[94953.306423] CPU 1: Machine Check Exception: 0 Bank 10: 8c000050000800c1
[94953.306424] TSC 0 ADDR 2dfa0e1000 MISC 90000000000208c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
[94953.532217] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:1 channel_mask:3 rank:0)
[94953.532226] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 channel_mask:3 rank:0)
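For anyone trying to read the two status words in the log above, here is a minimal sketch that decodes them. It assumes the architectural IA32_MCi_STATUS layout described in the Intel SDM (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0) and the memory controller error code pattern 000F 0000 1MMM CCCC. The function name decode_mci_status and the dictionary keys are hypothetical and do not come from any kernel tool.

```python
# Sketch only: field positions are assumed from the Intel SDM machine-check
# architecture chapter; the names used here are illustrative.

MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Return selected fields of a 64-bit IA32_MCi_STATUS value as a dict."""
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "miscv": bool((status >> 59) & 1),
        "addrv": bool((status >> 58) & 1),
        "pcc": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_code": (status >> 16) & 0xFFFF,       # bits 31:16
        "mca_code": mca_code,                        # bits 15:0
        "memory_op": None,
        "channel": None,
    }
    # Memory controller errors match 000F 0000 1MMM CCCC (F = filter bit).
    if mca_code & 0xEF80 == 0x0080:
        info["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        info["channel"] = None if channel == 0xF else channel
    return info


if __name__ == "__main__":
    for raw in ("8c00004b000800c0", "8c000050000800c1"):
        print(raw, decode_mci_status(int(raw, 16)))
```

Under these assumptions both values decode as valid, corrected (UC clear) memory scrubbing errors with a count of 1, on channels 0 and 1, which agrees with the "CE memory scrubbing error" lines from EDAC. The sketch does not say whether the cause is hardware or software.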

---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 touko 2 19:15 seq
 crw-rw---- 1 root audio 116, 33 touko 2 19:15 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
CurrentDmesg:
 Error: command ['sh', '-c', 'dmesg | comm -13 --nocheck-order /var/log/dmesg -'] failed with exit code 1: comm: /var/log/dmesg: Permission denied
 dmesg: write failed: Broken pipe
DistroRelease: Ubuntu 14.04
InstallationDate: Installed on 2014-02-26 (66 days ago)
InstallationMedia: Ubuntu-Server 14.04 LTS "Trusty Tahr" - Alpha amd64 (20140219)
MachineType: Dell Inc. PowerEdge R720
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-24-generic root=UUID=c03eb237-955a-4dee-bba1-deded53df372 ro
ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-24-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

WifiSyslog:

_MarkForUpload: True
dmi.bios.date: 01/16/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.2.2
dmi.board.name: 0DCWD1
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.2.2:bd01/16/2014:svnDellInc.:pnPowerEdgeR720:pvr:rvnDellInc.:rn0DCWD1:rvrA01:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R720
dmi.sys.vendor: Dell Inc.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1315736

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Sami Pietila (sampie) wrote : HookError_source_linux.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Sami Pietila (sampie) wrote : IwConfig.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : Lspci.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : Lsusb.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : ProcEnviron.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : ProcModules.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : UdevDb.txt

apport information

Revision history for this message
Sami Pietila (sampie) wrote : UdevLog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Sami Pietila (sampie) wrote : Re: Machine Check Exception

dmesg also has the following message:

[ 9441.626809] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
[ 9441.628777] invalid opcode: 0000 [#1] SMP
[ 9441.630053] Modules linked in: ip6t_REJECT xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables rpcsec_gss_krb5 nfsv4 dcdbas nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack gpio_ich iptable_filter ip_tables x_tables x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd shpchp mei_me lpc_ich mei wmi ipmi_si lp mac_hid acpi_power_meter parport hid_generic tg3 ahci usbhid ptp hid libahci megaraid_sas pps_core
[ 9441.651872] CPU: 29 PID: 12407 Comm: java Not tainted 3.13.0-24-generic #46-Ubuntu
[ 9441.654159] Hardware name: Dell Inc. PowerEdge R720/0DCWD1, BIOS 2.2.2 01/16/2014
[ 9441.656421] task: ffff8817c7115fc0 ti: ffff880294c74000 task.ti: ffff880294c74000
[ 9441.658682] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[ 9441.661268] RSP: 0018:ffff880294c75d98 EFLAGS: 00010246
[ 9441.662863] RAX: 0000000000000100 RBX: 00000007ff258010 RCX: ffff880294c75b18
[ 9441.665019] RDX: ffff8817c7115fc0 RSI: 0000000000000000 RDI: 8000000182e009e6
[ 9441.667173] RBP: ffff880294c75e20 R08: 0000000000000000 R09: 00000000000000a9
[ 9441.669329] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801057acfc8
[ 9441.671483] R13: ffff881238353800 R14: ffff8817f8fdd080 R15: 0000000000000080
[ 9441.673640] FS: 00007fd475705700(0000) GS:ffff88301f3c0000(0000) knlGS:0000000000000000
[ 9441.676088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9441.677817] CR2: 00007f9cd00451b8 CR3: 00000002cd12c000 CR4: 00000000001407e0
[ 9441.679973] Stack:
[ 9441.680557] 0000000000000001 ffff880294c75db0 ffffffff8109a780 ffff880294c75dd0
[ 9441.682897] ffffffff810d7ad6 0000000000000001 ffffffff81f1e948 ffff880294c75e78
[ 9441.685238] ffffffff810d983d ffff880294c75e48 ffff8800000000a9 00000001ffffffff
[ 9441.687580] Call Trace:
[ 9441.688309] [<ffffffff8109a780>] ? wake_up_state+0x10/0x20
[ 9441.689986] [<ffffffff810d7ad6>] ? wake_futex+0x66/0x90
[ 9441.691583] [<ffffffff810d983d>] ? futex_wake_op+0x4ed/0x620
[ 9441.693314] [<ffffffff817219a4>] __do_page_fault+0x184/0x560
[ 9441.695046] [<ffffffff811112fc>] ? acct_account_cputime+0x1c/0x20
[ 9441.696912] [<ffffffff8109d76b>] ? account_user_time+0x8b/0xa0
[ 9441.698697] [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
[ 9441.700505] [<ffffffff81721d9a>] do_page_fault+0x1a/0x70
[ 9441.702132] [<ffffffff8171e208>] page_fault+0x28/0x30
[ 9441.703673] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
[ 9441.711148] RIP [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
[ 9441.713013] RSP <ffff880294c75d98>
[ 9441.877706] ---[ end trace e9d7418...

Revision history for this message
Sami Pietila (sampie) wrote :

I am also seeing messages like: BUG: soft lockup - CPU#25 stuck for 23s!

Revision history for this message
Sami Pietila (sampie) wrote :

It seems that putting the server under load results in an unresponsive server, with the console constantly flooded with error messages. A screenshot is attached.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you also perform a memory test, which can be accessed from the GRUB menu? If you haven't gone to the GRUB menu before, it can be accessed by holding the SHIFT key after system power-on and seeing the BIOS messages.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc4-utopic/

tags: added: kernel-da-key
Revision history for this message
Sami Pietila (sampie) wrote :

Hi,

I did run the memory test and no errors were detected.

I also changed to the mainline kernel. With the mainline kernel (3.15.0-031500rc4-generic #201405042135 SMP) I have not yet seen an MCE error or had an unresponsive system; however, I can still see some errors in dmesg:

[ 840.160260] INFO: task fastq-join:2753 blocked for more than 120 seconds.
[ 840.162350] Not tainted 3.15.0-031500rc4-generic #201405042135
[ 840.164324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 840.166723] fastq-join D 000000000000000d 0 2753 2752 0x00000002
[ 840.166726] ffff882e6c2e1bb8 0000000000000002 ffff882d7b74c438 ffff882e6c2e1fd8
[ 840.166728] 0000000000014500 0000000000014500 ffff882ff79fcb60 ffff882ff7d89920
[ 840.166730] ffff882e6c2e1bb8 ffff88301f2d4e00 ffff882ff7d89920 ffffffff8115b2a0
[ 840.166732] Call Trace:
[ 840.166742] [<ffffffff8115b2a0>] ? __lock_page+0x70/0x70
[ 840.166747] [<ffffffff8175fd99>] schedule+0x29/0x70
[ 840.166749] [<ffffffff8175fe6f>] io_schedule+0x8f/0xd0
[ 840.166751] [<ffffffff8115b2ae>] sleep_on_page+0xe/0x20
[ 840.166753] [<ffffffff81760532>] __wait_on_bit+0x62/0x90
[ 840.166755] [<ffffffff8115bd7b>] ? find_get_pages_tag+0xcb/0x170
[ 840.166756] [<ffffffff8115b410>] wait_on_page_bit+0x80/0x90
[ 840.166761] [<ffffffff810b3970>] ? wake_atomic_t_function+0x40/0x40
[ 840.166762] [<ffffffff8115b5e4>] filemap_fdatawait_range+0xf4/0x180
[ 840.166765] [<ffffffff8115d51d>] filemap_write_and_wait_range+0x4d/0x80
[ 840.166777] [<ffffffffa03e151f>] nfs4_file_fsync+0x5f/0xb0 [nfsv4]
[ 840.166781] [<ffffffff811fbf06>] vfs_fsync+0x26/0x40
[ 840.166793] [<ffffffffa03199ba>] nfs_file_flush+0x8a/0xd0 [nfs]
[ 840.166796] [<ffffffff811ca1ea>] filp_close+0x3a/0x90
[ 840.166801] [<ffffffff811e937a>] put_files_struct.part.12+0x7a/0xd0
[ 840.166803] [<ffffffff811e9725>] put_files_struct+0x15/0x20
[ 840.166804] [<ffffffff811e97f2>] exit_files+0x52/0x60
[ 840.166808] [<ffffffff8106e521>] do_exit+0x171/0x470
[ 840.166811] [<ffffffff81021e45>] ? syscall_trace_enter+0x165/0x280
[ 840.166813] [<ffffffff8106e8b4>] do_group_exit+0x44/0xa0
[ 840.166814] [<ffffffff8106e927>] SyS_exit_group+0x17/0x20
[ 840.166817] [<ffffffff8176d03f>] tracesys+0xe1/0xe6
[ 1461.722458] perf interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[30187.334010] ------------[ cut here ]------------
[30187.335401] kernel BUG at /home/apw/COD/linux/mm/memory.c:3924!
[30187.337183] invalid opcode: 0000 [#1] SMP
[30187.338459] Modules linked in: rpcsec_gss_krb5 ip6t_REJECT nfsv4 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 nfsd dcdbas xt_conntrack auth_rpcgss ip6table_filter x86_pkg_temp_thermal ip6_tables intel_powerclamp nf_conntrack_netbios_ns nfs_acl nf_conntrack_broadcast nfs nf_nat_ftp coretemp nf_nat kvm_intel lockd nf_conntrack_ftp nf_conntrack kvm sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel fscache iptable_filter ip_tables aes_x86_64 x_tables lrw gf128mul glue_helper ablk_helper joydev cryptd mei_me sb_edac acpi_power_meter shpchp me...

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you also give the latest 3.13 upstream kernel a test? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11-trusty/

Revision history for this message
Sami Pietila (sampie) wrote :

It may be, after all, that mainline v3.15 behaves like the default Ubuntu kernel, as the server just became unresponsive. However, I noticed that I am able to log in, but only as root and only from the console. The server seems to be in a weird state: the "ps aux" command shows about half of the processes and then hangs, locking the whole console. Logging in on the next console and running "ps aux" again hangs at the same location as the previous ps.

Kernel v3.13 seems to be older than what I am now running. Is there a chance that it would contain any fixes compared to v3.15?

Revision history for this message
Sami Pietila (sampie) wrote :

I am not sure if I have understood these tools correctly, but does vmstat show that the CPUs are idle while the uptime command shows that the system load is about 40?

Revision history for this message
Sami Pietila (sampie) wrote :

Now the logs also show the MCE error, so there seems to be no behavioral difference between v3.15 and the standard Ubuntu kernel.

[78132.360975] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[78132.403996] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000046000800c1
[78132.448800] EDAC sbridge MC1: TSC 0
[78132.449823] EDAC sbridge MC1: ADDR 2df41c3000 EDAC sbridge MC1: MISC 900080008000c8c
[78132.495588] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1399522552 SOCKET 1 APIC 20
[78133.113960] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2df41c3 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 channel_mask:1 rank:0)
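As a cross-check that the EDAC line and the MCE ADDR field describe the same location, here is a small sketch that rebuilds the physical address from the page and offset fields of an EDAC message. It assumes 4 KiB pages (a shift of 12 bits, as on x86_64) and that the message has the same layout as the lines quoted in this report. The function name edac_physical_address is hypothetical.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on x86_64

# Matches the DIMM label plus the page and offset fields of an EDAC line
# in the format quoted in this report.
EDAC_RE = re.compile(
    r"on (\S+) \(.*?page:(0x[0-9a-fA-F]+) offset:(0x[0-9a-fA-F]+)"
)


def edac_physical_address(line):
    """Return (dimm_label, physical_address) or None if the line does not match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    label, page, offset = match.groups()
    return label, (int(page, 16) << PAGE_SHIFT) | int(offset, 16)


if __name__ == "__main__":
    sample = (
        "[78133.113960] EDAC MC1: 1 CE memory scrubbing error on "
        "CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2df41c3 "
        "offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 "
        "socket:1 channel_mask:1 rank:0)"
    )
    label, addr = edac_physical_address(sample)
    print(label, hex(addr))
```

For the line above this gives 0x2df41c3000, the same value as the ADDR field in the preceding MCE record, so both messages appear to refer to one corrected error on one page.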

Revision history for this message
Sami Pietila (sampie) wrote :

Here are error messages from dmesg:

There are lines like:

[30187.335401] kernel BUG at /home/apw/COD/linux/mm/memory.c:3924!
[30187.337183] invalid opcode: 0000 [#1] SMP
...

[30223.621247] WARNING: CPU: 12 PID: 29190 at /home/apw/COD/linux/kernel/watchdog.c:249 watchdog_overflow_callback+0x98/0xc0()
[30223.621248] Watchdog detected hard LOCKUP on cpu 12

Revision history for this message
Sami Pietila (sampie) wrote :

Top is also showing load averages of about 60-70, but judging by the process list the system looks pretty much idle.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced. If this is a regression, we can perform a kernel bisect to identify the commit that introduced the problem.

Revision history for this message
Sami Pietila (sampie) wrote :

This is a new Dell PowerEdge server that was initially installed with Ubuntu 14.04 Server. The reported problem(s) seem to occur with all kernels that I have tried (the default Ubuntu server kernel and the v3.15 upstream kernel).

penalvch (penalvch)
description: updated
summary: - Machine Check Exception
+ [Dell PowerEdge R720] Machine Check Exception
tags: added: kernel-bug-exists-upstream-v3.15-rc4 regression-potential
tags: added: latest-bios-2.2.2
Revision history for this message
penalvch (penalvch) wrote :

Sami Pietila, could you please test the latest mainline kernel via http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc5-utopic/ and advise on the results?

If reproducible, for regression testing purposes, could you please test for this in the server release of http://old-releases.ubuntu.com/releases/12.04.0/ and advise on the results?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Sami Pietila (sampie) wrote :

Hi Cristopher,

The server is a fully installed production server, so unfortunately downgrading to 12.04 is not possible.

Reproducing the problem takes about 2 days, during which time the server is offline. Having the server offline is somewhat problematic. It is possible to test 3.15-rc5, but I would rather do so only if it contains a fix addressing symptoms similar to those reported here. Does rc5 have a fix which could remedy the problem?

Do the provided error messages give any information about what might be wrong, or are they not helpful in this case? I can see that they include some source code line numbers etc.

penalvch (penalvch)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Tiago Antao (tiagoantao) wrote :

I seem to have this bug also. While this is on a production server, I have some flexibility in rebooting it.

I can note a few issues:

1. The kernel bug only happens with Java (tested both open-jdk7 and oracle8)

2. The java processes block and cannot be killed

3. Any process that tries to inspect the java process becomes blocked (e.g. top, ps, ...). an strace of a ps:
open("/proc/41126/status", O_RDONLY) = 6
read(6, "Name:\tjava\nState:\tD (disk sleep)"..., 1024) = 870
close(6) = 0
open("/proc/41126/cmdline", O_RDONLY) = 6
read(6,
[BLOCKS there]

4. As long as no queries are done on the blocked java processes, everything works (though the load of the machine is apparently high)

Tell me what you need done to test this, and I will do it
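Since the strace above shows the read of /proc/<pid>/status completing and only the read of /proc/<pid>/cmdline blocking, here is a minimal sketch that lists tasks in uninterruptible sleep (state D) using the status file alone. On Linux, D-state tasks count toward the load average, which may explain a high load on an otherwise idle machine. The function name uninterruptible_tasks and the proc_root parameter are hypothetical, and there is no guarantee that this avoids the hang on an affected kernel.

```python
import os


def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, name, state) tuples for tasks in state D.

    Only <proc_root>/<pid>/status is opened; cmdline is deliberately
    never touched, because that is the read that blocked in the strace.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                fields = dict(
                    line.split(":", 1) for line in handle if ":" in line
                )
        except OSError:
            continue  # the process exited while we were scanning
        state = fields.get("State", "").strip()
        if state.startswith("D"):
            name = fields.get("Name", "").strip()
            stuck.append((int(entry), name, state))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, name, state in uninterruptible_tasks():
        print(pid, name, state)
```

Comparing the number of entries printed with the load average reported by uptime would show whether blocked tasks account for the load.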

Revision history for this message
penalvch (penalvch) wrote :

Tiago Antao, thank you for your comment. So your hardware and problem may be tracked, could you please file a new report with Ubuntu by executing the following in a terminal while booted into a Ubuntu repository kernel (not a mainline one) via:
ubuntu-bug linux

For more on this, please read the official Ubuntu documentation:
Ubuntu Bug Control and Ubuntu Bug Squad: https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue
Ubuntu Kernel Team: https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Filing_Kernel_Bug_reports
Ubuntu Community: https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

Revision history for this message
Tiago Antao (tiagoantao) wrote :

I will do this, but one important comment: I am on a Supermicro, not a Dell. But the bug seems the same (same kernel BUG line, and also Java-related taints).

Revision history for this message
Sami Pietila (sampie) wrote :

Hi Tiago,

I agree, the bug seems the same (freezing process listing and the relation to Java in reproducing the bug). However, I am wondering whether the Machine Check Exception is caused by the same bug or is a totally different bug. Have you observed a Machine Check Exception?

Revision history for this message
Tiago Antao (tiagoantao) wrote :

We have now installed the new kernel, but as the bug is non-deterministic, we will have to wait until it manifests itself.

Revision history for this message
Tiago Antao (tiagoantao) wrote :

Sami,

Good observation: I do not have a machine check exception. The similarities are: a reported bug on the same line; similar behaviour; and Java involved. For reference I copy my kernel bug below (I get several instances of this, except that the later ones are tainted). As soon as I have a problem with the new upstream kernel I will report back.

May 9 09:55:29 wintermute kernel: [604868.582044] ------------[ cut here ]------------
May 9 09:55:29 wintermute kernel: [604868.582059] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
May 9 09:55:29 wintermute kernel: [604868.582064] invalid opcode: 0000 [#1] SMP
May 9 09:55:29 wintermute kernel: [604868.582069] Modules linked in: veth xt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables bridge stp llc bnep rfcomm bluetooth aufs binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd parport_pc ppdev psmouse amd64_edac_mod sp5100_tco serio_raw edac_core lp fam15h_power k10temp i2c_piix4 edac_mce_amd mac_hid parport hid_generic usbhid hid usb_storage ixgbe igb mdio i2c_algo_bit dca ahci ptp libahci pps_core
May 9 09:55:29 wintermute kernel: [604868.582148] CPU: 21 PID: 25260 Comm: java Not tainted 3.13.0-24-generic #46-Ubuntu
May 9 09:55:29 wintermute kernel: [604868.582152] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5 12/16/2013
May 9 09:55:29 wintermute kernel: [604868.582156] task: ffff8876d3985fc0 ti: ffff8871f58c8000 task.ti: ffff8871f58c8000
May 9 09:55:29 wintermute kernel: [604868.582159] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
May 9 09:55:29 wintermute kernel: [604868.582171] RSP: 0000:ffff8871f58c9d98 EFLAGS: 00010246
May 9 09:55:29 wintermute kernel: [604868.582174] RAX: 0000000000000100 RBX: 00007fa583801ea0 RCX: ffff8871f58c9b18
May 9 09:55:29 wintermute kernel: [604868.582177] RDX: ffff8876d3985fc0 RSI: 0000000000000000 RDI: 80000020286009e6
May 9 09:55:29 wintermute kernel: [604868.582180] RBP: ffff8871f58c9e20 R08: 0000000000000000 R09: 00000000000000a9
May 9 09:55:29 wintermute kernel: [604868.582182] R10: 0000000000000001 R11: 0000000000000000 R12: ffff883fb68b30e0
May 9 09:55:29 wintermute kernel: [604868.582185] R13: ffff882e351b2600 R14: ffff88702aceec80 R15: 0000000000000080
May 9 09:55:29 wintermute kernel: [604868.582188] FS: 00007fa5603f2700(0000) GS:ffff882fe7d40000(0000) knlGS:0000000000000000
May 9 09:55:29 wintermute kernel: [604868.582192] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 9 09:55:29 wintermute kernel: [604868.582194] CR2: 00007fa583a05620 CR3: 0000007861d59000 CR4: 00000000000407e0
May 9 09:55:29 wintermute kernel: [604868.582198] Stack:
May 9 09:55:29 wintermute kernel: [604868.582200] ffff8871f58c9e20 ffff88702aceec80 00007fad7d38fd70 00007fa583804020
May 9 09:55:29 wintermute kernel: [604868.582241] 0000000000002190 00007fad7401bb68 0000000000000000 0000000000000002
May 9 09:55:29 wintermute kernel: [604868.582266] ffff887101ef5e20 00007fad781a900f ffff8800000000a9 fffff...
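To compare reports from different machines mechanically rather than by eye, here is a rough sketch that pulls a small signature out of an oops: the source file and line of the kernel BUG, the Comm name, and the function at RIP. The regular expressions are written against the log lines quoted in this thread and may not match other kernel versions. The function name oops_signature is hypothetical.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
COMM_RE = re.compile(r"Comm: (\S+)")
RIP_RE = re.compile(r"RIP: \S+ \[<[0-9a-f]+>\] ([\w.]+)\+0x")


def oops_signature(text):
    """Return (source_file, line, comm, function); missing parts are None."""
    bug = BUG_RE.search(text)
    comm = COMM_RE.search(text)
    rip = RIP_RE.search(text)
    source = line = None
    if bug:
        # Keep only the last two path components, because the build
        # directory prefix differs between kernel builds.
        source = "/".join(bug.group(1).split("/")[-2:])
        line = int(bug.group(2))
    return (
        source,
        line,
        comm.group(1) if comm else None,
        rip.group(1) if rip else None,
    )


if __name__ == "__main__":
    dell = (
        "[ 9441.626809] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!\n"
        "[ 9441.651872] CPU: 29 PID: 12407 Comm: java Not tainted 3.13.0-24-generic #46-Ubuntu\n"
        "[ 9441.658682] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10\n"
    )
    print(oops_signature(dell))
```

Two reports with equal signatures on the same kernel build are likely, though not certain, to be the same crash. Line numbers differ between kernel versions, so the mainline oops at memory.c:3924 would not compare equal to the 3.13 one even if the underlying problem is the same.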

Revision history for this message
penalvch (penalvch) wrote :
Revision history for this message
Nikhil Chhaochharia (nikhilchh) wrote :

We are also seeing this bug. The machine becomes unresponsive: we cannot ssh in, the load average is high, and trying to access the running Java process does not work. I will file a bug as described in comment #30.

Our hardware is an HP ProLiant DL380p and we see the following in the syslog:

May 26 06:19:38 server06 kernel: [75831.929529] ------------[ cut here ]------------
May 26 06:19:38 server06 kernel: [75831.930191] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
May 26 06:19:38 server06 kernel: [75831.931129] invalid opcode: 0000 [#1] SMP
May 26 06:19:38 server06 kernel: [75831.931729] Modules linked in: xt_multiport ip6t_REJECT xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables gpio_ich nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw sb_edac edac_core lpc_ich hpwdt hpilo ioatdma lp dca ipmi_si parport acpi_power_meter mac_hid tg3 ptp psmouse hpsa pps_core
May 26 06:19:38 server06 kernel: [75831.941585] CPU: 4 PID: 2930 Comm: java Not tainted 3.13.0-24-generic #47-Ubuntu
May 26 06:19:38 server06 kernel: [75831.942633] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 02/10/2014
May 26 06:19:38 server06 kernel: [75831.943583] task: ffff881fe8372fe0 ti: ffff881fe632a000 task.ti: ffff881fe632a000
May 26 06:19:38 server06 kernel: [75831.944654] RIP: 0010:[<ffffffff81179051>] [<ffffffff81179051>] handle_mm_fault+0xe61/0xf10
May 26 06:19:38 server06 kernel: [75831.946137] RSP: 0000:ffff881fe632bd98 EFLAGS: 00010246
May 26 06:19:38 server06 kernel: [75831.946885] RAX: 0000000000000100 RBX: 00007fc37320a370 RCX: ffff881fe632bb18
May 26 06:19:38 server06 kernel: [75831.947902] RDX: ffff881fe8372fe0 RSI: 0000000000000000 RDI: 8000000100c009e6
May 26 06:19:38 server06 kernel: [75831.948932] RBP: ffff881fe632be20 R08: 0000000000000000 R09: 00000000000000a9
May 26 06:19:38 server06 kernel: [75831.949952] R10: 0000000000000001 R11: 0000000000000000 R12: ffff881fd83a7cc8
May 26 06:19:38 server06 kernel: [75831.950961] R13: ffff880fe6787d40 R14: ffff880fe5d95780 R15: 0000000000000080
May 26 06:19:38 server06 kernel: [75831.951985] FS: 00007fc938145700(0000) GS:ffff880fffa80000(0000) knlGS:0000000000000000
May 26 06:19:38 server06 kernel: [75831.976736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 06:19:38 server06 kernel: [75832.005183] CR2: 00007fc373620930 CR3: 0000000fe63fe000 CR4: 00000000000407e0
May 26 06:19:38 server06 kernel: [75832.033473] Stack:
May 26 06:19:38 server06 kernel: [75832.060551] 0000000000000001 ffff881fe632bdb0 ffffffff8109a780 ffff881fe632bdd0
May 26 06:19:38 server06 kernel: [75832.117385] ffffffff810d7ad6 0000000000000001 ffffffff81f1ea20 ffff881fe632be78
May 26 06:19:38 server06 kernel: [75832.173599] ffffffff810d983d ffff881fe632be48 ffff8800000000a9 00000001ffffffff
May 26 06:19:38 server06 kernel: [75832.231813] Call...

Revision history for this message
Sami Pietila (sampie) wrote :

I am wondering if this is also a security issue, as it seems that a normal user can render the system unresponsive. Do other reporters have an easy procedure to reproduce this bug? I have a somewhat complex procedure which involves running a DNA sequence analysis pipeline (Qiime).

Revision history for this message
penalvch (penalvch) wrote :

Nikhil Chhaochharia, thank you for your comment. So your hardware and problem may be tracked, could you please file a new report with Ubuntu by executing the following in a terminal while booted into a Ubuntu repository kernel (not a mainline one) via:
ubuntu-bug linux

For more on this, please read the official Ubuntu documentation:
Ubuntu Bug Control and Ubuntu Bug Squad: https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue
Ubuntu Kernel Team: https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Filing_Kernel_Bug_reports
Ubuntu Community: https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

Revision history for this message
Tamas Papp (tomposmiko) wrote :

This is a duplicate of #1323165.
There is also a possible workaround.

Revision history for this message
km (srikrishnamohan) wrote :

I am on a Dell R910 server with Ubuntu 14.04 running on the 3.13.30 kernel and I face the same issue. After a few days, the system freezes and throws an oops. So I have downgraded the kernel to 3.5.1, which does not seem to show the error. So it seems this issue has been solved downstream but did not get carried into the latest upstream (3.15.x). I have tried custom-compiling 3.15.x hoping that it would have been fixed, but it is not.
When can we expect a fix/patch for this issue in 3.15.x kernels? Or could someone please share a patch that could be applied to the latest upstream (3.15.x) kernels? The issue is of immediate importance for us. Thank you!

The error log is:

Jun 23 08:25:09 ubuntu kernel: [330721.035825] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:638]
Jun 23 08:25:09 ubuntu kernel: [330721.037440] Modules linked in: arc4 md4 nls_utf8 cifs ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables rpcsec_gss_krb5 nfsv4 nfsd auth_rpcgss binfmt_misc nfs_acl nfs lockd sunrpc fscache dcdbas gpio_ich intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw joydev i7core_edac ipmi_si wmi acpi_power_meter mac_hid lpc_ich edac_core xfs lp parport btrfs xor raid6_pq libcrc32c ses hid_generic enclosure usbhid psmouse qlcnic hid megaraid_sas bnx2
Jun 23 08:25:09 ubuntu kernel: [330721.037518] CPU: 4 PID: 638 Comm: kworker/4:1 Not tainted 3.13.0-29-generic #53-Ubuntu
Jun 23 08:25:09 ubuntu kernel: [330721.037520] Hardware name: Dell Inc. PowerEdge R910/0KYD3D, BIOS 2.10.0 08/29/2013
Jun 23 08:25:09 ubuntu kernel: [330721.037532] Workqueue: cgroup_destroy css_killed_work_fn
Jun 23 08:25:09 ubuntu kernel: [330721.037535] task: ffff883f240f5fc0 ti: ffff883f23cba000 task.ti: ffff883f23cba000
Jun 23 08:25:09 ubuntu kernel: [330721.037537] RIP: 0010:[<ffffffff8115b956>] [<ffffffff8115b956>] put_page+0x16/0x40
Jun 23 08:25:09 ubuntu kernel: [330721.037545] RSP: 0018:ffff883f23cbbd20 EFLAGS: 00000202
Jun 23 08:25:09 ubuntu kernel: [330721.037547] RAX: 000000000000000a RBX: ffff883f7f8502e0 RCX: 0000000000000001
Jun 23 08:25:09 ubuntu kernel: [330721.037549] RDX: 000000000000000b RSI: 00000000de42de40 RDI: ffffea01a5e13840
Jun 23 08:25:09 ubuntu kernel: [330721.037551] RBP: ffff883f23cbbd20 R08: ffff883f7f418c00 R09: 00000000fffffff6
Jun 23 08:25:09 ubuntu kernel: [330721.037552] R10: 0000000000000001 R11: 0000000000001000 R12: ffffffff8115c110
Jun 23 08:25:09 ubuntu kernel: [330721.037554] R13: 0000000000000000 R14: ffffffff8115bec0 R15: ffff883f23cbbcc8
Jun 23 08:25:09 ubuntu kernel: [330721.037556] FS: 0000000000000000(0000) GS:ffff883f7f840000(0000) knlGS:0000000000000000
Jun 23 08:25:09 ubuntu kernel: [330721.037558] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 23 08:25:09 ubuntu kernel: [330721.037560] CR2: 00007fff22051220 CR3: 0000000001c0e000 CR4: 00000000000007e0
Jun 23 08:25:09 ubuntu kernel: [330721.037561] Stack:
Jun 23 08:25:09 ubuntu kernel:...

Changed in linux (Ubuntu):
status: Incomplete → New
status: New → Confirmed
status: Confirmed → New
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

km, thank you for your comment. So your hardware and problem may be tracked, could you please file a new report with Ubuntu by executing the following in a terminal while booted into the default Ubuntu kernel (not a mainline one) via:
ubuntu-bug linux

For more on this, please read the official Ubuntu documentation:
Ubuntu Bug Control and Ubuntu Bug Squad: https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue
Ubuntu Kernel Team: https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Filing_Kernel_Bug_reports
Ubuntu Community: https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
km (srikrishnamohan) wrote :

Attached is the apport report.
Thanks!

Revision history for this message
penalvch (penalvch) wrote :

km, please file a new report against that crash report, as manually attaching it does not allow Apport to process it, delaying your issue. For more on this, please see https://wiki.ubuntu.com/ReportingBugs .

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired