NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]; oom-killer; and eventual kernel panic on 16.10 (upgrade from 16.04)

Bug #1655356 reported by Arul
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Expired
High
Unassigned

Bug Description

I have a Dell (PowerEdge T110/0V52N7, BIOS 1.6.4 03/02/2011) was running Ubuntu 16.04 for a while. Ever since I upgraded to 16.10, this problem started with errors, OOM and an eventual kernel panic. It can run fine for about 3-4 hours or so. I see the following errors on syslog (also attached w/ other logs and information I can gather).

Jan 9 07:36:32 gorilla kernel: [69304.099302] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
Jan 9 07:37:00 gorilla kernel: [69332.119587] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
Jan 9 07:37:33 gorilla kernel: [69364.114705] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
Jan 9 07:38:01 gorilla kernel: [69392.127352] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
Jan 9 07:38:37 gorilla kernel: [69428.134132] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:50]
Jan 9 07:39:45 gorilla kernel: [69496.112694] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kswapd0:50]
Jan 9 07:40:13 gorilla kernel: [69524.112050] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
Jan 9 07:40:49 gorilla kernel: [69560.104511] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
Jan 9 07:41:17 gorilla kernel: [69588.107302] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]
Jan 9 07:41:45 gorilla kernel: [69616.104843] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kswapd0:50]

Jan 8 11:52:27 gorilla kernel: [ 2852.818471] rsync invoked oom-killer: gfp_mask=0x26000d0(GFP_TEMPORARY|__GFP_NOTRACK), order=0, oom_score_adj=0
Jan 9 07:38:56 gorilla kernel: [69448.096571] kthreadd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:39:46 gorilla kernel: [69497.705922] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:40:50 gorilla kernel: [69561.956773] sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:41:10 gorilla kernel: [69582.329364] rsync invoked oom-killer: gfp_mask=0x26000d0(GFP_TEMPORARY|__GFP_NOTRACK), order=0, oom_score_adj=0
Jan 9 07:42:40 gorilla kernel: [69672.181041] sessionclean invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:42:41 gorilla kernel: [69673.298714] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:42:59 gorilla kernel: [69691.320169] apache2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:43:03 gorilla kernel: [69694.769140] sessionclean invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Jan 9 07:43:20 gorilla kernel: [69712.255535] kthreadd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0

Jan 8 11:46:11 gorilla kernel: [ 2476.342532] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Jan 8 11:49:04 gorilla kernel: [ 2650.045417] perf: interrupt took too long (3147 > 3140), lowering kernel.perf_event_max_sample_rate to 63500
Jan 8 11:49:56 gorilla kernel: [ 2701.973751] perf: interrupt took too long (3982 > 3933), lowering kernel.perf_event_max_sample_rate to 50000
Jan 8 11:51:47 gorilla kernel: [ 2812.208307] perf: interrupt took too long (4980 > 4977), lowering kernel.perf_event_max_sample_rate to 40000
Jan 8 13:56:06 gorilla kernel: [ 5678.539070] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Jan 8 15:59:49 gorilla kernel: [13101.158417] perf: interrupt took too long (3148 > 3141), lowering kernel.perf_event_max_sample_rate to 63500
Jan 9 02:15:54 gorilla kernel: [50065.939132] perf: interrupt took too long (3942 > 3935), lowering kernel.perf_event_max_sample_rate to 50500
Jan 9 07:35:30 gorilla kernel: [69241.742219] perf: interrupt took too long (4932 > 4927), lowering kernel.perf_event_max_sample_rate to 40500
Jan 9 07:35:54 gorilla kernel: [69265.928531] perf: interrupt took too long (6170 > 6165), lowering kernel.perf_event_max_sample_rate to 32250
Jan 9 07:36:53 gorilla kernel: [69325.386696] perf: interrupt took too long (7723 > 7712), lowering kernel.perf_event_max_sample_rate to 25750

Just to make sure if this is not memory related, I ran memtest for 12 passes over night and found no errors on memory. Removed the external backup drives to isolate the problem. Checked similar issues on lanchpad.net but most of them are related to video driver and power supply.

Appreciate help.

Thanks
-Arul

Attachments:
------------

syslog
uname.txt
swaps.txt
dmesg.txt
df.txt
lspci.txt
lsusb.txt
meminfo.txt
cpuinfo.txt
---
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: i386
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.10
HibernationDevice: RESUME=UUID=f7ae4452-47ac-43eb-992d-8f5fc8e26f93
InstallationDate: Installed on 2011-08-13 (1977 days ago)
InstallationMedia: Ubuntu-Server 11.04 "Natty Narwhal" - Release i386 (20110426)
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge T110
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.8.0-32-generic root=UUID=3dcb6aa5-7006-4707-9985-1125faa68aca ro quiet
ProcVersionSignature: Ubuntu 4.8.0-32.34-generic 4.8.11
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.8.0-32-generic N/A
 linux-backports-modules-4.8.0-32-generic N/A
 linux-firmware 1.161.1
RfKill: Error: [Errno 2] No such file or directory
Tags: yakkety
Uname: Linux 4.8.0-32-generic i686
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: Upgraded to yakkety on 2017-01-01 (9 days ago)
UserGroups: fuse
_MarkForUpload: False
dmi.bios.date: 03/02/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.6.4
dmi.board.name: 0V52N7
dmi.board.vendor: Dell Inc.
dmi.board.version: A02
dmi.chassis.type: 17
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.6.4:bd03/02/2011:svnDellInc.:pnPowerEdgeT110:pvr:rvnDellInc.:rn0V52N7:rvrA02:cvnDellInc.:ct17:cvr:
dmi.product.name: PowerEdge T110
dmi.sys.vendor: Dell Inc.

Revision history for this message
Arul (arulselvan) wrote :
Revision history for this message
Arul (arulselvan) wrote :

Attached files.tar containing

cpuinfo.txt
df.txt
dmesg.txt
initctl.txt
lspci.txt
lsusb.txt
meminfo.txt
swaps.txt
uname.txt

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1655356/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
tags: added: yakkety
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1655356

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Arul (arulselvan) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Arul (arulselvan) wrote : CRDA.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : JournalErrors.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : Lspci.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : Lsusb.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : ProcModules.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : UdevDb.txt

apport information

Revision history for this message
Arul (arulselvan) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Arul (arulselvan) wrote :

Yesterday, I disabled NMI watch (a workaround discussed in another similar bug report) but it still crashed this morning.

root@gorilla:~# sysctl -a|grep kernel.nmi_watchdog
kernel.nmi_watchdog = 0

Today, I upgraded to 4.8.0-34 from 4.8.0-32 kernel to see if that makes any difference.

Revision history for this message
Arul (arulselvan) wrote :

Kernel update to 4.8.0-34 did not make any difference.

Help please.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Does booting back with the Xenial kernel make the bug go away?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.10 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10-rc3

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Revision history for this message
Arul (arulselvan) wrote :

@jsalisbury, Thank you for the suggestions. I don't have the xenial kernel handy as I have the habit of running apt-get autoremove after upgrades. However, I will try the latest upstream kernel over this weekend and report the results.

Revision history for this message
Arul (arulselvan) wrote :

Before trying the upstream kernel, I tried to replicate the issue. After noticing it was happening every time there are heavy file I/O. I was able to easily reproduce it at will by running apps that do lot of file I/O. I was also monitoring free memory every second to understand why kernel is invoking oom-killer to randomly killing applications. When oom-killer started to kill random applications, the memory looked like this.

Every 1.0s: free -h gorilla: Sat Jan 14 09:52:01 2017
              total used free shared buff/cache available
Mem: 5.9G 755M 127M 17M 5.1G 4.6G
Swap: 2.0G 0B 2.0G

As you can see, there are lot of available memory (mostly in cache and I am very sure most of it are clean cache) but for some reason, it was not reclaimed by kernel (kswapd0?). So I decided to run "echo 3 > /proc/sys/vm/drop_caches" frequently to force dropping cache, and sure enough everything worked fine. Right now, I haven't seen this problem in the last 2+ days.

root@gorilla:~# cat /var/log/syslog|egrep "NMI watchdog: BUG: soft lockup|oom-killer"
root@gorilla:~# uptime
 07:37:29 up 2 days, 19:34, 1 user, load average: 1.63, 0.77, 0.29

Now that I suspect this may be a possible bug in kswapd0, I did a search here for similar issues for kswapd0 and found one (see below) but I am not sure it is the same problem though the symptoms and workaround are same.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457

At the end of this report (comment #142) says, they have no problem in 4.4.0-45 kernel but Yakkety based 4.8+ kernel has this problem. Assuming this is the same issue, I can confirm the same as I have never had this problem before upgrading to Yakkety. I am wondering if the bug made its way back since this fix. Since I have a workaround, I am going to continue with it; it is not ideal but seem to hold it. The last note on the above report says the bug is fixed and any new problem should be opened as a new bug. Can this report be treated as new bug to address this problem?

Thanks

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This may be related to bug 1655842 . Can you test the test kernel for that bug which can be downloaded here:

http://kernel.ubuntu.com/~jsalisbury/lp1655842/

Revision history for this message
Arul (arulselvan) wrote :

Update: After upgrading to 17.04 (i.e. kernel 4.10*), I no longer have this problem.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.