NMI watchdog: Watchdog detected hard LOCKUP on cpu 0 - Xenial - Python

Bug #1596866 reported by Benjamin Kaehne
This bug affects 7 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

I am getting fairly regular hard lockups with python (2.7) on xenial:

Linux rts-os-s-03 4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 16.04 LTS \n \l

Python 2.7:
ii python 2.7.11-1 amd64 interactive high-level object-oriented language (default version)
ii python2.7 2.7.11-7ubuntu1 amd64 Interactive high-level object-oriented language (version 2.7)

Python 3:
ii python3 3.5.1-3 amd64 interactive high-level object-oriented language (default python3 version)

Jun 28 06:52:42 XXXX kernel: [ 1634.052991] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
Jun 28 06:52:42 XXXX kernel: [ 1634.059516] Modules linked in: iptable_raw kvm_intel ebtable_filter ebtables ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_tcpudp iptable_filter ip_tables x_tables veth bridge stp llc bonding dcdbas intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass shpchp lpc_ich ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch nf_defrag_ipv6 nf_conntrack autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd bnx2x ahci libahci tg3 megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ptp pps_core mdio libcrc32c [last unloaded: kvm_intel]
Jun 28 06:52:42 XXXX kernel: [ 1634.059790] CPU: 0 PID: 52914 Comm: python Not tainted 4.4.0-28-generic #47-Ubuntu
Jun 28 06:52:42 XXXX kernel: [ 1634.059791] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
Jun 28 06:52:42 XXXX kernel: [ 1634.059792] 0000000000000086 000000002732bfd7 ffff887b254bbbd0 ffffffff813eb1a3
Jun 28 06:52:42 XXXX kernel: [ 1634.059794] 0000000000000000 0000000000000000 ffff887b254bbbe8 ffffffff8113b3bd
Jun 28 06:52:42 XXXX kernel: [ 1634.059796] ffff887e4da1a000 ffff887b254bbc20 ffffffff81183e4c 0000000000000001
Jun 28 06:52:42 XXXX kernel: [ 1634.059797] Call Trace:
Jun 28 06:52:42 XXXX kernel: [ 1634.059804] [<ffffffff813eb1a3>] dump_stack+0x63/0x90
Jun 28 06:52:42 XXXX kernel: [ 1634.059807] [<ffffffff8113b3bd>] watchdog_overflow_callback+0xbd/0xd0
Jun 28 06:52:42 XXXX kernel: [ 1634.059810] [<ffffffff81183e4c>] __perf_event_overflow+0x8c/0x1d0
Jun 28 06:52:42 XXXX kernel: [ 1634.059811] [<ffffffff81184a24>] perf_event_overflow+0x14/0x20
Jun 28 06:52:42 XXXX kernel: [ 1634.059814] [<ffffffff8100c4d1>] intel_pmu_handle_irq+0x1e1/0x4a0
Jun 28 06:52:42 XXXX kernel: [ 1634.059817] [<ffffffff81197001>] ? __alloc_pages_nodemask+0x1b1/0xb60
Jun 28 06:52:42 XXXX kernel: [ 1634.059821] [<ffffffff811fc3f4>] ? try_charge+0xd4/0x640
Jun 28 06:52:42 XXXX kernel: [ 1634.059823] [<ffffffff81200b4b>] ? mem_cgroup_try_charge+0x6b/0x1b0
Jun 28 06:52:42 XXXX kernel: [ 1634.059826] [<ffffffff8119e667>] ? lru_cache_add_active_or_unevictable+0x27/0xa0
Jun 28 06:52:42 XXXX kernel: [ 1634.059830] [<ffffffff811bfffa>] ? handle_mm_fault+0xcaa/0x1820
Jun 28 06:52:42 XXXX kernel: [ 1634.059831] [<ffffffff811c5fbe>] ? vma_merge+0x22e/0x330
Jun 28 06:52:42 XXXX kernel: [ 1634.059834] [<ffffffff810056dd>] perf_event_nmi_handler+0x2d/0x50
Jun 28 06:52:42 XXXX kernel: [ 1634.059837] [<ffffffff810323c9>] nmi_handle+0x69/0x120
Jun 28 06:52:42 XXXX kernel: [ 1634.059839] [<ffffffff81032900>] default_do_nmi+0x40/0x100
Jun 28 06:52:42 XXXX kernel: [ 1634.059841] [<ffffffff81032aa2>] do_nmi+0xe2/0x130
Jun 28 06:52:42 XXXX kernel: [ 1634.059844] [<ffffffff81829ac6>] nmi+0x56/0xa5

As the log suggests, this is causing hard lockups and/or pauses.
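
For anyone triaging similar logs, here is a minimal sketch (not part of the original report) that pulls hard-lockup events out of kernel log lines like the ones above. The function name `find_hard_lockups` and the regular expression are illustrative assumptions based only on the message format shown in this report; other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
#   "... kernel: [ 1634.052991] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0"
# The pattern is derived from the log excerpt in this report (an assumption,
# not an official format definition).
LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"NMI watchdog: Watchdog detected hard LOCKUP on cpu (?P<cpu>\d+)"
)


def find_hard_lockups(lines):
    """Return a list of (uptime_seconds, cpu) for each hard-lockup line."""
    events = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            events.append((float(match.group("uptime")), int(match.group("cpu"))))
    return events


sample = [
    "Jun 28 06:52:42 XXXX kernel: [ 1634.052991] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0",
    "Jun 28 06:52:42 XXXX kernel: [ 1634.059797] Call Trace:",
]
print(find_hard_lockups(sample))
```

On a live system the same function could be fed an open log file, for example `/var/log/kern.log` on a default Ubuntu install (the path may differ depending on the logging setup).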
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 29 04:12 seq
 crw-rw---- 1 root audio 116, 33 Jun 29 04:12 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 002 Device 002: ID 8087:8002 Intel Corp.
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub
 Bus 001 Device 002: ID 8087:800a Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R730
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-28-generic root=UUID=58ac0f2e-dff9-4433-9a89-a8cba7a8154b ro console=tty0 console=ttyS0,115200n8 console=ttyS1,115200n8 acpi=off modprobe.blacklist=mei_me
ProcVersionSignature: Ubuntu 4.4.0-28.47-generic 4.4.13
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-28-generic N/A
 linux-backports-modules-4.4.0-28-generic N/A
 linux-firmware 1.157.1
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial uec-images
Uname: Linux 4.4.0-28-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 10/002/2015
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.5.4
dmi.board.name: 0H21J3
dmi.board.vendor: Dell Inc.
dmi.board.version: A12
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.5.4:bd10/002/2015:svnDellInc.:pnPowerEdgeR730:pvr:rvnDellInc.:rn0H21J3:rvrA12:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R730
dmi.sys.vendor: Dell Inc.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1596866

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Benjamin Kaehne (ben-kaehne) wrote : CRDA.txt

apport information

tags: added: apport-collected uec-images
description: updated
Benjamin Kaehne (ben-kaehne) wrote : CurrentDmesg.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : JournalErrors.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : Lspci.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : ProcCpuinfo.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : ProcInterrupts.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : ProcModules.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : UdevDb.txt

apport information

Benjamin Kaehne (ben-kaehne) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Benjamin Kaehne (ben-kaehne) wrote :

Uploading kern.log too. dmesg was cleared as server needed to be rebooted.

Benjamin Kaehne (ben-kaehne) wrote :
JuanJo Ciarlante (jjo) wrote :

A similar, fairly recent finding, in case it helps:
  https://github.com/TobleMiner/wintron7.0/issues/2
It was worked around with clocksource=tsc; my guess is that
ntpq would also show a large drift.
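
To see which clocksource a machine is using before trying that workaround, a small sketch that reads the usual sysfs files could look like this. The helper name `read_clocksources` is hypothetical; the paths are the standard Linux sysfs locations, but they may be missing in some containers or restricted environments, so treat this as a sketch rather than a definitive check.

```python
from pathlib import Path

# Standard sysfs location for clocksource information on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources found under `base`."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    if CLOCKSOURCE_DIR.is_dir():
        current, available = read_clocksources()
        print("current:", current)
        print("available:", ", ".join(available))
    else:
        print("clocksource sysfs directory not found")
```

The `base` argument exists only so the function can be pointed at a test directory instead of the real sysfs tree.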

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.7 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7-rc5-yakkety/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
Benjamin Kaehne (ben-kaehne) wrote :

@jjo
So far I am having success with clocksource=tsc, despite warnings from the kernel telling me it is unstable.

@jsalisbury
This is from a clean xenial install/update. I will try to post test results from the new kernel shortly.
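
To confirm that a parameter such as clocksource=tsc actually reached the kernel, a rough sketch that splits a kernel command line into parameters is shown here. `parse_cmdline` is a hypothetical helper, the sample string is the ProcKernelCmdLine from the bug description, and the parsing deliberately ignores quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter name to the list of values it was given.

    Flags without '=' (such as 'ro') map to [None]. Parameters that appear
    more than once (such as 'console') keep every value, in order.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params.setdefault(key, []).append(value if sep else None)
    return params


# Command line copied from the ProcKernelCmdLine field of this report.
cmdline = (
    "BOOT_IMAGE=/boot/vmlinuz-4.4.0-28-generic "
    "root=UUID=58ac0f2e-dff9-4433-9a89-a8cba7a8154b ro "
    "console=tty0 console=ttyS0,115200n8 console=ttyS1,115200n8 "
    "acpi=off modprobe.blacklist=mei_me"
)

params = parse_cmdline(cmdline)
print("clocksource set:", "clocksource" in params)
print("consoles:", params["console"])
```

On a running system the input would normally come from reading `/proc/cmdline` instead of a hard-coded string.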

James Troup (elmo)
Changed in linux (Ubuntu):
status: Incomplete → New
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Kevin O'Gorman (kogorman-pacbell) wrote :

It looks like I'm having the same problem with a clean install of Xubuntu 16.04.1, and I'll be trying the clocksource tweak. If you don't hear back, it got fixed.

Kevin O'Gorman (kogorman-pacbell) wrote :

I was having this problem on two systems, one a Core i7, the other a Xeon. Both are x86-64, running the kernel that uname reports as 4.4.0-66-generic.

The problem has not recurred since the changes, but they were made only a few days ago.

Michael Sherman (msherman64) wrote :

I am experiencing this issue on a new Xeon Scalable machine.
I can confirm that it is present with the xenial linux-image-generic and linux-image-generic-hwe kernels (4.4.0-116 and 4.13.0-38), and I can reproduce it at will.

It no longer seems to occur on kernel 4.15.0-13.

In all cases, clocksource was already set to TSC automatically on install.
The soft lockup and hard lockup were triggered each time by a waf build of the ns3 project.
In each case, the failure occurred when the commands were run as a regular user, but the system was stable when they were run as root or with sudo.
