CPU Cores Stalls/Lock

Bug #1697132 reported by William Osborne
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Colin Ian King

Bug Description

The CPU stalls or locks up after a random period of time. I cannot trace the issue to high resource usage or heat. The system will crash while sitting essentially idle. Sometimes it crashes after a few minutes; sometimes it takes days. A large file transfer will usually force the crash, though.

I have been having a similar issue since Ubuntu 14.04. I have upgraded to every version along the way to 17.04 with the same issue. I have replaced every piece of hardware except the HDD: motherboard, processor, power supply, memory, and video card. The issue occurs with and without the AMD microcode. This is a server, so there is no GUI.

I've tried upstream kernels but they don't support ZFS and I haven't had any luck shimming it in. Happy to try newer kernels if someone can provide one with ZFS support.

Jun 9 18:54:57 TheSource kernel: [46465.931527] INFO: rcu_sched detected stalls on CPUs/tasks:
Jun 9 18:54:57 TheSource kernel: [46465.931583] 9-...: (12 GPs behind) idle=f03/1/0 softirq=261811/261811 fqs=6599
Jun 9 18:54:57 TheSource kernel: [46465.931624] (detected by 0, t=15002 jiffies, g=380049, c=380048, q=118270)
Jun 9 18:54:57 TheSource kernel: [46465.931665] Task dump for CPU 9:
Jun 9 18:54:57 TheSource kernel: [46465.931666] swapper/9 R running task 0 0 1 0x00000008
Jun 9 18:54:57 TheSource kernel: [46465.931670] Call Trace:
Jun 9 18:54:57 TheSource kernel: [46465.931677] ? cpuidle_enter_state+0x122/0x2c0
Jun 9 18:54:57 TheSource kernel: [46465.931680] ? cpuidle_enter_state+0x110/0x2c0
Jun 9 18:54:57 TheSource kernel: [46465.931683] ? cpuidle_enter+0x17/0x20
Jun 9 18:54:57 TheSource kernel: [46465.931685] ? call_cpuidle+0x23/0x40
Jun 9 18:54:57 TheSource kernel: [46465.931688] ? do_idle+0x189/0x200
Jun 9 18:54:57 TheSource kernel: [46465.931690] ? cpu_startup_entry+0x71/0x80
Jun 9 18:54:57 TheSource kernel: [46465.931693] ? start_secondary+0x154/0x190
Jun 9 18:54:57 TheSource kernel: [46465.931696] ? start_cpu+0x14/0x14
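
For comparing stall reports like the one above across reboots, the key fields can be pulled out of the log programmatically. The following is a minimal Python sketch and not part of the original report; the names `CPU_LINE`, `DETECT_LINE`, and `summarize_rcu_stall` are hypothetical, and the patterns assume the line layout shown in this log (the "detected by" form, not the "self-detected" form).

```python
import re

# Per-CPU line of a stall report, e.g.
# "[46465.931583] 9-...: (12 GPs behind) idle=f03/1/0 ..."
CPU_LINE = re.compile(r"\]\s+(?P<cpu>\d+)-\S*: \((?P<gps>\d+) GPs behind\)")
# Detector line, e.g. "(detected by 0, t=15002 jiffies, ..."
DETECT_LINE = re.compile(r"\(detected by (?P<by>\d+), t=(?P<jiffies>\d+) jiffies")


def summarize_rcu_stall(lines):
    """Collect stalled CPUs and detector info from rcu_sched stall log lines."""
    summary = {"stalled": [], "detected_by": None, "jiffies": None}
    for line in lines:
        m = CPU_LINE.search(line)
        if m:
            # (CPU number, grace periods behind)
            summary["stalled"].append((int(m["cpu"]), int(m["gps"])))
            continue
        m = DETECT_LINE.search(line)
        if m:
            summary["detected_by"] = int(m["by"])
            summary["jiffies"] = int(m["jiffies"])
    return summary


# Lines copied from the report above.
SAMPLE = [
    "Jun 9 18:54:57 TheSource kernel: [46465.931527] INFO: rcu_sched detected stalls on CPUs/tasks:",
    "Jun 9 18:54:57 TheSource kernel: [46465.931583] 9-...: (12 GPs behind) idle=f03/1/0 softirq=261811/261811 fqs=6599",
    "Jun 9 18:54:57 TheSource kernel: [46465.931624] (detected by 0, t=15002 jiffies, g=380049, c=380048, q=118270)",
]

if __name__ == "__main__":
    print(summarize_rcu_stall(SAMPLE))
```

Running this over several crash logs would show whether the same CPU is always the one that stalls, which may help separate a single faulty core from a kernel-level problem.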

ProblemType: Bug
DistroRelease: Ubuntu 17.04
Package: linux-image-4.10.0-19-generic 4.10.0-19.21
ProcVersionSignature: Ubuntu 4.10.0-19.21-generic 4.10.8
Uname: Linux 4.10.0-19-generic x86_64
NonfreeKernelModules: zfs zunicode zavl zcommon znvpair
AlsaVersion: Advanced Linux Sound Architecture Driver Version k4.10.0-19-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.4-0ubuntu4.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D7p', '/dev/snd/pcmC0D3p', '/dev/snd/controlC0', '/dev/snd/by-path', '/dev/snd/hwC1D0', '/dev/snd/pcmC1D2c', '/dev/snd/pcmC1D1p', '/dev/snd/pcmC1D0c', '/dev/snd/pcmC1D0p', '/dev/snd/controlC1', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
Card1.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card1.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
Date: Fri Jun 9 20:17:45 2017
InstallationDate: Installed on 2017-06-03 (6 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Micro-Star International Co., Ltd MS-7A32
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.10.0-19-generic root=UUID=bba67c63-3cd9-4475-bf3a-a0890087330d ro
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-19-generic N/A
 linux-backports-modules-4.10.0-19-generic N/A
 linux-firmware 1.164.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 04/27/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.50
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X370 GAMING PRO CARBON (MS-7A32)
dmi.board.vendor: Micro-Star International Co., Ltd
dmi.board.version: 1.0
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Micro-Star International Co., Ltd
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.50:bd04/27/2017:svnMicro-StarInternationalCo.,Ltd:pnMS-7A32:pvr1.0:rvnMicro-StarInternationalCo.,Ltd:rnX370GAMINGPROCARBON(MS-7A32):rvr1.0:cvnMicro-StarInternationalCo.,Ltd:ct3:cvr1.0:
dmi.product.name: MS-7A32
dmi.product.version: 1.0
dmi.sys.vendor: Micro-Star International Co., Ltd

William Osborne (syntax.intact) wrote :
Joseph Salisbury (jsalisbury) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
William Osborne (syntax.intact) wrote :

Jun 11 09:03:08 TheSource kernel: [22835.941815] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 22s! [z_rd_int_4:2067]
Jun 11 09:03:08 TheSource kernel: [22835.941874] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntr$
Jun 11 09:03:08 TheSource kernel: [22835.941920] x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nouveau$
Jun 11 09:03:08 TheSource kernel: [22835.941952] CPU: 13 PID: 2067 Comm: z_rd_int_4 Tainted: P O L 4.10.0-19-generic #21-Ubuntu
Jun 11 09:03:08 TheSource kernel: [22835.941953] Hardware name: Micro-Star International Co., Ltd MS-7A32/X370 GAMING PRO CARBON (MS-7A32), BIOS 1.50 04/27/2017
Jun 11 09:03:08 TheSource kernel: [22835.941956] task: ffff9b0b828e0000 task.stack: ffffa7b84d1cc000
Jun 11 09:03:08 TheSource kernel: [22835.941962] RIP: 0010:smp_call_function_many+0x1f1/0x250
Jun 11 09:03:08 TheSource kernel: [22835.941963] RSP: 0018:ffffa7b84d1cfa50 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Jun 11 09:03:08 TheSource kernel: [22835.941966] RAX: 0000000000000003 RBX: 0000000000000010 RCX: 0000000000000009
Jun 11 09:03:08 TheSource kernel: [22835.941967] RDX: ffff9b0b8e85d9e0 RSI: 0000000000000010 RDI: ffff9b0b8e021bc0
Jun 11 09:03:08 TheSource kernel: [22835.941968] RBP: ffffa7b84d1cfa88 R08: fffffffffffffe00 R09: 000000000000dfff
Jun 11 09:03:08 TheSource kernel: [22835.941969] R10: ffffd676902b1200 R11: ffffa7b880e04000 R12: ffffffffb3c75e00
Jun 11 09:03:08 TheSource kernel: [22835.941970] R13: 0000000000000000 R14: ffff9b0b8e95a280 R15: 000000000001a240
Jun 11 09:03:08 TheSource kernel: [22835.941972] FS: 0000000000000000(0000) GS:ffff9b0b8e940000(0000) knlGS:0000000000000000
Jun 11 09:03:08 TheSource kernel: [22835.941973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 11 09:03:08 TheSource kernel: [22835.941975] CR2: 0000000000dc3520 CR3: 00000003d7c93000 CR4: 00000000003406e0
Jun 11 09:03:08 TheSource kernel: [22835.941976] Call Trace:
Jun 11 09:03:08 TheSource kernel: [22835.941982] ? leave_mm+0xc0/0xc0
Jun 11 09:03:08 TheSource kernel: [22835.941985] on_each_cpu+0x2d/0x60
Jun 11 09:03:08 TheSource kernel: [22835.941988] flush_tlb_kernel_range+0x4b/0x80
Jun 11 09:03:08 TheSource kernel: [22835.941991] __purge_vmap_area_lazy+0x50/0xc0
Jun 11 09:03:08 TheSource kernel: [22835.941994] free_vmap_area_noflush+0x7e/0x90
Jun 11 09:03:08 TheSource kernel: [22835.941996] remove_vm_area+0x6c/0x80
Jun 11 09:03:08 TheSource kernel: [22835.941999] __vunmap+0x26/0xd0
Jun 11 09:03:08 TheSource kernel: [22835.942001] vfree+0x2e/0x70
Jun 11 09:03:08 TheSource kernel: [22835.942009] kv_free.isra.6+0x3d/0x60 [spl]
Jun 11 09:03:08 TheSource kernel: [22835.942015] spl_slab_reclaim+0x1c9/0x210 [spl]
Jun 11 09:03:08 TheSource kernel: [22835.942021] spl_kmem_cache_free+0x121/0x1c0 [spl]
Jun 11 09:03:08 TheSource kernel: [22835.942071] zio_buf_free+0x58/0x60 [zfs]
Jun 11 09:03:08 TheSource kernel: [22835.942121] vdev_mirror_scrub_done+0x43/0x130 [zfs]
Jun 11 09:03:08 TheSource kernel: [22835.94216...


William Osborne (syntax.intact) wrote :

This issue is not related to ZFS. It simply happened to stall on that process.
I have since removed zfs and rebuilt with mdadm and installed an upstream kernel (4.10.17-041017-generic) and the issue continues.

William Osborne (syntax.intact) wrote :

I have installed kernel 4.12.2 and have had zero crashes in over 24 hours with several TB of network traffic through it. This is the first time in over a year. It seems stable, but I will let it burn in before calling it resolved by the kernel upgrade.

tags: added: kernel-fixed-upstream
Colin Ian King (colin-king) wrote :

It may also be worth stress testing your hardware with tools such as stress-ng to try to coerce issues like this out. See: http://kernel.ubuntu.com/~cking/stress-ng/

Changed in zfs-linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)
status: New → Triaged
tags: removed: kernel-fixed-upstream
William Osborne (syntax.intact) wrote :

I thought 4.12.2 fixed this issue, but it just now crashed again.
I have replaced every single piece of hardware: CPU (FX and Ryzen), motherboard (ASUS and MSI), memory, hard drives (WD Blue and RE4 for the boot disk, with Red for the storage array), and video card (NVIDIA and ATI).
The only commonality is Ubuntu with a large (7+ TB) disk array on some kind of 64-bit AMD CPU (I have not tried Intel), with and without proprietary CPU code. I have used zfs and mdadm. I switched from samba to nfs for shares.

This is a clean install with bind and OpenSSH selected from tasksel, and htop installed after the fact. Nothing else extra is installed.

The issue started in Ubuntu 14.04 at some point later in its life cycle.

tags: added: kernel-bug-exists-upstream
Colin Ian King (colin-king) wrote :

Just to factor out any specific BIOS/CPU issues, can you try with the kernel option:
processor.max_cstate=1

Edit /etc/default/grub and change:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

to:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash processor.max_cstate=1"

then run:

sudo update-grub

and reboot.

I just want to ensure that the hangs aren't caused by the CPUs going to a C state that causes this hang.
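
The edit described above can also be scripted. The following is a minimal Python sketch and not part of the original comment; the helper names `add_kernel_param` and `has_kernel_param` are hypothetical. It only transforms text, so writing the result back to /etc/default/grub, running `sudo update-grub`, and rebooting remain manual steps.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with param appended to the quoted value of key, if absent."""
    pattern = re.compile(r'^(%s=")([^"]*)(")' % re.escape(key), re.MULTILINE)

    def repl(match):
        options = match.group(2).split()
        if param not in options:  # keep the edit idempotent
            options.append(param)
        return match.group(1) + " ".join(options) + match.group(3)

    return pattern.sub(repl, grub_text)


def has_kernel_param(cmdline, param):
    """Check a kernel command line string, e.g. the contents of /proc/cmdline."""
    return param in cmdline.split()


if __name__ == "__main__":
    before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(before, "processor.max_cstate=1"), end="")
    # After rebooting, confirm the option actually reached the running kernel.
    with open("/proc/cmdline") as f:
        print(has_kernel_param(f.read(), "processor.max_cstate=1"))
```

Checking /proc/cmdline after the reboot guards against the common mistake of editing the file but forgetting to regenerate the GRUB configuration.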

Colin Ian King (colin-king) wrote :

Removing ZFS from the bug as this is not specific to ZFS

Changed in zfs-linux (Ubuntu):
status: Triaged → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
no longer affects: zfs-linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

Hi William,

Do you mind trying out the suggested workaround in comment #8? If I don't hear back from you in 4 weeks, I will close the bug.

Jon Kump (kump-jon) wrote :

I am experiencing the same problem on an Intel CPU. I just did a fresh install of 17.04 and the CPU stalls/locks.

I have never experienced this problem on any of the previous versions of Ubuntu I have installed on it.

The workstation is currently locked, and I cannot get more information until I get back into my office.

Jon Kump (kump-jon) wrote :

Sep 28 08:37:19 gandolf kernel: [ 6127.965285] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [sshd:3995]
Sep 28 08:37:19 gandolf kernel: [ 6127.965287] Modules linked in: vhost_net vhost macvtap macvlan ebtable_filter ebtables ip6table_filter ip6_tables xt_multiport iptable_filter bridge stp llc binfmt_misc snd_hda_codec_hdmi eeepc_wmi asus_wmi sparse_keymap intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hda_intel snd_hda_codec snd_usb_audio snd_seq_midi snd_hda_core snd_seq_midi_event snd_usbmidi_lib snd_seq snd_hwdep uvcvideo snd_rawmidi videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_seq_device videobuf2_core snd_pcm videodev aesni_intel media aes_x86_64 crypto_simd glue_helper cryptd intel_cstate joydev snd_timer intel_rapl_perf input_leds snd soundcore mei_me mei lpc_ich shpchp mac_hid parport_pc
Sep 28 08:37:19 gandolf kernel: [ 6127.965305] ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid pata_acpi nouveau mxm_wmi video i2c_algo_bit ttm firewire_ohci drm_kms_helper syscopyarea sysfillrect sysimgblt pata_marvell fb_sys_fops ahci r8169 drm firewire_core libahci mii crc_itu_t fjes wmi
Sep 28 08:37:19 gandolf kernel: [ 6127.965321] CPU: 0 PID: 3995 Comm: sshd Tainted: G L 4.10.0-35-generic #39-Ubuntu
Sep 28 08:37:19 gandolf kernel: [ 6127.965321] Hardware name: System manufacturer System Product Name/P8P67 LE, BIOS 1011 04/08/2011
Sep 28 08:37:19 gandolf kernel: [ 6127.965322] task: ffff9a92cb272d00 task.stack: ffffa8d682f70000
Sep 28 08:37:19 gandolf kernel: [ 6127.965324] RIP: 0010:smp_call_function_many+0x1ec/0x250
Sep 28 08:37:19 gandolf kernel: [ 6127.965324] RSP: 0018:ffffa8d682f73bd8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Sep 28 08:37:19 gandolf kernel: [ 6127.965325] RAX: 0000000000000003 RBX: 0000000000000004 RCX: 0000000000000001
Sep 28 08:37:19 gandolf kernel: [ 6127.965325] RDX: ffff9a92df49d0f8 RSI: 0000000000000004 RDI: ffff9a92ce81a780
Sep 28 08:37:19 gandolf kernel: [ 6127.965326] RBP: ffffa8d682f73c10 R08: ffffffffffffffff R09: 000000000000000e
Sep 28 08:37:19 gandolf kernel: [ 6127.965326] R10: fffff99c100f5800 R11: 0000000000000010 R12: ffffffffb2275e20
Sep 28 08:37:19 gandolf kernel: [ 6127.965327] R13: 0000000000000000 R14: ffff9a92df41a340 R15: 000000000001a300
Sep 28 08:37:19 gandolf kernel: [ 6127.965327] FS: 00007ff133d988c0(0000) GS:ffff9a92df400000(0000) knlGS:0000000000000000
Sep 28 08:37:19 gandolf kernel: [ 6127.965328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 28 08:37:19 gandolf kernel: [ 6127.965328] CR2: 00007ffcde3b0f0e CR3: 000000040ac9b000 CR4: 00000000000426f0
Sep 28 08:37:19 gandolf kernel: [ 6127.965329] Call Trace:
Sep 28 08:37:19 gandolf kernel: [ 6127.965331] ? leave_mm+0xc0/0xc0
Sep 28 08:37:19 gandolf kernel: [ 6127.965332] on_each_cpu+0x2d/0x60
Sep 28 08:37:19 gandolf kernel: [ 6127.965333] flush_tlb_kernel_range+0x4b/0x80
Sep 28 08:37:19 gandolf kernel: [ 6127.965334] __purge_vmap_area_lazy+0x50/0xc0
Sep 28 08:37:19 gandolf kernel: [ 6127.965335] vm_unmap_aliases+0x113/0x150
Sep 28 08:37:19...
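
The frames visible in this trace overlap with those in the earlier soft lockup logged on TheSource. The following is a minimal Python sketch, not part of the original report, of one way to compare two call traces; the names `FRAME`, `frames`, and `common_frames` are hypothetical, and the pattern assumes the `symbol+0xoffset/0xsize` frame layout seen in these logs.

```python
import re

# A frame looks like "on_each_cpu+0x2d/0x60", optionally preceded by "? ".
# Lines such as "RIP: 0010:..." or "Modules linked in: ..." do not match.
FRAME = re.compile(
    r"\]\s+(?:\? )?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(lines):
    """Extract function names, in order, from kernel call trace lines."""
    return [m["func"] for m in map(FRAME.search, lines) if m]


def common_frames(trace_a, trace_b):
    """Frames of trace_a, in order, that also appear in trace_b."""
    seen = set(frames(trace_b))
    return [f for f in frames(trace_a) if f in seen]


# Excerpts copied from the two soft lockup reports in this bug.
TRACE_THESOURCE = [
    "Jun 11 09:03:08 TheSource kernel: [22835.941982] ? leave_mm+0xc0/0xc0",
    "Jun 11 09:03:08 TheSource kernel: [22835.941985] on_each_cpu+0x2d/0x60",
    "Jun 11 09:03:08 TheSource kernel: [22835.941988] flush_tlb_kernel_range+0x4b/0x80",
    "Jun 11 09:03:08 TheSource kernel: [22835.941991] __purge_vmap_area_lazy+0x50/0xc0",
    "Jun 11 09:03:08 TheSource kernel: [22835.941994] free_vmap_area_noflush+0x7e/0x90",
]
TRACE_GANDOLF = [
    "Sep 28 08:37:19 gandolf kernel: [ 6127.965331] ? leave_mm+0xc0/0xc0",
    "Sep 28 08:37:19 gandolf kernel: [ 6127.965332] on_each_cpu+0x2d/0x60",
    "Sep 28 08:37:19 gandolf kernel: [ 6127.965333] flush_tlb_kernel_range+0x4b/0x80",
    "Sep 28 08:37:19 gandolf kernel: [ 6127.965334] __purge_vmap_area_lazy+0x50/0xc0",
    "Sep 28 08:37:19 gandolf kernel: [ 6127.965335] vm_unmap_aliases+0x113/0x150",
]

if __name__ == "__main__":
    print(common_frames(TRACE_THESOURCE, TRACE_GANDOLF))
```

A shared path through the same kernel functions on an AMD and an Intel machine would be consistent with a problem outside ZFS, though the comparison by itself does not identify a root cause.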


Jon Kump (kump-jon) wrote :

That stack trace was in the log just before it stalled and hard locked.

This is the trace that happened next:

Sep 28 08:37:34 gandolf kernel: [ 6142.796923] INFO: rcu_bh self-detected stall on CPU
Sep 28 08:37:34 gandolf kernel: [ 6142.796926] 0-...: (1 GPs behind) idle=653/140000000000001/0 softirq=72562/73365 fqs=56230
Sep 28 08:37:34 gandolf kernel: [ 6142.796927] (t=150012 jiffies g=-255 c=-256 q=10)
Sep 28 08:37:34 gandolf kernel: [ 6142.796928] Task dump for CPU 0:
Sep 28 08:37:34 gandolf kernel: [ 6142.796929] sshd R running task 0 3995 1 0x0000000c
Sep 28 08:37:34 gandolf kernel: [ 6142.796930] Call Trace:
Sep 28 08:37:34 gandolf kernel: [ 6142.796930] <IRQ>
Sep 28 08:37:34 gandolf kernel: [ 6142.796933] sched_show_task+0xd3/0x140
Sep 28 08:37:34 gandolf kernel: [ 6142.796934] dump_cpu_task+0x37/0x40
Sep 28 08:37:34 gandolf kernel: [ 6142.796936] rcu_dump_cpu_stacks+0x99/0xbd
Sep 28 08:37:34 gandolf kernel: [ 6142.796938] rcu_check_callbacks+0x703/0x850
Sep 28 08:37:34 gandolf kernel: [ 6142.796939] ? acct_account_cputime+0x1c/0x20
Sep 28 08:37:34 gandolf kernel: [ 6142.796940] ? account_system_time+0x7a/0x110
Sep 28 08:37:34 gandolf kernel: [ 6142.796942] ? tick_sched_handle.isra.15+0x60/0x60
Sep 28 08:37:34 gandolf kernel: [ 6142.796943] update_process_times+0x2f/0x60
Sep 28 08:37:34 gandolf kernel: [ 6142.796944] tick_sched_handle.isra.15+0x25/0x60
Sep 28 08:37:34 gandolf kernel: [ 6142.796945] tick_sched_timer+0x3d/0x70
Sep 28 08:37:34 gandolf kernel: [ 6142.796946] __hrtimer_run_queues+0xf3/0x270
Sep 28 08:37:34 gandolf kernel: [ 6142.796946] hrtimer_interrupt+0xa3/0x1e0
Sep 28 08:37:34 gandolf kernel: [ 6142.796948] ? leave_mm+0xc0/0xc0
Sep 28 08:37:34 gandolf kernel: [ 6142.796950] local_apic_timer_interrupt+0x38/0x60
Sep 28 08:37:34 gandolf kernel: [ 6142.796951] smp_apic_timer_interrupt+0x38/0x50
Sep 28 08:37:34 gandolf kernel: [ 6142.796952] apic_timer_interrupt+0x89/0x90
Sep 28 08:37:34 gandolf kernel: [ 6142.796953] RIP: 0010:smp_call_function_many+0x1ec/0x250
Sep 28 08:37:34 gandolf kernel: [ 6142.796954] RSP: 0018:ffffa8d682f73bd8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Sep 28 08:37:34 gandolf kernel: [ 6142.796954] RAX: 0000000000000003 RBX: 0000000000000004 RCX: 0000000000000001
Sep 28 08:37:34 gandolf kernel: [ 6142.796955] RDX: ffff9a92df49d0f8 RSI: 0000000000000004 RDI: ffff9a92ce81a780
Sep 28 08:37:34 gandolf kernel: [ 6142.796955] RBP: ffffa8d682f73c10 R08: ffffffffffffffff R09: 000000000000000e
Sep 28 08:37:34 gandolf kernel: [ 6142.796956] R10: fffff99c100f5800 R11: 0000000000000010 R12: ffffffffb2275e20
Sep 28 08:37:34 gandolf kernel: [ 6142.796956] R13: 0000000000000000 R14: ffff9a92df41a340 R15: 000000000001a300
Sep 28 08:37:34 gandolf kernel: [ 6142.796957] </IRQ>
Sep 28 08:37:34 gandolf kernel: [ 6142.796957] ? leave_mm+0xc0/0xc0
Sep 28 08:37:34 gandolf kernel: [ 6142.796958] ? smp_call_function_many+0x1c8/0x250
Sep 28 08:37:34 gandolf kernel: [ 6142.796959] ? leave_mm+0xc0/0xc0
Sep 28 08:37:34 gandolf kernel: [ 6142.796960] on_each_cpu+0x2d/0x60
Sep 28 08:37:34 gandolf kernel: [ 6142.796960] flush_tlb_kernel_range+0x4b/0x80
Sep 28 08:37:34 gandolf ke...


Jon Kump (kump-jon) wrote :

I upgraded the kernel to 4.12.13 and the problem is no longer happening.

Colin Ian King (colin-king) wrote :

OK, let's mark that as Fix Released.

Changed in linux (Ubuntu):
status: Incomplete → Fix Released