cpu soft lockup causes system lockup

Bug #1053491 reported by Carl Benson
This bug affects 1 person
Affects          Status      Importance  Assigned to  Milestone
linux (Ubuntu)   Incomplete  High        Unassigned

Bug Description

This large computational server with many users ran for 141 days, then suffered a CPU soft lockup.
After the first occurrence, the soft lockup repeated continuously.

System load gradually built up. After several hours, it was in the 20s. Shortly thereafter, the system
became unresponsive and had to be reset. Upon reboot, the system was fine.

Here is the syslog for the first CPU soft lockup:

Sep 16 14:20:26 flatfish kernel: [12283888.860058] BUG: soft lockup - CPU#7 stuck for 22s! [ps:31125]
Sep 16 14:20:26 flatfish kernel: [12283888.864001] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 autofs4 microcode clip atm parport_pc ppdev binfmt_misc nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc shpchp psmouse radeon i7300_edac ttm drm_kms_helper drm edac_core dm_multipath dcdbas joydev serio_raw i2c_algo_bit mac_hid lp parport ses enclosure uas usb_storage usbhid hid sata_sil24 bnx2 megaraid_sas dm_raid45 xor dm_mirror dm_region_hash dm_log
Sep 16 14:20:26 flatfish kernel: [12283888.864001] CPU 7
Sep 16 14:20:26 flatfish kernel: [12283888.864001] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 autofs4 microcode clip atm parport_pc ppdev binfmt_misc nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc shpchp psmouse radeon i7300_edac ttm drm_kms_helper drm edac_core dm_multipath dcdbas joydev serio_raw i2c_algo_bit mac_hid lp parport ses enclosure uas usb_storage usbhid hid sata_sil24 bnx2 megaraid_sas dm_raid45 xor dm_mirror dm_region_hash dm_log
Sep 16 14:20:26 flatfish kernel: [12283888.864001]
Sep 16 14:20:26 flatfish kernel: [12283888.864001] Pid: 31125, comm: ps Tainted: G D W 3.2.0-23-generic #36-Ubuntu Dell Inc. PowerEdge R900/0TT975
Sep 16 14:20:26 flatfish kernel: [12283888.864001] RIP: 0010:[<ffffffff8103db22>] [<ffffffff8103db22>] __ticket_spin_lock+0x22/0x30
Sep 16 14:20:26 flatfish kernel: [12283888.864001] RSP: 0018:ffff8807c4d55a48 EFLAGS: 00000287
Sep 16 14:20:26 flatfish kernel: [12283888.864001] RAX: 000000000000c49f RBX: 00ff881f00000000 RCX: ffff881f6c5f8e00
Sep 16 14:20:26 flatfish kernel: [12283888.864001] RDX: 000000000000c4a0 RSI: ffffffff81c28020 RDI: ffff881f59846680
Sep 16 14:20:26 flatfish kernel: [12283888.864001] RBP: ffff8807c4d55a48 R08: 0000000000000009 R09: 0000000000000000
Sep 16 14:20:26 flatfish kernel: [12283888.864001] R10: ffff881eb5cbafc0 R11: 0000000000000009 R12: ffff8807c4d5590a
Sep 16 14:20:26 flatfish kernel: [12283888.864001] R13: 000000000000ffff R14: ffff8807c4d559ce R15: ffff8807c4d559a8
Sep 16 14:20:26 flatfish kernel: [12283888.988018] FS: 00007fed758f8700(0000) GS:ffff881fbf2e0000(0000) knlGS:0000000000000000
Sep 16 14:20:26 flatfish kernel: [12283888.988018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 14:20:26 flatfish kernel: [12283888.988018] CR2: 00007fff2da5dff8 CR3: 0000001f5f72f000 CR4: 00000000000006e0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 14:20:26 flatfish kernel: [12283888.988018] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 14:20:26 flatfish kernel: [12283888.988018] Process ps (pid: 31125, threadinfo ffff8807c4d54000, task ffff881f6d99ade0)
Sep 16 14:20:26 flatfish kernel: [12283888.988018] Stack:
Sep 16 14:20:26 flatfish kernel: [12283888.988018] ffff8807c4d55a58 ffffffff8165c46e ffff8807c4d55ac8 ffffffffa023a96e
Sep 16 14:20:26 flatfish kernel: [12283888.988018] ffff8807c4d55aa8 0000000000000001 ffff8807c4d55a98 ffff881eb5cbaff8
Sep 16 14:20:26 flatfish kernel: [12283888.988018] 00000009bf59e000 ffff881f59846680 ffff8807c4d55b58 ffff881eb5cbafc0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] Call Trace:
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8165c46e>] _raw_spin_lock+0xe/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023a96e>] autofs4_lookup_expiring+0x4e/0x120 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023aa61>] do_expire_wait+0x21/0xb0 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023b153>] autofs4_d_manage+0x93/0xb0 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811828db>] follow_managed+0xab/0x140
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811848bd>] do_lookup+0x14d/0x310
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81185278>] link_path_walk+0x138/0x870
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811864bd>] ? path_init+0x2ed/0x3c0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811865e8>] path_lookupat+0x58/0x750
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81318c77>] ? __strncpy_from_user+0x27/0x60
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81186d11>] do_path_lookup+0x31/0xc0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81187819>] user_path_at_empty+0x59/0xa0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8118cc17>] ? prepend_path+0x97/0x1d0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8119724f>] ? mntput+0x1f/0x30
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8165c46e>] ? _raw_spin_lock+0xe/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81187871>] user_path_at+0x11/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117c8da>] vfs_fstatat+0x3a/0x70
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8119535e>] ? vfsmount_lock_local_unlock+0x1e/0x30
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81197170>] ? mntput_no_expire+0x30/0xf0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117c94b>] vfs_stat+0x1b/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117ca8a>] sys_newstat+0x1a/0x40
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117cbea>] ? sys_readlinkat+0x7a/0xb0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81664a82>] system_call_fastpath+0x16/0x1b
Sep 16 14:20:26 flatfish kernel: [12283888.988018] Code: 90 90 90 90 90 90 90 90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66 0f 1f 84 00 00 00 00 00 f3 90 <0f> b7 07 66 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17 55 31 c0 48 89
Sep 16 14:20:26 flatfish kernel: [12283888.988018] Call Trace:
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8165c46e>] _raw_spin_lock+0xe/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023a96e>] autofs4_lookup_expiring+0x4e/0x120 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023aa61>] do_expire_wait+0x21/0xb0 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffffa023b153>] autofs4_d_manage+0x93/0xb0 [autofs4]
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811828db>] follow_managed+0xab/0x140
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811848bd>] do_lookup+0x14d/0x310
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81185278>] link_path_walk+0x138/0x870
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811864bd>] ? path_init+0x2ed/0x3c0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff811865e8>] path_lookupat+0x58/0x750
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81318c77>] ? __strncpy_from_user+0x27/0x60
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81186d11>] do_path_lookup+0x31/0xc0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81187819>] user_path_at_empty+0x59/0xa0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8118cc17>] ? prepend_path+0x97/0x1d0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8119724f>] ? mntput+0x1f/0x30
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8165c46e>] ? _raw_spin_lock+0xe/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81187871>] user_path_at+0x11/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117c8da>] vfs_fstatat+0x3a/0x70
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8119535e>] ? vfsmount_lock_local_unlock+0x1e/0x30
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81197170>] ? mntput_no_expire+0x30/0xf0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117c94b>] vfs_stat+0x1b/0x20
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117ca8a>] sys_newstat+0x1a/0x40
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff8117cbea>] ? sys_readlinkat+0x7a/0xb0
Sep 16 14:20:26 flatfish kernel: [12283888.988018] [<ffffffff81664a82>] system_call_fastpath+0x16/0x1b
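
Since the lockup repeated continuously, it can help to pull the header lines out of syslog and tabulate which CPU and process were stuck each time. The following is a minimal sketch, assuming the header always has the shape of the first line above; the function name `parse_soft_lockup` and the dictionary keys are hypothetical, and treating the bracketed kernel timestamp as seconds since boot is an assumption.

```python
import re

# Pattern for the soft-lockup header line, modelled on the first syslog line above.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return a dict describing a soft-lockup header line, or None if no match."""
    m = SOFT_LOCKUP.search(line)
    if m is None:
        return None
    uptime = float(m.group("uptime"))  # assumed: seconds since boot
    return {
        "cpu": int(m.group("cpu")),
        "stuck_seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "uptime_days": uptime / 86400.0,
    }


line = ("Sep 16 14:20:26 flatfish kernel: [12283888.860058] BUG: soft lockup - "
        "CPU#7 stuck for 22s! [ps:31125]")
info = parse_soft_lockup(line)
print(info)
```

For the line above this yields CPU 7, process ps (pid 31125), and an uptime of about 142 days, which is in line with the roughly 141 days of uptime reported.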

This system relies heavily on NFS v3 for user home directories, scratch directories, and data.

This system has 16 cores, 128 GB RAM.

We have previously seen this behavior with openSUSE 11.3, where we opened a bug
report: https://bugzilla.novell.com/show_bug.cgi?id=707765

lsb_release -rd
Description: Ubuntu 12.04 LTS
Release: 12.04

I'm sending the results of "ubuntu-bug linux".
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Sep 16 17:31 seq
 crw-rw---T 1 root audio 116, 33 Sep 16 17:31 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu8
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=10e20f45-babc-45a8-b35a-8a10a78db200
InstallationMedia: Ubuntu-Server 12.04 LTS "Precise Pangolin" - Release amd64 (20120424.1)
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R900
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_US:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-23-generic root=UUID=5eefd515-a701-449a-962c-44f3536658df ro crashkernel=384M-2G:64M,2G-:128M quiet
ProcVersionSignature: Ubuntu 3.2.0-23.36-generic 3.2.14
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-23-generic N/A
 linux-backports-modules-3.2.0-23-generic N/A
 linux-firmware 1.79
RfKill: Error: [Errno 2] No such file or directory
Tags: precise
Uname: Linux 3.2.0-23-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

WifiSyslog:
 Sep 20 08:57:06 lamprey kernel: [314190.440030] megaraid_sas 0000:1a:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
 Sep 20 09:10:23 lamprey kernel: [314782.500029] megaraid_sas 0000:1a:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
dmi.bios.date: 10/09/2007
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.1.0
dmi.board.name: 0TT975
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.1.0:bd10/09/2007:svnDellInc.:pnPowerEdgeR900:pvr:rvnDellInc.:rn0TT975:rvrA01:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R900
dmi.sys.vendor: Dell Inc.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1053491

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise

Carl Benson (cbenson) wrote : AcpiTables.txt

apport information

tags: added: apport-collected
description: updated

Carl Benson (cbenson) wrote : BootDmesg.txt

apport information

Carl Benson (cbenson) wrote : CurrentDmesg.txt

apport information

Carl Benson (cbenson) wrote : Lspci.txt

apport information

Carl Benson (cbenson) wrote : Lsusb.txt

apport information

Carl Benson (cbenson) wrote : ProcCpuinfo.txt

apport information

Carl Benson (cbenson) wrote : ProcInterrupts.txt

apport information

Carl Benson (cbenson) wrote : ProcModules.txt

apport information

Carl Benson (cbenson) wrote : UdevDb.txt

apport information

Carl Benson (cbenson) wrote : UdevLog.txt

apport information

Joseph Salisbury (jsalisbury) wrote :

This seems to be a duplicate of bug 922906. This bug also exists upstream.

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
tags: added: kernel-bug-exists-upstream ticket-spin-lock
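
The trace shows the CPU spinning in __ticket_spin_lock, and the bug is tagged ticket-spin-lock. A ticket lock gives each waiter a number and serves waiters strictly in order, so a waiter spins forever if the holder ahead of it never releases. The following is a toy model in Python, not the kernel implementation: the class `TicketLock`, the helper `waiters_ahead`, and the 16-bit counter width are assumptions made for illustration. Reading RAX (c49f) as the ticket being served and RDX (c4a0) as this CPU's own ticket is an interpretation of the register dump, not something the report states.

```python
MASK16 = 0xFFFF  # assumed width of each counter (head and tail)


class TicketLock:
    """Toy model of a ticket lock: waiters are served strictly in ticket order."""

    def __init__(self, start=0):
        self.head = start & MASK16  # ticket currently being served
        self.tail = start & MASK16  # next ticket to hand out

    def take_ticket(self):
        """Hand out the next ticket and advance the tail (wraps at 16 bits)."""
        ticket = self.tail
        self.tail = (self.tail + 1) & MASK16
        return ticket

    def is_my_turn(self, ticket):
        """A waiter may enter only when the head equals its ticket."""
        return self.head == ticket

    def unlock(self):
        """Release the lock by advancing the head to the next ticket."""
        self.head = (self.head + 1) & MASK16


def waiters_ahead(head, ticket):
    """Number of releases needed before `ticket` is served (wrap-safe)."""
    return (ticket - head) & MASK16


lock = TicketLock(start=0xC49F)
owner = lock.take_ticket()    # ticket 0xC49F, served immediately
waiter = lock.take_ticket()   # ticket 0xC4A0, must wait for the owner
assert lock.is_my_turn(owner)
assert not lock.is_my_turn(waiter)
print(waiters_ahead(lock.head, waiter))

lock.unlock()                 # only the owner's release lets the waiter in
assert lock.is_my_turn(waiter)
```

Under that reading, the stuck ps process was next in line for the lock, which would mean the holder directly ahead of it never released it.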