BUG: soft lockup - CPU#15 stuck for 59737s! [genload:22734]

Bug #1370421 reported by bugproxy
This bug affects 1 person
Affects          Status    Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid   Medium       Unassigned

Bug Description

== Comment: #0 - ABDUL HALEEM <email address hidden> - 2014-09-01 05:24:37 ==
---Problem Description---
CPU stalls and a soft lockup occur on a CPU while running the ltpstress.sh test of the LTP suite. The detailed syslog and the test logs are attached.

Contact Information = <email address hidden>

---uname output---
Linux ubuntu 3.16.0-10-generic #15-Ubuntu SMP Thu Aug 21 16:32:31 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = POWER8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
- Ubuntu 14.10 LE guest running on a POWER8 machine with PowerKVM build 2_1_1.8
- Download and build the LTP suite on the guest, then run /opt/ltp/testscripts/ltpstress.sh -d /tmp/sardata -l /tmp/ltplog.12028 -m 128 -t 24 -S
- After 2 hours of the test run, dmesg starts showing the trace messages below.

syslog:
---------
Aug 31 09:31:59 ubuntu kernel: [83796.274731] Adding 576k swap on swapfile29. Priority:-29 extents:1 across:576k FS
Aug 31 09:32:00 ubuntu in.rshd[8457]: connect from 127.0.0.1 (127.0.0.1)
Aug 31 09:32:01 ubuntu in.rshd[8459]: connect from 127.0.0.1 (127.0.0.1)
Aug 31 09:32:02 ubuntu in.rshd[8461]: connect from 127.0.0.1 (127.0.0.1)
Sep 1 04:42:36 ubuntu kernel: [147953.248523] INFO: rcu_sched detected stalls on CPUs/tasks: { 15} (detected by 2, t=92214 jiffies, g=440674, c=440673, q=304)
Sep 1 04:42:36 ubuntu kernel: [147953.248720] Task dump for CPU 15:
Sep 1 04:42:36 ubuntu kernel: [147953.248725] genload R running task 0 22734 22733 0x00040000
Sep 1 04:42:36 ubuntu kernel: [147953.248730] Call Trace:
Sep 1 04:42:36 ubuntu kernel: [147953.248740] [c0000000033239b0] [c000000000056fe4] ht64_call_hpte_insert1+0x4/0x3c (unreliable)
Sep 1 04:42:36 ubuntu kernel: [147953.248745] [c000000003323ab0] [c0000000000532c8] hash_preload+0x2f8/0x300
Sep 1 04:42:36 ubuntu kernel: [147953.248748] [c000000003323b30] [c00000000004eaf0] update_mmu_cache+0xf0/0x110
Sep 1 04:42:36 ubuntu kernel: [147953.248753] [c000000003323b70] [c00000000023559c] handle_mm_fault+0xa0c/0x11b0
Sep 1 04:42:36 ubuntu kernel: [147953.248758] [c000000003323c10] [c0000000009e58dc] do_page_fault+0x71c/0x990
Sep 1 04:42:36 ubuntu kernel: [147953.248762] [c000000003323e30] [c000000000009568] handle_page_fault+0x10/0x30
Sep 1 04:42:36 ubuntu kernel: [147953.250365] INFO: rcu_sched detected stalls on CPUs/tasks: { 15} (detected by 2, t=16035133 jiffies, g=440674, c=440673, q=304)
Sep 1 04:42:36 ubuntu kernel: [147953.250519] Task dump for CPU 15:
Sep 1 04:42:36 ubuntu kernel: [147953.250522] genload R running task 0 22734 22733 0x00040000
Sep 1 04:42:36 ubuntu kernel: [147953.250525] Call Trace:
Sep 1 04:42:36 ubuntu kernel: [147953.250528] [c0000000033239b0] [c000000000056fe4] ht64_call_hpte_insert1+0x4/0x3c (unreliable)
Sep 1 04:42:36 ubuntu kernel: [147953.250532] [c000000003323ab0] [c0000000000532c8] hash_preload+0x2f8/0x300
Sep 1 04:42:36 ubuntu kernel: [147953.250535] [c000000003323b30] [c00000000004eaf0] update_mmu_cache+0xf0/0x110
Sep 1 04:42:36 ubuntu kernel: [147953.250538] [c000000003323b70] [c00000000023559c] handle_mm_fault+0xa0c/0x11b0
Sep 1 04:42:36 ubuntu kernel: [147953.250541] [c000000003323c10] [c0000000009e58dc] do_page_fault+0x71c/0x990
Sep 1 04:42:36 ubuntu kernel: [147953.250544] [c000000003323e30] [c000000000009568] handle_page_fault+0x10/0x30
Sep 1 04:42:36 ubuntu kernel: [147953.257562] BUG: soft lockup - CPU#15 stuck for 59737s! [genload:22734]
Sep 1 04:42:36 ubuntu kernel: [147953.257647] Modules linked in: nfsv2 nfsv3 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache pseries_rng rtc_generic e1000 ohci_pci
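
For reference, the figures in the syslog above can be cross-checked with a little arithmetic. The sketch below compares the jump in the kernel timestamps (83796 to 147953) with the t=16035133 jiffies value from the second RCU stall report. HZ = 250 is an assumption, not something stated in this report; it should be checked against CONFIG_HZ in the guest's kernel config. The helper names are made up for illustration.

```python
# Assumption: the guest kernel runs with CONFIG_HZ=250. Verify this in
# /boot/config-$(uname -r) before trusting the conversion.
HZ = 250


def jiffies_to_seconds(jiffies, hz=HZ):
    """Convert a jiffies count from an RCU stall report to seconds."""
    return jiffies / hz


def timestamp_gap(before, after):
    """Seconds between two kernel log timestamps such as [83796.274731]."""
    return after - before


if __name__ == "__main__":
    # Last timestamp before the gap and first timestamp after it.
    gap = timestamp_gap(83796.274731, 147953.248523)
    # Stall length reported by the second rcu_sched message.
    stall = jiffies_to_seconds(16035133)
    print(f"kernel timestamp gap: {gap:.0f} s ({gap / 3600:.1f} h)")
    print(f"reported RCU stall:   {stall:.0f} s ({stall / 3600:.1f} h)")
```

Under that assumption both values come out at roughly 17.8 hours and within about 20 seconds of each other, which would be consistent with CPU 15 not having run at all for that period rather than with a task spinning in the kernel.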

Other details :
------------------
@ubuntu:/tmp$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
NUMA node(s): 1
Model: IBM pSeries (emulated by qemu)
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-15

@ubuntu:/tmp$ free
             total       used       free     shared    buffers     cached
Mem:       2072704     892480    1180224        448     274240     132480
-/+ buffers/cache:     485760    1586944
Swap:      3460160      35392    3424768

@ubuntu:/tmp$ uptime
 05:22:02 up 1 day, 19:06, 2 users, load average: 10.67, 9.10, 9.32

Thanks

== Comment: #1 - ABDUL HALEEM <email address hidden> - 2014-09-01 05:31:58 ==

== Comment: #2 - ABDUL HALEEM <email address hidden> - 2014-09-01 05:36:48 ==

== Comment: #5 - MAMATHA INAMDAR <email address hidden> - 2014-09-05 05:03:56 ==
Hi Abdul,
Are you able to recreate this issue?
Please update the bug with your latest test results.

== Comment: #6 - ABDUL HALEEM <email address hidden> - 2014-09-10 05:55:47 ==
(In reply to comment #5)
> Hi Abdul,
> Are you able to recreate this issue?
> Please update the bug with your latest test results.

Hi Mamatha,

I have started the test again with xmon enabled.

I will keep you updated on the status.

Thanks

== Comment: #7 - ABDUL HALEEM <email address hidden> - 2014-09-10 05:59:17 ==
I have started the test on 3.16.0-14-generic and I still see these messages in syslog

[ 8075.169576] Unable to find swap-space signature
[ 7452.105450] Unable to find swap-space signature

Should we worry about this?

The original problem has not reproduced yet. I will update the bug soon.

== Comment: #8 - Dan Streetman <email address hidden> - 2014-09-10 08:44:21 ==
(In reply to comment #7)
> I have started the test on 3.16.0-14-generic and I still see these messages
> in syslog
>
> [ 8075.169576] Unable to find swap-space signature
> [ 7452.105450] Unable to find swap-space signature
>
> should we worry about this.

It looks like you have some kind of tests creating/adding swap files, and I have no idea what those tests look like, so I don't know if this is an expected result of the tests or not. Generally that error means you are trying to swapon a swap file that isn't correctly initialized with mkswap, or its header is corrupted.

Assuming your test isn't expecting a failure, you should just mkswap again on whatever swap file is failing. It looks like "./swapfile01", but since you're using relative paths, I can't tell you where it's located.
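
To check a suspect swap file before calling swapon on it, something like the sketch below may help. It relies on the swap header layout, where the last 10 bytes of the first page hold the magic string "SWAPSPACE2". The function name is made up for illustration, and using the running system's page size as the default is an assumption: the file must have been prepared with mkswap for the same page size.

```python
import os

# Magic string written by mkswap into the last 10 bytes of the first page.
SWAP_MAGIC = b"SWAPSPACE2"


def has_swap_signature(path, page_size=None):
    """Return True if the file at `path` carries the swap signature.

    page_size defaults to the page size of the running system
    (assumption: the file was prepared for this system).
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    with open(path, "rb") as f:
        f.seek(page_size - len(SWAP_MAGIC))
        return f.read(len(SWAP_MAGIC)) == SWAP_MAGIC


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        state = "ok" if has_swap_signature(name) else "no swap signature"
        print(f"{name}: {state}")
```

A file that reports "no swap signature" here would need mkswap run on it again, as described above.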

== Comment: #9 - ABDUL HALEEM <email address hidden> - 2014-09-11 04:09:13 ==
Hi,

I recreated the bug on latest kernel 3.16.0-14-generic

If I recall correctly, this is the scenario in which the kernel triggered the soft lockup - CPU#15 traces:

During my first test run, I found the guest in the 'paused' state the next day, because the host disk partition on which /var/lib/libvirt/images is mounted had run out of space. I freed up disk space and resumed the guest. The tests were still running, but dmesg showed the trace messages.

So in my last run I recreated a similar scenario with xmon=on and found that the traces are triggered when I suspend and resume the guest while the tests are running, not by the tests themselves.

--- Actual steps to reproduce --
- Enable xmon in /etc/default/grub, then run 'update-grub' and 'reboot'
- Run the ltpstress test
- Suspend the guest with 'virsh suspend <guest>'
- After a few seconds, resume the guest; the tests keep running fine
- dmesg shows the original trace messages, as below

When the traces were triggered, the console did not drop into xmon. I guess this might be a different problem.

I have kept the system in the same state.

Trace messages:
[84735.190787] Adding 576k swap on swapfile27. Priority:-27 extents:1 across:576k FS
[84735.740298] Adding 576k swap on swapfile28. Priority:-28 extents:1 across:576k FS
[84736.062528] Adding 576k swap on swapfile29. Priority:-29 extents:1 across:576k FS
[84924.032436] BUG: soft lockup - CPU#0 stuck for 104s! [float_bessel:10251]
[84924.032507] Modules linked in: nfsv2 nfsv3 nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache pseries_rng rtc_generic shpchp ohci_pci e1000
[84924.032525] CPU: 0 PID: 10251 Comm: float_bessel Not tainted 3.16.0-14-generic #20-Ubuntu
[84924.032527] task: c000000003100000 ti: c00000003250c000 task.ti: c00000003250c000
[84924.032529] NIP: c0000000000110b4 LR: c0000000000110b4 CTR: 00003fffb4644120
[84924.032531] REGS: c00000003250fb90 TRAP: 0901 Not tainted (3.16.0-14-generic)
[84924.032532] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22002444 XER: 00000000
[84924.032538] CFAR: 00003fffb4645888 SOFTE: 1
GPR00: c00000000000a704 c00000003250fe10 c0000000013d49e0 0000000000000900
GPR04: 0000000000040004 0000000000000000 00000000009c0000 00000000ff001009
GPR08: 000182dee8f4d56f 000000007fefffff 0000000040cc8595 0000000000000000
GPR12: 0000000000002200 00003fffab8658f0
[84924.032552] NIP [c0000000000110b4] arch_local_irq_restore+0x74/0x90
[84924.032554] LR [c0000000000110b4] arch_local_irq_restore+0x74/0x90
[84924.032556] Call Trace:
[84924.032557] [c00000003250fe10] [0000000000002856] 0x2856 (unreliable)
[84924.032561] [c00000003250fe30] [c00000000000a704] ret_from_except_lite+0x30/0x60
[84924.032562] Instruction dump:
[84924.032563] 994d02ba 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[84924.032566] 7c0803a6 4e800020 60420000 4bff1315 <60000000> 4bffffe4 60420000 e92d0020
[84926.062119] Adding 576k swap on ./swapfile01. Priority:-2 extents:1 across:576k FS
[84936.733247] Adding 65472k swap on ./swapfile01. Priority:-2 extents:2 across:114624k
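
To pull the soft lockup reports out of a long dmesg or syslog capture, a small parser along these lines could be used. It matches only the "BUG: soft lockup" line format seen in the traces in this report; the function and pattern names are made up for illustration.

```python
import re

# Matches lines such as:
#   BUG: soft lockup - CPU#0 stuck for 104s! [float_bessel:10251]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return (cpu, seconds_stuck, command, pid) for each soft lockup line."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append((int(m.group("cpu")), int(m.group("secs")),
                         m.group("comm"), int(m.group("pid"))))
    return hits


if __name__ == "__main__":
    import sys

    for cpu, secs, comm, pid in parse_soft_lockups(sys.stdin):
        print(f"CPU {cpu}: stuck {secs}s in {comm} (pid {pid})")
```

Feeding it the guest's dmesg output on standard input gives one line per report, which makes it easier to compare the reported stuck times with how long the guest was suspended.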

Thanks

bugproxy (bugproxy) wrote : var log messages

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-115436 severity-medium targetmilestone-inin---
bugproxy (bugproxy) wrote : test logs

Default Comment by Bridge

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1370421/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2014-09-19 17:10 EDT-------
(In reply to comment #9)
> So in my last run I recreated similar scenario with xmon=on and found that
> the traces are triggered when I suspend and resume my guest when test were
> running and not because of my actual test.
>
> --- Actual steps to reproduce --
> - enable xmon in /etc/default/grub and run 'update-grub' and 'reboot'
> - Run ltpstress test
> - suspend the guest 'virsh suspend <guest>'
> - after few seconds resume. my test running fine
> - dmesg showed the original traces messages as below

This is normal for powerpc, currently.

In the watchdog:
	/*
	 * If a virtual machine is stopped by the host it can look to
	 * the watchdog like a soft lockup, check to see if the host
	 * stopped the vm before we issue the warning
	 */
	if (kvm_check_and_clear_guest_paused())
		return HRTIMER_RESTART;

The generic function requires arch support:
/*
 * This function is used by architectures that support kvm to avoid issuing
 * false soft lockup messages.
 */
static inline bool kvm_check_and_clear_guest_paused(void)
{
	return false;
}

But powerpc doesn't currently implement that check:

static inline bool kvm_check_and_clear_guest_paused(void)
{
	return false;
}

You can safely ignore all those soft lockup messages, if they occur when your guest is suspended. As that appears to be the case in this bug, I'm rejecting it.
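
The effect of the stub can be illustrated with a toy user-space model. This is not kernel code: the 20 second threshold, the GuestPauseFlag class, and the function names are assumptions made for the sketch. It only mirrors the order of the checks quoted above, where the watchdog asks whether the guest was paused before it warns.

```python
# Assumption: a 20 second soft lockup threshold, used here only as a
# parameter of the model.
SOFTLOCKUP_THRESHOLD = 20


class GuestPauseFlag:
    """Hypothetical stand-in for a 'host stopped this guest' indicator."""

    def __init__(self):
        self.stopped = False

    def check_and_clear(self):
        """Report whether the guest was paused, then reset the flag."""
        was_stopped, self.stopped = self.stopped, False
        return was_stopped


def stub_check():
    """Models the stub quoted above: it never reports a pause."""
    return False


def should_warn(stalled_for, check_and_clear_guest_paused,
                threshold=SOFTLOCKUP_THRESHOLD):
    """Decide whether the model watchdog prints a soft lockup warning."""
    if stalled_for <= threshold:
        return False
    if check_and_clear_guest_paused():
        # The host stopped the guest, so the stall is not a real lockup.
        return False
    return True


if __name__ == "__main__":
    flag = GuestPauseFlag()
    flag.stopped = True  # pretend the host suspended the guest
    print("with stub:      ", should_warn(104, stub_check))
    print("with pause flag:", should_warn(104, flag.check_and_clear))
```

With the stub, a 104 second gap always produces a warning. With a working pause check, the same gap is attributed to the suspend and the warning is skipped once, after which the flag is cleared so that later real lockups are still reported.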

Anton Blanchard (anton-samba) wrote :

We need to implement kvm_check_and_clear_guest_paused() on powerpc to make these warnings go away

affects: ubuntu → linux (Ubuntu)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1370421

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Confirmed
Luciano Chavez (lnx1138)
Changed in linux (Ubuntu):
status: Confirmed → Invalid