[PowerVM] Kernel BUG @ kernel/irq_work.c:157! - 24x7 hw counters

Bug #1410519 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to    Milestone
linux (Ubuntu)  Fix Released  Low         Unassigned
Utopic          Fix Released  Medium      Chris J Arges

Bug Description

[Impact]
Using perf with hv_24x7 events can cause a kernel BUG.

[Fix]
The following upstream commits:
 d658972
 48bee8a
 f34b6c7
 ec2aef5

[Test Case]
Steps to recreate the problem:

1. Install Ubuntu 15.04 as a PowerVM guest.
2. Install the perf tool.
3. Run the following scripts to test the 24x7 POWER8 hardware counter events with the perf tool.

=== Script 1
#!/bin/bash

count=0;

offset=0x128
PERF_ARGS="-r 10 -C 0"
while [ $count -lt 100 ]; do

        EVENT="hv_24x7/domain=0x2,offset=$offset,starting_index=10/"

        perf stat $PERF_ARGS -x ' ' perf stat $PERF_ARGS -x ' ' -e $EVENT ls

        count=$((count + 1))
done

=== Script 2
#!/bin/bash

offset=0;

PERF_ARGS="-r 10 -C 0"
while [ $offset -lt 8192 ]; do

        EVENT="hv_24x7/domain=0x2,offset=$offset,starting_index=10/"

        perf stat $PERF_ARGS -x ' ' perf stat $PERF_ARGS -x ' ' -e $EVENT ls

        offset=)
done
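
The loops in the two scripts above can be sketched as one helper. This is only a sketch under stated assumptions: the names hv24x7_event, sweep_offsets and DRY_RUN are hypothetical, the offset step is an explicit argument because the offset increment in Script 2 is not legible in this report, and perf stat is invoked once per iteration rather than nested as in the scripts. The domain, starting_index and perf arguments are taken from the scripts.

```shell
#!/bin/bash
# Build an hv_24x7 event string in the form used by the scripts in this report.
hv24x7_event() {
    local domain="$1" offset="$2" index="$3"
    printf 'hv_24x7/domain=%s,offset=%s,starting_index=%s/\n' \
        "$domain" "$offset" "$index"
}

# Hypothetical sweep: run perf stat for each offset in [start, end).
# The step is an assumption supplied by the caller, not a value from the report.
# With DRY_RUN set, the commands are printed instead of executed.
sweep_offsets() {
    local offset="$1" end="$2" step="$3" event
    # Refuse a non-positive step, which would loop forever.
    [ "$step" -gt 0 ] || return 1
    while [ "$offset" -lt "$end" ]; do
        event="$(hv24x7_event 0x2 "$offset" 10)"
        if [ -n "${DRY_RUN:-}" ]; then
            echo "perf stat -r 10 -C 0 -x ' ' -e $event ls"
        else
            perf stat -r 10 -C 0 -x ' ' -e "$event" ls
        fi
        offset=$((offset + step))
    done
}
```

For example, DRY_RUN=1 sweep_offsets 0 8192 64 prints the commands for a sweep without running perf; the step of 64 is an arbitrary illustration.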

After a few iterations of these scripts, I hit the following BUG.

tt2.sh tt.sh
tt2.sh tt.sh
tt2.sh tt.sh
275679187521558 hv_24x7/domain=0x2,offset=6848,starting_index=10/ 0.00%
tt2.sh tt.sh
[ 4657.314709] softirq: huh, entered softirq 7 SCHED c00000000010abc0 with preempt_count 00000100, exited with bfff0000?
[ 4657.314727] kernel BUG at /build/buildd/linux-3.16.0/kernel/irq_work.c:157!
[ 4657.314732] Oops: Exception in kernel mode, sig: 5 [#1]
[ 4657.314740] Modules linked in: rtc_generic pseries_rng
[ 4657.314749] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-25-generic #33-U
[ 4657.314755] task: c000000001375e00 ti: c0000000013d0000 task.ti: c0000000013d0000
[ 4657.314759] NIP: c0000000001e8ffc LR: c00000000001fe70 CTR: c000000000002800ic)
[ 4657.314770] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28042024 XER: 0000000a
[ 4657.314782] CFAR: c00000000001fe6c SOFTE: 0
GPR04: 0000000000000010 00000000009c0000 c000000001424a98 0000000000000002
GPR12: 8000000000009033 c00000000e9a0000 0000000006a3fcd0 0000000000000060
GPR16: 0000000000200000 0000000000000000 c000000000e57c00 0000000000000000
GPR20: c000000001595dca c000000001595478 0000000000000001 000000000000ffff
GPR28: c000000000e40380 c000000000e40300 c0000000013d3590 c000000000e56f08
[ 4657.314832] NIP [c0000000001e8ffc] irq_work_run+0x1c/0x30
[ 4657.314841] Call Trace:
4000 (unreliable)
[ 4657.314861] [c0000000013d34f0] [c00000000001ff90] timer_interrupt+0xa0/0xe0
[ 4657.314871] [c0000000013d3520] [c000000000002914] decrementer_common+0x114/0x180
[ 4657.314884] --- Exception: 901 at arch_local_irq_restore+0x14/0x90
[ 4657.314896] [c0000000013d3810] [c00000000012ed08] vprintk_emit+0x3b8/0x660 (u
[ 4657.314908] [c0000000013d38e0] [c000000000a02650] printk+0x84/0x98
[ 4657.314918] [c0000000013d3910] [c0000000000b51b4] __do_softirq+0x1e4/0x410
[ 4657.314927] [c0000000013d3a00] [c0000000000b57b8] irq_exit+0xf8/0x1400
[ 4657.314948] [c0000000013d3a60] [c000000000002c14] doorbell_super_common+0x114/0x180
[ 4657.314963] --- Exception: a01 at plpar_hcall_norets+0x8c/0xdc
[ 4657.314963] LR = check_and_cede_processor+0x34/0x5020/0x50 (unreliable)
[ 4657.314997] [c0000000013d3df0] [c00000000084077c] cpuidle_enter_state+0x6c/0x140c0
[ 4657.315030] [c0000000013d3f00] [c000000000d63ea8] start_kernel+0x500/0x51c
[ 4657.315047] Instruction dump:
[ 4657.315052] eba1ffe8 7c0803a6 ebc1fff0 ebe1fff8 4e800020 3c4c011f 3842c110 78290464
[ 4657.315068] 81290014 752a000f 7d380026 55291ffe <0b090000> 4bfffec8 60000000 60000000
[ 4657.315090] ---[ end trace ee202cccd2211e5d ]---
[ 4657.320224]
[ 4657.362675] Unable to handle kernel paging request for data at address 0xc000000b35515048
[ 4657.362680] Faulting instruction address: 0xc00000000006a37c
[ 4657.362684] Oops: Kernel access of bad area, sig: 11 [#2]
[ 4657.362686] SMP NR_CPUS=2048 NUMA pSeries
[ 4657.362695] CPU: 12 PID: 7 Comm: rcu_sched Tainted: G D 3.16.0-25-
[ 4657.362699] task: c0000000eb581540 ti: c0000000eb604000 task.ti: c0000000eb60
[ 4657.362703] NIP: c00000000006a37c LR: c0000000000865a8 CTR: c00000000006a340
[ 4657.362706] REGS: c0000000eb607800 TRAP: 0300 Tainted: G D (3.16.0-25-generic)
00000000
[ 4657.362718] CFAR: c0000000000865a4 DAR: c000000b35515048 DSISR: 40000000 SOFTE: 0
GPR00: c0000000000865a8 c0000000eb607a80 c0000000013d50f0 00000000013d30d0
GPR08: 0000000000cc0000 c000000b35515000 c00000000e9a0000 0000000000000000
GPR12: c00000000006a340 c00000000e9a6c00 0000000000000000 0000000000000001
GPR20: 0000000000000000 c000000001389700 0000000000000000 0000000000000001
GPR28: c000000001420a68 0000000000000000 00000000013d30d0 0000000000000001
[ 4657.362758] NIP [c00000000006a37c] icp_hv_cause_ipi+0x3c/0xc0
[ 4657.362762] LR [c0000000000865a8] pSeries_cause_ipi_mux+0x88/0xc0
[ 4657.362765] Call Trace:
0 (unreliable)
[ 4657.362774] [c0000000eb607af0] [c0000000000865a8] pSeries_cause_ipi_mux+0x88/0xc0
[ 4657.362778] [c0000000eb607b20] [c0000000000426f0] smp_muxed_ipi_message_pass+0x70/0x90
[ 4657.362783] [c0000000eb607b60] [c0000000000f3a58] resched_task+0x118/0x140
[ 4657.362786] [c0000000eb607b90] [c0000000000f3da0] resched_cpu+0xc0/0x110
[ 4657.362791] [c0000000eb607be0] [c00000000013f170] rcu_implicit_dynticks_qs+0x200/0x230
[ 4657.362795] [c0000000eb607c10] [c00000000013de1c] force_qs_rnp+0x14c/0x250
[ 4657.362799] [c0000000eb607c90] [c0000000001407f0] rcu_gp_kthread+0x430/0x8e0
[ 4657.362803] [c0000000eb607d80] [c0000000000e0820] kthread+0x110/0x130
[ 4657.362807] [c0000000eb607e30] [c00000000000a468] ret_from_kernel_thread+0x5c/0x74
[ 4657.362810] Instruction dump:
[ 4657.362812] fbc1fff0 fbe1fff8 f8010010 f821ff91 7c7e1b78 60000000 60000000 3d220008
[ 4657.362818] 39493f00 1d3e0900 e94a0000 7d2a4a14 <abe90048> 7c0004ac 3860006c 7fe4fb78
[ 4657.362825] ---[ end trace ee202cccd2211e5e ]---
[ 4657.365085]
[ 4659.320264] Kernel panic - not syncing: Attempted to kill the idle task!
[ 4659.325500] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!

I backported the following 4 commits/patches from upstream [1]:

        1. commit d658972
        Author: Himangi Saraogi <email address hidden>
        Date: Tue Jul 22 23:40:19 2014 +0530

            powerpc/perf/hv-24x7: Use kmem_cache_free

        2. commit 48bee8a
        Author: Cody P Schafer <email address hidden>
        Date: Tue Sep 30 23:03:17 2014 -0700

              powerpc/perf/hv-24x7: use kmem_cache instead of aligned stack allocations

        3. https://lkml.org/lkml/2014/12/10/613
        4. https://lkml.org/lkml/2014/12/10/36

to the vivid kernel [2]. With them applied, the problem no longer reproduces.

Will Canonical cherry-pick those commits, or should we backport them?
(They apply without conflicts.)

[1] Patches 3 and 4 above were posted recently; the powerpc
      maintainer plans to merge them.

[2] git://kernel.ubuntu.com/ubuntu/ubuntu-vivid.git

===
break-fix: - ec2aef5a8d3c14272f7a2d29b34f1f8e71f2be5b
break-fix: - f34b6c72c3ebaa286d3311a825ef79eccbcca82f
break-fix: - 48bee8a6c98e34367fa9d5e1be14109c92cbbb3b
break-fix: - d6589722846a57a4ddf7af595a7f854ff5180950
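
Whether a given kernel tree already carries these four fixes can be checked with a short script. The following is a minimal sketch, not part of the original report: the helper name missing_fixes is hypothetical, and the second check assumes that a backport mentions the upstream SHA in its commit message (for example in a "cherry picked from commit" line), which may not hold for every tree.

```shell
#!/bin/bash
# Fix commits named in the break-fix lines of this report.
FIX_COMMITS="ec2aef5a8d3c14272f7a2d29b34f1f8e71f2be5b
f34b6c72c3ebaa286d3311a825ef79eccbcca82f
48bee8a6c98e34367fa9d5e1be14109c92cbbb3b
d6589722846a57a4ddf7af595a7f854ff5180950"

# Hypothetical helper: print every fix commit that HEAD of the given
# repository neither contains nor mentions in a commit message.
missing_fixes() {
    local repo="$1" sha
    for sha in $FIX_COMMITS; do
        # Case 1: the upstream commit itself is an ancestor of HEAD.
        if git -C "$repo" merge-base --is-ancestor "$sha" HEAD 2>/dev/null; then
            continue
        fi
        # Case 2 (assumption): a backport names the upstream SHA in its message.
        if [ -n "$(git -C "$repo" log -1 --format=%H --grep="$sha" 2>/dev/null)" ]; then
            continue
        fi
        echo "$sha"
    done
}
```

Running missing_fixes /path/to/kernel-tree prints nothing when all four commits are found; the path is a placeholder.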

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-119744 severity-critical targetmilestone-inin1504
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1410519/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Breno Leitão (breno-leitao) wrote :

Suka,

If patches [3] and [4] make it into kernel 3.19, we will not need to backport them, since this bug is targeted at 15.04 (and 15.04 will probably ship with a 3.19 kernel). On the other hand, if we miss the 3.19 window, I would ask you to backport them and attach the backport against the 3.19 vivid git repository.

Canonical, correct me if I am wrong, please.

Thanks
Breno

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: kernel-da-key
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-01-14 20:03 EDT-------
(In reply to comment #9)
> Suka,
>
> If patches [3] and [4] make it into kernel 3.19, we will not need to backport
> them, since this bug is targeted at 15.04 (and 15.04 will probably ship with a
> 3.19 kernel). On the other hand, if we miss the 3.19 window, I would ask you to
> backport them and attach the backport against the 3.19 vivid git repository.
>
> Canonical, correct me if I am wrong, please.
>
> Thanks
> Breno

Breno,

Patches [3] and [4] have been merged into 3.19-rc4:

commit f34b6c7
Author: <email address hidden> <email address hidden>
Date: Wed Dec 10 14:29:13 2014 -0800

powerpc/perf/hv-24x7: Use per-cpu page buffer

commit ec2aef5
Author: Sukadev Bhattiprolu <email address hidden>
Date: Wed Dec 10 01:43:34 2014 -0500

power/perf/hv-24x7: Use kmem_cache_free() instead of kfree

Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: nobody → Chris J Arges (arges)
importance: Undecided → Medium
status: Confirmed → In Progress
Chris J Arges (arges)
Changed in linux (Ubuntu Utopic):
assignee: nobody → Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Utopic):
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Medium → Undecided
Changed in linux (Ubuntu Utopic):
status: New → In Progress
description: updated
Changed in linux (Ubuntu):
status: In Progress → Triaged
Chris J Arges (arges) wrote :

SRU sent for 3.16 to the k-team mailing list. The fix for vivid will be picked up when we rebase to 3.19.

Breno Leitão (breno-leitao) wrote :

Thanks Suka,

Chris,

Our main concern for this bug is 15.04, so if you see any problem with this patch for 14.10, we can skip it and fix it in 15.04 only.

Thank you,
Breno

Andy Whitcroft (apw)
description: updated
tags: added: kernel-bug-break-fix
Changed in linux (Ubuntu):
importance: Undecided → Low
assignee: nobody → Andy Whitcroft (apw)
milestone: none → ubuntu-15.01
Andy Whitcroft (apw)
Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu):
status: Triaged → Confirmed
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
assignee: Andy Whitcroft (apw) → nobody
Chris J Arges (arges)
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Fix Committed → Confirmed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-utopic' to 'verification-done-utopic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-utopic
Chris J Arges (arges) wrote :

I can't verify this with a utopic host running a vivid guest. Is there additional information on how to reproduce this issue in this environment? Perf doesn't seem to find the hv_24x7 event.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-02-19 02:09 EDT-------
(In reply to comment #14)
> I can't verify this with a utopic host running a vivid guest. Is there
> additional information on how to reproduce this issue in this environment?
> Perf doesn't seem to find the hv_24x7 event.

Well, 24x7 counters are not supported on PowerKVM.

I was able to verify on a 15.04 PowerVM guest while running the
kernel from linux-image-3.18.0-13-generic. Both the test scripts
from the submitter's report passed and system stayed up.
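
Before running the test case, it can help to confirm that the guest exposes the 24x7 PMU at all. The check below is a sketch and not part of the original thread: the helper name hv24x7_available is hypothetical, and it relies on perf PMUs being listed under /sys/bus/event_source/devices, with the optional argument existing only so the sysfs root can be overridden.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the hv_24x7 PMU is registered with perf.
# The optional argument overrides the sysfs mount point (default: /sys).
hv24x7_available() {
    local sysfs_root="${1:-/sys}"
    # Entries here are normally symlinks, so test for existence only.
    [ -e "$sysfs_root/bus/event_source/devices/hv_24x7" ]
}
```

For example, hv24x7_available || echo "hv_24x7 PMU not found" reports an environment, such as the PowerKVM guest mentioned above, where the scripts cannot exercise the bug.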

tags: removed: verification-needed-utopic
Chris J Arges (arges) wrote :

Thanks for verifying this. I had confused PowerKVM with PowerVM.

tags: added: verification-done-utopic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-02-20 04:11 EDT-------
I have verified the fix with the latest 15.04 daily ISO builds. The problem is fixed.

tags: removed: verification-done-utopic
Chris J Arges (arges)
tags: added: verification-done-utopic
Chris J Arges (arges) wrote :

Actually I see that you've verified with 3.18 kernels, can someone verify with the 3.16 kernel as mentioned in comment #6? Thanks

tags: added: verification-needed-utopic
removed: verification-done-utopic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-02-20 15:33 EDT-------
(In reply to comment #19)
> Actually I see that you've verified with 3.18 kernels, can someone verify
> with the 3.16 kernel as mentioned in comment #6? Thanks

I have tested the 3.16 kernel (3.16.0-31-generic) from utopic-proposed and confirm that the problem is resolved.

tags: removed: verification-needed-utopic
Chris J Arges (arges) wrote :

Thanks!

tags: added: verification-done-utopic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.16.0-31.41

---------------
linux (3.16.0-31.41) utopic; urgency=low

  [ Seth Forshee ]

  * Release Tracking Bug
    - LP: #1419961

  [ Andy Whitcroft ]

  * [Debian] arm64 -- build ubuntu drivers
    - LP: #1411284
  * hyper-v -- fix comment handing in /etc/network/interfaces
    - LP: #1413020

  [ Ben Hutchings ]

  * SAUCE: rtsx_usb_ms: Use msleep_interruptible() in polling loop
    - LP: #1413149

  [ Brad Figg ]

  * SAUCE: Config IWLWIFI_UAPSD=N

  [ Kamal Mostafa ]

  * [Packaging] force "dpkg-source -I -i" behavior

  [ Kukjin Kim ]

  * SAUCE: (no-up) ARM: SAMSUNG: fix the CPU_ID for EXYNOS5440
    - LP: #1411062

  [ Leann Ogasawara ]

  * ubuntu: AUFS -- Resolve build failure union has no member named
    'd_child'

  [ Ming Lei ]

  * SAUCE: (no-up) ARM: EXYNOS: fix booting oops on exynos5440
    - LP: #1411062
  * SAUCE: (no-up) ARM: exynos5440-sd5v1: switch to fixed-link DT binding
    - LP: #1417339
  * SAUCE: (no-up) net: stmmac: add fixed_phy support via fixed-link DT
    binding
    - LP: #1417339

  [ Upstream Kernel Changes ]

  * Revert "[SCSI] mpt2sas: Remove phys on topology change."
    - LP: #1419125
  * Revert "[SCSI] mpt3sas: Remove phys on topology change"
    - LP: #1419125
  * Revert "ARM: 7830/1: delay: don't bother reporting bogomips in
    /proc/cpuinfo"
    - LP: #1419125
  * powerpc/powernv: Don't call generic code on offline cpus
    - LP: #1400411
  * powerpc/powernv: Return to cpu offline loop when finished in KVM guest
    - LP: #1400411
  * powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
    - LP: #1400411
  * powerpc/powernv: Enable Offline CPUs to enter deep idle states
    - LP: #1400411
  * powernv/cpuidle: Redesign idle states management
    - LP: #1400411
  * powernv/powerpc: Add winkle support for offline cpus
    - LP: #1400411
  * powerpc/kdump: Ignore failure in enabling big endian exception during
    crash
    - LP: #1410817
  * powerpc/perf/hv-24x7: Use kmem_cache_free
    - LP: #1410519
  * powerpc/perf/hv-24x7: use kmem_cache instead of aligned stack
    allocations
    - LP: #1410519
  * powerpc/perf/hv-24x7: Use per-cpu page buffer
    - LP: #1410519
  * power/perf/hv-24x7: Use kmem_cache_free() instead of kfree
    - LP: #1410519
  * KVM: x86: SYSENTER emulation is broken
    - LP: #1414651
    - CVE-2015-0239
  * powerpc/xmon: Fix another endiannes issue in RTAS call from xmon
    - LP: #1415919
  * HID: i2c-hid: call the hid driver's suspend and resume callbacks
    - LP: #1417363
  * HID: i2c-hid: Do not free buffers in i2c_hid_stop()
    - LP: #1417363
  * ALSA: hda - add mic mute led hook for dell machines
    - LP: #1418832
  * ALSA: hda - move DELL_WMI_MIC_MUTE_LED to the tail in the quirk chain
    - LP: #1381856, #1418832
  * ALSA: hda - fix the mic mute led problem for Latitude E5550
    - LP: #1381856, #1418832
  * drm/i915: don't warn if backlight unexpectedly enabled
    - LP: #1419125
  * drm/i915/dp: only use training pattern 3 on platforms that support it
    - LP: #1419125
  * udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
    - LP: #1419125
  * s390/3215: fix hanging console issue
    - LP...
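
A utopic system can be checked against the fixed release with a rough comparison of the kernel release string. This is a sketch under assumptions: the helper name utopic_kernel_has_fix is hypothetical, it treats ABI 3.16.0-31 as the first fixed one based on the 3.16.0-31.41 package above, and it cannot see the upload number (.41), which does not appear in uname -r. It also relies on sort -V for version ordering.

```shell
#!/bin/bash
# First fixed ABI, taken from the package version 3.16.0-31.41 in this report.
FIXED_ABI="3.16.0-31"

# Hypothetical helper. Returns 0 if the release is at or above FIXED_ABI,
# 1 if it is older, and 2 if it is not a 3.16.0 (utopic) kernel at all.
utopic_kernel_has_fix() {
    local release="$1" abi
    case "$release" in
        3.16.0-[0-9]*)
            # Keep only "version-abi": "3.16.0-25-generic" -> "3.16.0-25".
            abi="$(printf '%s\n' "$release" |
                sed 's/^\(3\.16\.0-[0-9][0-9]*\).*/\1/')"
            ;;
        *)
            return 2
            ;;
    esac
    # The fix is present when FIXED_ABI sorts first (or is equal).
    [ "$(printf '%s\n' "$FIXED_ABI" "$abi" | sort -V | head -n 1)" = "$FIXED_ABI" ]
}
```

For example, utopic_kernel_has_fix "$(uname -r)" fails for the 3.16.0-25-generic kernel shown in the oops above and succeeds for 3.16.0-31-generic.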

Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
bugproxy (bugproxy)
tags: removed: verification-done-utopic
bugproxy (bugproxy)
tags: added: verification-done-utopic
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
tags: removed: kernel-bug-break-fix