[Zesty] QDF2400 ARM64 server - NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s!

Bug #1680549 reported by Manoj Iyer on 2017-04-06
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Critical    Manoj Iyer
Zesty           Fix Released  Critical    Unassigned

Bug Description

[IMPACT]
Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs.

[ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd

[ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1
[ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017
[ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000
[ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38
[ 104.205450] LR is at alloc_iova+0x1cc/0x2a0
[ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145
[ 104.205452] sp : ffff9fa31fbecc00
[ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46
[ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff
[ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001
[ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000
[ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008
[ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1
[ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455
[ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4
[ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c
[ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb
[ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004
[ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000
[ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000
[ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400
[ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2

[ 111.198062] INFO: rcu_sched self-detected stall on CPU
[ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805
[ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805
[ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968)
[ 111.199000] Task dump for CPU 31:
[ 111.199002] swapper/31 R running task 0 0 1 0x00000002
[ 111.199006] Call trace:
[ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2
[ 111.199015] Task dump for CPU 32:
[ 111.199016] swapper/32 R running task 0 0 1 0x00000002
[ 111.199018] Call trace:
[ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e
[ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809
[ 111.234558] (t=15010 jiffies g=143 c=142 q=6968)
[ 111.239334] Task dump for CPU 31:
[ 111.239335] swapper/31 R running task 0 0 1 0x00000002
[ 111.239338] Call trace:
[ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0
[ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30
[ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178
[ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58
[ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0
[ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968
[ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60
[ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70
[ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98
[ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8
[ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228
[ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50
[ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230
[ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50
[ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0
[ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170
[ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0)
[ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974
[ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140
[ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01
[ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000
[ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da
[ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed
[ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0
[ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000
[ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00
[ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00
[ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180
[ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0
[ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88
[ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128
[ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78
[ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0
[ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core]
[ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core]
[ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core]
[ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0
[ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c
[ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120
[ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0
[ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170
[ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0)
[ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004
[ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000
[ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e
[ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000
[ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000
[ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0
[ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc
[ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2
[ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180
[ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318
[ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48
[ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70
[ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8
[ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30
[ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198
[ 111.239696] [<00000000112091a4>] 0x112091a4
[ 111.239697] Task dump for CPU 32:
[ 111.239699] swapper/32 R running task 0 0 1 0x00000002
[ 111.239701] Call trace:
[ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e
[ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 129.438584] Ebtables v2.0 registered
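The call traces above are easier to compare once the function names are pulled out of the raw lines. The following is a minimal sketch, assuming the log has been saved as text; the helper name `trace_symbols` and the regular expression are illustrative and not part of any kernel tooling.

```python
import re

# Matches symbolised call-trace entries such as:
#   [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0
#   [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core]
# Entries without a symbol (for example "[<00000000112091a4>] 0x112091a4") are skipped.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def trace_symbols(log_text):
    """Return (symbol, module) pairs for each symbolised frame, in log order.

    module is None when the line carries no trailing [module] annotation.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames


# Lines copied from the trace in this report.
sample = """\
[ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180
[ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0
[ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core]
[ 111.239696] [<00000000112091a4>] 0x112091a4
"""

for symbol, module in trace_symbols(sample):
    print(symbol, module or "(no module annotation)")
```

Run over the full log, this makes it quick to see that the stalled CPU was interrupted inside alloc_iova, reached from the mlx5_core receive path.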

[FIX]
The following patches, cherry-picked from linux-next, fix this issue:
5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches
fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation
568c61384ea1 iommu/dma: Convert to address-based allocation
dddd632b072f iommu/dma: Implement PCI allocation optimisation
de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong
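The trace shows the stuck CPU in alloc_iova, returning from a spinlock release, and one of the patch titles mentions per-CPU IOVA caches. The toy model below only illustrates the general idea of putting a per-CPU cache in front of a lock-protected allocator. It is not the kernel's code: the class names, the cache capacity and the allocation pattern are all assumptions made for the illustration.

```python
import threading


class GlobalAllocator:
    """Hypothetical stand-in for an allocator guarded by one shared lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_free = 0
        self.free_list = []
        self.lock_acquisitions = 0  # counts how often the shared lock is taken

    def alloc(self):
        with self.lock:
            self.lock_acquisitions += 1
            if self.free_list:
                return self.free_list.pop()
            addr = self.next_free
            self.next_free += 1
            return addr

    def free(self, addr):
        with self.lock:
            self.lock_acquisitions += 1
            self.free_list.append(addr)


class PerCpuCache:
    """Hypothetical per-CPU cache of recently freed addresses."""

    def __init__(self, backend, num_cpus, capacity=4):
        self.backend = backend
        self.capacity = capacity
        self.caches = [[] for _ in range(num_cpus)]

    def alloc(self, cpu):
        cache = self.caches[cpu]
        if cache:
            return cache.pop()       # served locally, shared lock not taken
        return self.backend.alloc()  # cache empty, fall back to the shared allocator

    def free(self, cpu, addr):
        cache = self.caches[cpu]
        if len(cache) < self.capacity:
            cache.append(addr)       # keep for reuse by the same CPU
        else:
            self.backend.free(addr)  # cache full, hand back to the shared allocator


# Same workload (100 alloc/free pairs on one CPU) with and without the cache.
direct = GlobalAllocator()
for _ in range(100):
    direct.free(direct.alloc())

backend = GlobalAllocator()
cached = PerCpuCache(backend, num_cpus=2)
for _ in range(100):
    cached.free(0, cached.alloc(0))

print("shared lock taken without cache:", direct.lock_acquisitions)
print("shared lock taken with cache:   ", backend.lock_acquisitions)
```

In this model a workload that repeatedly maps and unmaps buffers on the same CPU, such as a receive path, almost never reaches the shared lock once the cache is warm.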

[Test case]
After applying the patches, the kernel boots with no soft lockups. I tested this with the Zesty 4.10.0-20.22 kernel on a QDF2400 SDP.
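The check described above can be automated by scanning a saved kernel log for the two message types seen in this report. This is a minimal sketch, assuming the log was captured to text (for example from `dmesg`); the function names `find_lockups` and `boot_is_clean` are made up for the illustration, and the patterns cover only the messages quoted in this bug.

```python
import re

# Message fragments taken from the title and log of this report.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s"),
    "rcu_stall": re.compile(r"INFO: rcu_sched (?:self-)?detected stalls? on CPU"),
}


def find_lockups(log_text):
    """Return a dict mapping each pattern name to the list of matching lines."""
    hits = {name: [] for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


def boot_is_clean(log_text):
    """True when none of the known lockup or stall messages appear."""
    return not any(find_lockups(log_text).values())


bad_log = (
    "NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s!\n"
    "[ 111.198062] INFO: rcu_sched self-detected stall on CPU\n"
)
good_log = "[ 129.438584] Ebtables v2.0 registered\n"

print(boot_is_clean(bad_log))
print(boot_is_clean(good_log))
```

A clean result only shows that these specific messages are absent from the captured log; it does not replace booting and exercising the machine.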

[Regression Potential]
These patches apply to the IOMMU driver and do not impact any platform code. Please see the comments section for regression tests on ARM64, Power8, and Intel platforms.

Manoj Iyer (manjo) on 2017-04-06
Changed in linux (Ubuntu):
importance: Undecided → Critical
Manoj Iyer (manjo) on 2017-04-10
description: updated
Manoj Iyer (manjo) on 2017-05-03
description: updated
Manoj Iyer (manjo) on 2017-05-03
description: updated
Manoj Iyer (manjo) on 2017-05-03
description: updated
Manoj Iyer (manjo) wrote :

Tested on Thunder-X: No regressions.

Manoj Iyer (manjo) wrote :

Tested on HiSilicon: No regressions.

Manoj Iyer (manjo) on 2017-05-10
tags: added: qdf2400
Lowell Sochia (lsochia) wrote :

HP Spectre X360, 2017 model, running Ubuntu 17.04.

If I work on this computer and it never gets a chance to enter standby mode, a clean shutdown or reboot can be performed. If the computer has entered standby mode, shutdown or reboot hangs with the following messages:

CPU#1 stuck for 22s
CPU#2 stuck for 22s

I get quite a few of these error messages.

Manoj Iyer (manjo) wrote :

Tested on Power8:
ubuntu@manjo-srutest:~$ uname -a
Linux manjo-srutest 4.10.0-22-generic #24~sru4+test.1-Ubuntu SMP Wed May 24 18:42:19 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@manjo-srutest:~$

Manoj Iyer (manjo) wrote :

Tested on AMD64:
ubuntu@adib:~$ uname -a
Linux adib 4.10.0-22-generic #24~sru4+test.4 SMP Thu May 25 16:18:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@adib:~$

Manoj Iyer (manjo) on 2017-06-02
Changed in linux (Ubuntu):
status: In Progress → Incomplete
Manoj Iyer (manjo) on 2017-06-05
description: updated
Seth Forshee (sforshee) on 2017-06-08
Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
Stefan Bader (smb) on 2017-06-09
Changed in linux (Ubuntu Zesty):
importance: Undecided → Critical
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
Manoj Iyer (manjo) wrote :

The proposed kernel in Zesty boots fine on the AW SDP at Canonical, and does not report any NMI softlockups.

ubuntu@ubuntu:~$ uname -a
Linux ubuntu 4.10.0-23-generic #25-Ubuntu SMP Fri Jun 9 09:36:27 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@ubuntu:~$ apt-cache policy linux-image-4.10.0-23-generic
linux-image-4.10.0-23-generic:
  Installed: 4.10.0-23.25
  Candidate: 4.10.0-23.25
  Version table:
 *** 4.10.0-23.25 500
        500 http://us.ports.ubuntu.com/ubuntu-ports zesty-proposed/main arm64 Packages
        100 /var/lib/dpkg/status

tags: added: verification-done-zesty
removed: verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-26.30

---------------
linux (4.10.0-26.30) zesty; urgency=low

  * linux: 4.10.0-26.30 -proposed tracker (LP: #1700528)

  * CVE-2017-1000364
    - Revert "UBUNTU: SAUCE: mm: Only expand stack if guard area is hit"
    - Revert "mm: do not collapse stack gap into THP"
    - Revert "mm: enlarge stack guard gap"
    - mm: larger stack guard gap, between vmas
    - mm: fix new crash in unmapped_area_topdown()
    - Allow stack to grow up to address space limit

linux (4.10.0-25.29) zesty; urgency=low

  * linux: 4.10.0-25.29 -proposed tracker (LP: #1699028)

  * CVE-2017-1000364
    - SAUCE: mm: Only expand stack if guard area is hit

  * CVE-2017-9074
    - ipv6: Prevent overrun when parsing v6 header options
    - ipv6: Check ip6_find_1stfragopt() return value properly.

  * [Zesty] QDF2400 ARM64 server - NMI watchdog: BUG: soft lockup - CPU#8 stuck
    for 22s! (LP: #1680549)
    - iommu/dma: Stop getting dma_32bit_pfn wrong
    - iommu/dma: Implement PCI allocation optimisation
    - iommu/dma: Convert to address-based allocation
    - iommu/dma: Clean up MSI IOVA allocation
    - iommu/dma: Plumb in the per-CPU IOVA caches
    - iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range

  * Zesty update to 4.10.17 stable release (LP: #1692898)
    - xen: adjust early dom0 p2m handling to xen hypervisor behavior
    - target: Fix compare_and_write_callback handling for non GOOD status
    - target/fileio: Fix zero-length READ and WRITE handling
    - iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
    - usb: xhci: bInterval quirk for TI TUSB73x0
    - usb: host: xhci: print correct command ring address
    - USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
    - USB: Proper handling of Race Condition when two USB class drivers try to
      call init_usb_class simultaneously
    - USB: Revert "cdc-wdm: fix "out-of-sync" due to missing notifications"
    - staging: vt6656: use off stack for in buffer USB transfers.
    - staging: vt6656: use off stack for out buffer USB transfers.
    - staging: gdm724x: gdm_mux: fix use-after-free on module unload
    - staging: wilc1000: Fix problem with wrong vif index
    - staging: comedi: jr3_pci: fix possible null pointer dereference
    - staging: comedi: jr3_pci: cope with jiffies wraparound
    - usb: misc: add missing continue in switch
    - usb: gadget: legacy gadgets are optional
    - usb: Make sure usb/phy/of gets built-in
    - usb: hub: Fix error loop seen after hub communication errors
    - usb: hub: Do not attempt to autosuspend disconnected devices
    - x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
    - selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
    - x86, pmem: Fix cache flushing for iovec write < 8 bytes
    - um: Fix PTRACE_POKEUSER on x86_64
    - perf/x86: Fix Broadwell-EP DRAM RAPL events
    - KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
    - KVM: arm/arm64: fix races in kvm_psci_vcpu_on
    - arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
    - block: fix blk_integrity_register to use templ...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released