Comment 0 for bug 1680549

Manoj Iyer (manjo) wrote :

[IMPACT]
On QDF2400 ARM64 servers, booting the Zesty 4.10 kernel causes soft lockups on multiple CPUs. The trace below shows a stalled CPU in alloc_iova(), reached from the mlx5_core receive path.

[ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd

[ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+bandera.1
[ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017
[ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000
[ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38
[ 104.205450] LR is at alloc_iova+0x1cc/0x2a0
[ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145
[ 104.205452] sp : ffff9fa31fbecc00
[ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46
[ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff
[ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001
[ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000
[ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008
[ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1
[ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455
[ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4
[ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c
[ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb
[ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004
[ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000
[ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000
[ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400
[ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2

[ 111.198062] INFO: rcu_sched self-detected stall on CPU
[ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805
[ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805
[ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968)
[ 111.199000] Task dump for CPU 31:
[ 111.199002] swapper/31 R running task 0 0 1 0x00000002
[ 111.199006] Call trace:
[ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2
[ 111.199015] Task dump for CPU 32:
[ 111.199016] swapper/32 R running task 0 0 1 0x00000002
[ 111.199018] Call trace:
[ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e
[ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809
[ 111.234558] (t=15010 jiffies g=143 c=142 q=6968)
[ 111.239334] Task dump for CPU 31:
[ 111.239335] swapper/31 R running task 0 0 1 0x00000002
[ 111.239338] Call trace:
[ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0
[ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30
[ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178
[ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58
[ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0
[ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968
[ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60
[ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70
[ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98
[ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8
[ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228
[ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50
[ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230
[ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50
[ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0
[ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170
[ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0)
[ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974
[ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140
[ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01
[ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000
[ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da
[ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed
[ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0
[ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000
[ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00
[ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00
[ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180
[ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0
[ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88
[ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128
[ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78
[ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0
[ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core]
[ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core]
[ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core]
[ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0
[ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c
[ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120
[ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0
[ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170
[ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0)
[ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004
[ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000
[ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e
[ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000
[ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000
[ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0
[ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc
[ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2
[ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180
[ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318
[ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48
[ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70
[ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8
[ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30
[ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198
[ 111.239696] [<00000000112091a4>] 0x112091a4
[ 111.239697] Task dump for CPU 32:
[ 111.239699] swapper/32 R running task 0 0 1 0x00000002
[ 111.239701] Call trace:
[ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0
[ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e
[ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 129.438584] Ebtables v2.0 registered

[FIX]
The following patches, applied in this order, fix the issue:
d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches
fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation
568c61384ea1 iommu/dma: Convert to address-based allocation
dddd632b072f iommu/dma: Implement PCI allocation optimisation
de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong

[TEST CASE]
After applying the patches, the kernel boots with no soft lockups. I tested this with Zesty Ubuntu-4.10.0-18.20 on a QDF2400 SDP.
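
The "no soft lockups" check can be automated by scanning a saved boot log for stall messages. The following is only a minimal sketch, not part of the original test procedure: the function name find_stalls is hypothetical, the two rcu_sched patterns are copied from the log above, and the "soft lockup" pattern is an assumption based on the usual kernel watchdog message, which does not appear in this excerpt.

```python
import re

# The rcu_sched patterns are copied from the log in this report.
# "soft lockup" is an assumed extra pattern for the kernel watchdog
# message; it does not appear in the excerpt above.
STALL_PATTERNS = [
    re.compile(r"rcu_sched self-detected stall on CPU"),
    re.compile(r"rcu_sched detected stalls on CPUs/tasks"),
    re.compile(r"soft lockup"),
]


def find_stalls(log_text):
    """Return every log line that matches one of the stall patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in STALL_PATTERNS)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python3 find_stalls.py boot.log
    # where boot.log is a saved copy of the kernel log (for example from dmesg).
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        hits = find_stalls(log_file.read())
    for hit in hits:
        print(hit)
    # Exit non-zero if any stall message was found.
    sys.exit(1 if hits else 0)
```

An empty result only means that none of these patterns matched; it does not prove the absence of other kinds of hangs.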