Activity log for bug #1680549

Date Who What changed Old value New value Message
2017-04-06 18:03:54 Manoj Iyer bug added bug
2017-04-06 23:02:57 Manoj Iyer linux (Ubuntu): importance Undecided Critical
2017-04-10 15:54:04 Manoj Iyer description [IMPACT] On QDF2400 ARM64 servers, booting Zesty 4.10 kernel causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+bandera.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 
0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches applied in this order fixes this issue. 
d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty Ubuntu-4.10.0-18.20 on QDF2400 SDP. [IMPACT] On QDF2400 ARM64 servers, booting Zesty 4.10 kernel causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+bandera.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 
104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] 
[<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 
111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] 
__switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches applied in this order fixes this issue. d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong and https://patchwork.kernel.org/patch/9668743/ [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty Ubuntu-4.10.0-18.20 on QDF2400 SDP.
2017-04-26 19:13:49 Brian Kearns bug added subscriber Brian Kearns
2017-05-03 20:05:41 Manoj Iyer description [IMPACT] On QDF2400 ARM64 servers, booting Zesty 4.10 kernel causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+bandera.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 
0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches applied in this order fixes this issue. 
d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong and https://patchwork.kernel.org/patch/9668743/ [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty Ubuntu-4.10.0-18.20 on QDF2400 SDP. [IMPACT] On QDF2400 ARM64 servers, booting Zesty 4.10 kernel causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: 
ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] 
rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] 
[<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 
0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches, applied in this order, fix this issue. d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong and https://patchwork.kernel.org/patch/9668743/ [Test case] After applying the patches, the kernel boots with no soft lockups. This was tested by me on Zesty Ubuntu-4.10.0-18.20 on QDF2400 SDP.
2017-05-03 20:19:03 Manoj Iyer description [IMPACT] On QDF2400 ARM64 servers, booting Zesty 4.10 kernel causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 
0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches applied in this order fixes this issue. 
d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong and https://patchwork.kernel.org/patch/9668743/ [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty Ubuntu-4.10.0-18.20 on QDF2400 SDP. [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 
104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] 
[<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 
111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 
1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches, applied in this order, fix this issue. 5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches, the kernel boots with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP.
2017-05-03 20:19:52 Manoj Iyer description [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 
: 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches applied in this order fixes this issue. 
5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP. [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 
x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 
111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 
[ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 
0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches cherry-picked from linux-next fixes this issue. 5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP.
2017-05-03 20:28:51 Manoj Iyer description [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 
: 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches cherry-picked from linux-next fixes this issue. 
5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP. [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 
x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 
111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 
[ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 
0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches cherry-picked from linux-next fixes this issue. 5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP. [Regression Potential] These patches applicable to iommu driver and does not impact any platform code.
2017-05-05 13:14:23 Stefan Bader nominated for series Ubuntu Zesty
2017-05-05 13:14:23 Stefan Bader bug task added linux (Ubuntu Zesty)
2017-05-10 14:42:06 Manoj Iyer tags qdf2400
2017-06-02 16:30:50 Manoj Iyer linux (Ubuntu): status In Progress Incomplete
2017-06-05 22:51:17 Manoj Iyer description [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 
: 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] [<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 
111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] [<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] 
__handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches cherry-picked from linux-next fixes this issue. 
5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches the kernel boot with no soft lockups. This was tested by me on Zesty 4.10.0-20.22 kernel on QDF2400 SDP. [Regression Potential] These patches applicable to iommu driver and does not impact any platform code. [IMPACT] Booting Zesty 4.10 kernel on Qualcomm Centriq 2400 ARM64 servers causes soft lockups on multiple CPUs. [ 104.205397] Modules linked in: nls_iso8859_1 cdc_acm bridge stp llc ipmi_ssif ipmi_devintf ipmi_msghandler shpchp hdma hdma_mgmt i2c_qup cppc_cpufreq ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage at803x aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha1_ce mlx5_core devlink ptp pps_core ahci_platform libahci_platform libahci qcom_emac sdhci_acpi sdhci xhci_plat_hcd pinctrl_qdf2xxx fjes aes_neon_blk crypto_simd cryptd [ 104.205442] CPU: 47 PID: 0 Comm: swapper/47 Tainted: G L 4.10.0-16-generic #18ubuntuRC03+<redacted>.1 [ 104.205443] Hardware name: Qualcomm QDF2400 DP/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R3-00153 QDF2400_REL CRM 02/ 8/2017 [ 104.205444] task: ffff9fa30ed49c00 task.stack: ffff9fa30ed5c000 [ 104.205447] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38 [ 104.205450] LR is at alloc_iova+0x1cc/0x2a0 [ 104.205451] pc : [<ffff3f0624a00974>] lr : [<ffff3f0624682e8c>] pstate: 20400145 [ 104.205452] sp : ffff9fa31fbecc00 [ 104.205453] x29: ffff9fa31fbecc00 x28: 0000000ffffefe46 [ 104.205455] x27: 0000000000000040 x26: 
0000000fffffffff [ 104.205458] x25: ffff3f06251f8000 x24: 0000000000000001 [ 104.205460] x23: ffff9fa30da06008 x22: 0000000000000000 [ 104.205462] x21: ffff9fa2e2af8740 x20: ffff9fa30da06008 [ 104.205464] x19: 0000000000000140 x18: 00000000a5e112c1 [ 104.205466] x17: 000000004d48a1ed x16: 00000000b0f9c455 [ 104.205468] x15: 00000000aa4269e9 x14: 0000000085094ac4 [ 104.205471] x13: 000000009b3b00da x12: 000000008aae8d9c [ 104.205473] x11: ffff9fa31fbf90b0 x10: ffff3f0624eb70eb [ 104.205475] x9 : 0000000000000000 x8 : 0000000000000004 [ 104.205477] x7 : ffff9fa2e2875400 x6 : 0000000000000000 [ 104.205479] x5 : ffff9fa2e2875401 x4 : 0000000000000000 [ 104.205481] x3 : ffff9fa2e2a27b00 x2 : ffff9fa2e2875400 [ 104.205483] x1 : 0000000000000140 x0 : 000000000000f7c2 [ 111.198062] INFO: rcu_sched self-detected stall on CPU [ 111.198971] INFO: rcu_sched detected stalls on CPUs/tasks: [ 111.198977] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6805 [ 111.198979] 32-...: (1 GPs behind) idle=291/1/0 softirq=469/470 fqs=6805 [ 111.198980] (detected by 2, t=15002 jiffies, g=143, c=142, q=6968) [ 111.199000] Task dump for CPU 31: [ 111.199002] swapper/31 R running task 0 0 1 0x00000002 [ 111.199006] Call trace: [ 111.199012] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199014] [<0000000b7160dcd2>] 0xb7160dcd2 [ 111.199015] Task dump for CPU 32: [ 111.199016] swapper/32 R running task 0 0 1 0x00000002 [ 111.199018] Call trace: [ 111.199019] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.199020] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 111.227703] 31-...: (1 GPs behind) idle=1b3/2/0 softirq=432/433 fqs=6809 [ 111.234558] (t=15010 jiffies g=143 c=142 q=6968) [ 111.239334] Task dump for CPU 31: [ 111.239335] swapper/31 R running task 0 0 1 0x00000002 [ 111.239338] Call trace: [ 111.239344] [<ffff3f062408b030>] dump_backtrace+0x0/0x2b0 [ 111.239346] [<ffff3f062408b304>] show_stack+0x24/0x30 [ 111.239350] [<ffff3f0624103f80>] sched_show_task+0x128/0x178 [ 111.239352] 
[<ffff3f0624106d68>] dump_cpu_task+0x48/0x58 [ 111.239356] [<ffff3f0624200d38>] rcu_dump_cpu_stacks+0xbc/0xf0 [ 111.239359] [<ffff3f06241409e8>] rcu_check_callbacks+0x7a8/0x968 [ 111.239362] [<ffff3f0624146e1c>] update_process_times+0x34/0x60 [ 111.239365] [<ffff3f0624159118>] tick_sched_handle.isra.7+0x38/0x70 [ 111.239367] [<ffff3f062415919c>] tick_sched_timer+0x4c/0x98 [ 111.239369] [<ffff3f06241477a0>] __hrtimer_run_queues+0xe8/0x2e8 [ 111.239371] [<ffff3f0624148340>] hrtimer_interrupt+0xa8/0x228 [ 111.239376] [<ffff3f062487c02c>] arch_timer_handler_phys+0x3c/0x50 [ 111.239379] [<ffff3f0624133964>] handle_percpu_devid_irq+0x8c/0x230 [ 111.239383] [<ffff3f062412d8b4>] generic_handle_irq+0x34/0x50 [ 111.239385] [<ffff3f062412dfe0>] __handle_domain_irq+0x68/0xc0 [ 111.239386] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239388] Exception stack(0xffff9fa31fa7caa0 to 0xffff9fa31fa7cbd0) [ 111.239390] caa0: ffff9fa31fa7cad0 0001000000000000 ffff9fa31fa7cc00 ffff3f0624a00974 [ 111.239392] cac0: 0000000020400145 0000000000000001 00000000000000fe 0000000000000140 [ 111.239394] cae0: ffff9fa2e10b1c00 ffff9fa2e11c8800 0000000000000000 ffff9fa2e10b1c01 [ 111.239396] cb00: 0000000000000000 ffff9fa2e10b1c00 ffff9fa3035ee681 0000000000000000 [ 111.239397] cb20: ffff7e7e8b8533e0 ffff9fa31fa890b0 0000000000000000 000000009b3b00da [ 111.239399] cb40: 0000000085094ac4 00000000aa4269e9 0000000046e68d43 000000004d48a1ed [ 111.239401] cb60: 00000000a5e112c1 0000000000000140 ffff9fa30da06008 ffff9fa2e1073ac0 [ 111.239403] cb80: 0000000000000000 ffff9fa30da06008 0000000000000001 ffff3f06251f8000 [ 111.239404] cba0: 0000000fffffffff 0000000000000040 0000000ffffef50a ffff9fa31fa7cc00 [ 111.239406] cbc0: ffff3f0624682e8c ffff9fa31fa7cc00 [ 111.239407] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239411] [<ffff3f0624682e8c>] alloc_iova+0x1cc/0x2a0 [ 111.239413] [<ffff3f0624680488>] __alloc_iova+0x78/0x88 [ 111.239414] [<ffff3f0624680528>] __iommu_dma_map+0x90/0x128 [ 111.239416] 
[<ffff3f0624680e30>] iommu_dma_map_page+0x60/0x78 [ 111.239420] [<ffff3f062409c8fc>] __iommu_map_page+0x5c/0xd0 [ 111.239565] [<ffff3f06201046d0>] mlx5e_alloc_rx_wqe+0x118/0x318 [mlx5_core] [ 111.239607] [<ffff3f06201050e8>] mlx5e_post_rx_wqes+0xa0/0x110 [mlx5_core] [ 111.239647] [<ffff3f06201075dc>] mlx5e_napi_poll+0x22c/0x518 [mlx5_core] [ 111.239650] [<ffff3f06248cdda0>] net_rx_action+0x2e8/0x3f0 [ 111.239652] [<ffff3f0624081aa8>] __do_softirq+0x148/0x31c [ 111.239656] [<ffff3f06240d3d68>] irq_exit+0xd0/0x120 [ 111.239658] [<ffff3f062412dfe4>] __handle_domain_irq+0x6c/0xc0 [ 111.239660] [<ffff3f06240818b4>] gic_handle_irq+0xc4/0x170 [ 111.239661] Exception stack(0xffff9fa30ecffd80 to 0xffff9fa30ecffeb0) [ 111.239663] fd80: ffff9fa31fa85200 0000609cfabd2000 0000000006400000 0000000000000004 [ 111.239665] fda0: 0000000000003296 0000000000000015 000000005c57e302 0000000000000000 [ 111.239667] fdc0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239668] fde0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 111.239670] fe00: 0000000000000000 0000000000000000 00000000ffffffff 0000000b7179114e [ 111.239672] fe20: ffff9fa3041c8000 0000000000000003 ffff3f0625292eb8 0000000000000000 [ 111.239673] fe40: 0000000b7160dcd2 0000000000000003 0000000000000000 0000000000000000 [ 111.239675] fe60: 0000000000000000 ffff9fa30ecffeb0 ffff3f06248549bc ffff9fa30ecffeb0 [ 111.239677] fe80: ffff3f06248549c4 0000000060400145 ffff9fa30ecffeb0 ffff3f06248549bc [ 111.239678] fea0: ffffffffffffffff 0000000b7160dcd2 [ 111.239680] [<ffff3f062408315c>] el1_irq+0xdc/0x180 [ 111.239684] [<ffff3f06248549c4>] cpuidle_enter_state+0x124/0x318 [ 111.239686] [<ffff3f0624854c2c>] cpuidle_enter+0x34/0x48 [ 111.239689] [<ffff3f062411c030>] call_cpuidle+0x40/0x70 [ 111.239691] [<ffff3f062411c344>] do_idle+0x1ac/0x1f8 [ 111.239693] [<ffff3f062411c5c4>] cpu_startup_entry+0x2c/0x30 [ 111.239695] [<ffff3f0624091008>] secondary_start_kernel+0x158/0x198 [ 
111.239696] [<00000000112091a4>] 0x112091a4 [ 111.239697] Task dump for CPU 32: [ 111.239699] swapper/32 R running task 0 0 1 0x00000002 [ 111.239701] Call trace: [ 111.239704] [<ffff3f0624086250>] __switch_to+0x98/0xb0 [ 111.239705] [<0000000bcde2fa4e>] 0xbcde2fa4e [ 129.361765] ip_tables: (C) 2000-2006 Netfilter Core Team [ 129.397270] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 129.438584] Ebtables v2.0 registered [FIX] The following patches, cherry-picked from linux-next, fix this issue: 5016bdb796b3 iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range d9a5f8adaec9 iommu/dma: Plumb in the per-CPU IOVA caches fc7f6142bacb iommu/dma: Clean up MSI IOVA allocation 568c61384ea1 iommu/dma: Convert to address-based allocation dddd632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong [Test case] After applying the patches, the kernel boots with no soft lockups. I tested this on the Zesty 4.10.0-20.22 kernel on a QDF2400 SDP. [Regression Potential] These patches apply to the IOMMU driver and do not impact any platform code. Please see the comments section for regression tests on ARM64, Power8 and Intel platforms.
2017-06-08 12:57:12 Seth Forshee linux (Ubuntu): status Incomplete Fix Committed
2017-06-09 07:52:45 Stefan Bader linux (Ubuntu Zesty): importance Undecided Critical
2017-06-09 07:52:45 Stefan Bader linux (Ubuntu Zesty): status New Fix Committed
2017-06-14 09:16:26 Kleber Sacilotto de Souza tags qdf2400 qdf2400 verification-needed-zesty
2017-06-16 15:52:36 Manoj Iyer tags qdf2400 verification-needed-zesty qdf2400 verification-done-zesty
2017-06-29 07:17:58 Launchpad Janitor linux (Ubuntu Zesty): status Fix Committed Fix Released
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-1000364
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-100363
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-8890
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-9074
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-9075
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-9076
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-9077
2017-06-29 07:17:58 Launchpad Janitor cve linked 2017-9242
2017-06-30 17:19:06 Launchpad Janitor linux (Ubuntu): status Fix Committed Fix Released