Ubuntu 17.04 KVM: stack trace generated when enabling SRIOV in power

Bug #1702768 reported by bugproxy
This bug report is a duplicate of: Bug #1701272: New NVLINK2 patches.
This bug affects 1 person
Affects                            Status       Importance  Assigned to            Milestone
The Ubuntu-power-systems project   In Progress  Undecided   Canonical Kernel Team
linux (Ubuntu)                     In Progress  Medium      Joseph Salisbury
Zesty                              In Progress  Medium      Joseph Salisbury

Bug Description

---Problem Description---
Enabling SRIOV with kernel 4.10.0-26-generic on a Power system produces this stack trace:
[ 2084.079575] ------------[ cut here ]------------
[ 2084.079583] WARNING: CPU: 120 PID: 734 at /build/linux-TAhFXm/linux-4.10.0/arch/powerpc/platforms/powernv/npu-dma.c:78 pnv_pci_get_npu_dev+0x40/0xb0
[ 2084.079584] Modules linked in: mst_pciconf(OE) mst_pci(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) ib_ipoib(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) binfmt_misc bridge stp llc ipmi_powernv ipmi_devintf ipmi_msghandler powernv_rng powernv_op_panel uio_pdrv_genirq leds_powernv uio ibmpowernv vmx_crypto sunrpc ib_iser(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
[ 2084.079640] xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx4_en(OE) ses enclosure scsi_transport_sas crc32c_vpmsum tg3 mlx5_core(OE) mlx4_core(OE) ipr devlink mlx_compat(OE)
[ 2084.079658] CPU: 120 PID: 734 Comm: kworker/120:0 Tainted: G W OE 4.10.0-26-generic #30-Ubuntu
[ 2084.079663] Workqueue: events work_for_cpu_fn
[ 2084.079665] task: c000000fee60dc00 task.stack: c000000fee534000
[ 2084.079666] NIP: c00000000009c210 LR: c00000000009d404 CTR: 0000000000000000
[ 2084.079668] REGS: c000000fee537700 TRAP: 0700 Tainted: G W OE (4.10.0-26-generic)
[ 2084.079669] MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[ 2084.079677] CR: 42004428 XER: 20000000
[ 2084.079678] CFAR: c00000000009d400 SOFTE: 1
               GPR00: c00000000009d404 c000000fee537980 c00000000145d100 0000000000000000
               GPR04: 0000000000000000 0000000000000aa6 c000001fff700000 0000000000049188
               GPR08: 0000000000000007 0000000000000001 0000000000000001 0000000000000000
               GPR12: 0000000000002200 c00000000fbc3800 c00000000010ef48 c000000ff70ec540
               GPR16: c000000ffa622c58 c000000ffa622a10 c000000ffa6229a0 0000000000000001
               GPR20: 0000000000000000 c000000001318de8 c000000000d700e8 0000000000000001
               GPR24: c000000000d6f070 c000000000d6f050 c000000003d02000 c000000003d02098
               GPR28: c000000e92680060 0800001fffffffff ffffffffffffffff 0000000000000000
[ 2084.079702] NIP [c00000000009c210] pnv_pci_get_npu_dev+0x40/0xb0
[ 2084.079704] LR [c00000000009d404] pnv_npu_try_dma_set_bypass+0x144/0x250
[ 2084.079705] Call Trace:
[ 2084.079708] [c000000fee5379b0] [c00000000009d404] pnv_npu_try_dma_set_bypass+0x144/0x250
[ 2084.079710] [c000000fee537a80] [c000000000096c74] pnv_pci_ioda_dma_set_mask+0xa4/0x150
[ 2084.079714] [c000000fee537b00] [c0000000000291a0] dma_set_mask+0x40/0xc0
[ 2084.079728] [c000000fee537b20] [d0000000143531e4] init_one+0x33c/0x6a0 [mlx5_core]
[ 2084.079732] [c000000fee537bd0] [c00000000066ba9c] local_pci_probe+0x6c/0x140
[ 2084.079734] [c000000fee537c60] [c0000000001016b8] work_for_cpu_fn+0x38/0x60
[ 2084.079737] [c000000fee537c90] [c0000000001061a0] process_one_work+0x2b0/0x5a0
[ 2084.079740] [c000000fee537d20] [c000000000106780] worker_thread+0x2f0/0x650
[ 2084.079742] [c000000fee537dc0] [c00000000010f0a4] kthread+0x164/0x1b0
[ 2084.079746] [c000000fee537e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[ 2084.079747] Instruction dump:
[ 2084.079748] 7c0802a6 fbe1fff8 f8010010 f821ffd1 7c690074 7929d182 0b090000 2fa30000
[ 2084.079753] 419e0060 e8630330 7c690074 7929d182 <0b090000> 2fa30000 419e0048 7c852378
[ 2084.079759] ---[ end trace 7bf01a937efd69d8 ]---

This issue was introduced by this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c3b89effc281704d5395282c800c45e453235f6 (Subject: powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev )

and the solution will be to add this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=377aa6b0efbaa29cfeecd8b9244641217f9544ca

which reads: "powerpc/npu-dma: Remove spurious WARN_ON when a PCI device has no of_node"

Requesting fix inclusion in 17.04 and probably 16.04.3.

---uname output---
4.10.0-26-generic #30-Ubuntu SMP Tue Jun 27 09:29:34 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Need a Mellanox card that supports SRIOV.

Machine Type = P8

---Steps to Reproduce---
Enable SRIOV on a Power system with a Mellanox CX4 or CX5 card, for example:
echo 1 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

Stack trace output (the WARNING trace is identical to the one in the problem description above, so only its first and last lines and the surrounding driver messages are repeated here):
[ 2084.079567] mlx5_core 0004:01:04.0: Using 64-bit DMA iommu bypass
[ 2084.079575] ------------[ cut here ]------------
[ 2084.079583] WARNING: CPU: 120 PID: 734 at /build/linux-TAhFXm/linux-4.10.0/arch/powerpc/platforms/powernv/npu-dma.c:78 pnv_pci_get_npu_dev+0x40/0xb0
...
[ 2084.079759] ---[ end trace 7bf01a937efd69d8 ]---
[ 2084.080096] mlx5_core 0004:01:04.0: firmware version: 12.20.1010
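
The reproduction step and the check for the warning could be scripted roughly as follows. This is only a sketch: the helper name count_npu_warnings is hypothetical, the sysfs path is the one quoted in this report, and reading sriov_totalvfs first assumes the device exposes the standard PCI SR-IOV sysfs attributes.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): count kernel log lines on
# stdin that carry the WARNING raised from pnv_pci_get_npu_dev.
count_npu_warnings() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the helper usable under "set -e".
    grep -c 'WARNING: .*pnv_pci_get_npu_dev' || true
}

# Usage on the affected host (as root); device path as given in this report:
#   cat /sys/class/infiniband/mlx5_0/device/sriov_totalvfs   # VFs supported (assumed present)
#   echo 1 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
#   dmesg | count_npu_warnings
```

A non-zero count after enabling a VF would indicate that the kernel still emits the warning.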

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-156405 severity-high targetmilestone-inin1704
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with a pick of commit 377aa6b0efba. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1702768/

Can you test this kernel and see if it resolves this bug?

Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → In Progress
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-07-10 15:54 EDT-------
(In reply to comment #5)
> I built a test kernel with a pick of commit 377aa6b0efba. The test kernel
> can be downloaded from:
>
> http://kernel.ubuntu.com/~jsalisbury/lp1702768/
>
> Can you test this kernel and see if it resolves this bug?

I tried this kernel and it is ok.
[ 128.224664] (0004:01:00.0): E-Switch: E-Switch enable SRIOV: nvfs(1) mode (1)
[ 128.234634] (0004:01:00.0): E-Switch: SRIOV enabled: active vports(2)
[ 128.234818] mlx5_core 0004:01:00.0: VF BAR0: [mem 0x240000000000-0x2401ffffffff 64bit pref] shifted to [mem 0x240000000000-0x2401ffffffff 64bit pref] (Disabling 1 VFs shifted by 0)
[ 128.234836] pci 0004:01: 0.2: [PE# 00] VF 0004:01:00.2 associated with PE#0
[ 128.235086] pci 0004:01: 0.2: [PE# 00] Setting up 32-bit TCE table at 0..80000000
[ 128.238861] pci 0004:01: 0.2: [PE# 00] Setting up window#0 0..7fffffff pg=1000
[ 128.238972] pci 0004:01: 0.2: [PE# 00] Enabling 64-bit DMA bypass
[ 128.344614] pci 0004:01:00.2: [15b3:1014] type 00 class 0x020000
[ 128.344942] pci 0004:01:00.2: Max Payload Size set to 512 (was 128, max 512)
[ 128.345403] iommu: Adding device 0004:01:00.2 to group 6
[ 128.345871] mlx5_core 0004:01:00.2: enabling device (0000 -> 0002)
[ 128.345907] mlx5_core 0004:01:00.2: Using 64-bit DMA iommu bypass
[ 128.346076] mlx5_core 0004:01:00.2: firmware version: 12.20.1010
[ 128.902589] mlx5_core 0004:01:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0)
[ 128.903017] mlx5_core 0004:01:00.2: Assigned random MAC address 2a:fc:b7:49:03:1b
[ 129.007113] mlx5_core 0004:01:00.2 enP4p1s0f2: renamed from eth0
[ 129.015731] mlx5_ib: Mellanox Connect-IB Infiniband driver v2.2-1 (Feb 2014)

uname -a
4.10.0-26-generic #30~lp1702768 SMP Mon Jul 10 18:37:50 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
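
Note that the test kernel reports the same release string (4.10.0-26-generic) as the affected kernel, so only the build tag tells them apart. A small check along these lines could confirm which one is running. It is a sketch, and the function name is_lp1702768_build is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel version string carries the
# "~lp1702768" build tag seen in the uname output of the test kernel.
is_lp1702768_build() {
    case "$1" in
        *'~lp1702768'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: uname -v prints the build/version field (e.g. "#30~lp1702768 SMP ...").
#   is_lp1702768_build "$(uname -v)" && echo "test kernel is running"
```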

Changed in linux (Ubuntu Zesty):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-08-04 21:34 EDT-------
jsalisbury,

This one might also apply to Ubuntu 16.04.3.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-14 08:47 EDT-------
Joseph,

Any updates on this one? At this point in time I am not even sure it is worthwhile to provide a fix for Ubuntu 17.04. We should target Ubuntu 17.10 probably.

Andrew Cloke (andrew-cloke) wrote :

This bug is marked as a duplicate of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1701272 which is marked as "Fix Released" for 17.04.

Joseph Salisbury (jsalisbury) wrote :

@lagarcia, This bug is a duplicate of bug 1701272 and that bug is Fix Released. Do you still see 17.04 or 17.10 exhibiting this bug?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-21 12:48 EDT-------
Joseph,

Has this been released already?

Leonardo Garcia (laggarcia) wrote :

Joseph,

Disregard my last comment.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-21 12:51 EDT-------
Hi Joseph,

Sorry, I just saw that the latest comments never reached IBM Bugzilla. Closing this one from IBM side as well. Thanks!
