Kernel panics on mlx4_core (Mellanox Core driver) with SR-IOV mode

Bug #1473883 reported by Kamal Heib
This bug affects 1 person
Affects          Status         Importance   Assigned to     Milestone
linux (Ubuntu)   Fix Released   Medium       Unassigned
Vivid            Fix Released   Medium       Chris J Arges

Bug Description

SRU Justification:

[Impact]

Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host with multiple Mellanox devices of which only some support SR-IOV, leads to a kernel panic.

[Fix]

commit 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36
commit ed3d2276ef72be23c6367358d80004130d8c797d

$ git describe 5114a04e6c73a0c6e74737e801b8a3b3d40c7e36 ed3d2276ef72be23c6367358d80004130d8c797d
v4.1-rc6-1067-g5114a04
v4.1-rc6-1068-ged3d227
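
The `git describe` output above has the form `<tag>-<commits since tag>-g<abbreviated hash>`, so both fixes sit a little over a thousand commits past v4.1-rc6. A minimal sketch for splitting such a string follows; `parse_describe` is a hypothetical helper name used only for illustration, not a git command.

```shell
#!/bin/sh
# parse_describe is a hypothetical helper, not part of git.
# It splits a "<tag>-<count>-g<hash>" string as printed by `git describe`.
parse_describe() {
    desc=$1
    hash=${desc##*-g}     # text after the last "-g"
    rest=${desc%-g*}      # drop the trailing "-g<hash>"
    count=${rest##*-}     # trailing number: commits since the tag
    tag=${rest%-*}        # whatever remains is the tag name
    printf 'tag=%s commits_after_tag=%s abbrev=%s\n' "$tag" "$count" "$hash"
}

parse_describe v4.1-rc6-1067-g5114a04
parse_describe v4.1-rc6-1068-ged3d227
```

The second commit is one commit further from the tag than the first, which matches the two fixes having been applied back to back.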

[Test Case]

1- Add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
2- Unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
3- Load the mlx4_en kernel module: modprobe -v mlx4_en
4- Edit the /etc/modprobe.d/mlx4_core.conf file and comment out the "options mlx4_core num_vfs=60 port_type_array=2,2" line.
5- Repeat steps 2 and 3.
6- The kernel prints the call trace shown later in this description.
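
The steps above can be sketched as a script. This is an illustration under assumptions, not a tested reproducer: the names `run`, `reload_mlx4`, `reproduce` and the `DRY_RUN` switch are invented here, and `sed -i` assumes GNU sed. By default the script only prints the commands; running them for real needs root and can panic an affected host.

```shell
#!/bin/sh
# Sketch of the test case. Prints the commands unless DRY_RUN=0 is set.
# run, reload_mlx4, reproduce and DRY_RUN are illustrative names.
CONF=/etc/modprobe.d/mlx4_core.conf
OPTS='options mlx4_core num_vfs=60 port_type_array=2,2'
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"      # show the command instead of running it
    else
        "$@"
    fi
}

reload_mlx4() {
    run modprobe -rv mlx4_en      # step 2: unload the mlx4_* modules
    run modprobe -rv mlx4_ib
    run modprobe -rv mlx4_core
    run modprobe -v mlx4_en       # step 3: load mlx4_en again
}

reproduce() {
    run sh -c "echo '$OPTS' > $CONF"                 # step 1
    reload_mlx4                                      # steps 2 and 3
    run sed -i 's/^options mlx4_core/#&/' "$CONF"    # step 4: comment it out
    reload_mlx4                                      # step 5
    run dmesg                                        # step 6: inspect the log
}

reproduce
```

Keeping the dry-run default makes it possible to review the exact command sequence before executing it with `DRY_RUN=0` on a disposable test machine.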

--

Loading and unloading mlx4_core twice with SR-IOV mode enabled, on a host with multiple Mellanox devices of which only some support SR-IOV, leads to a kernel panic.

The following two upstream commits fix this issue:

commit 32b4ca5af1cf1c558dfca0e3417e9b35402401a6
Author: Carol L Soto <email address hidden>
Date: Tue Jun 2 16:07:23 2015 -0500

    net/mlx4_core: double free of dev_vfs

    If user loads mlx4_core with num_vfs greater than
    supported then variable dev->dev_vfs is freed 2 times after unloading the
    driver.

    Acked-by: Or Gerlitz <email address hidden>
    Signed-off-by: Carol L Soto <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

commit 7095b39f3189d2107045d769fdc32dfc0b704028
Author: Carol Soto <email address hidden>
Date: Tue Jun 2 16:07:24 2015 -0500

    net/mlx4_core: need to call close fw if alloc icm is called twice

    If mlx4_enable_sriov is called by adapter without this
    feature MLX4_DEV_CAP_FLAG2_SYS_EQS then during this path the function alloc
    icm is called twice without freeing the structures from the first time.

    Acked-by: Or Gerlitz <email address hidden>
    Signed-off-by: Carol L Soto <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Steps to reproduce:
1- Add the line "options mlx4_core num_vfs=60 port_type_array=2,2" to the /etc/modprobe.d/mlx4_core.conf file.
2- Unload the mlx4_* kernel modules: modprobe -rv mlx4_en; modprobe -rv mlx4_ib; modprobe -rv mlx4_core;
3- Load the mlx4_en kernel module: modprobe -v mlx4_en
4- Edit the /etc/modprobe.d/mlx4_core.conf file and comment out the "options mlx4_core num_vfs=60 port_type_array=2,2" line.
5- Repeat steps 2 and 3.
6- The kernel prints the following call trace.

Call Trace:
[ 1175.699487] mlx4_core 0000:24:00.0: Received reset from slave:7
[ 1175.767388] mlx4_core 0000:24:00.0: Received reset from slave:6
[ 1175.830898] mlx4_core 0000:24:00.0: Received reset from slave:5
[ 1175.898229] mlx4_core 0000:24:00.0: Received reset from slave:4
[ 1175.963514] mlx4_core 0000:24:00.0: Received reset from slave:3
[ 1176.035312] mlx4_core 0000:24:00.0: Received reset from slave:2
[ 1176.105085] mlx4_core 0000:24:00.0: Received reset from slave:1
[ 1177.253200] mlx4_core 0000:24:00.0: Disabling SR-IOV
[ 1179.724864] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 1179.724885] mlx4_core: Initializing 0000:21:00.0
[ 1185.760555] mlx4_core 0000:21:00.0: Enabling SR-IOV with 60 VFs
[ 1185.760575] mlx4_core 0000:21:00.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -22)
[ 1185.770550] mlx4_core 0000:21:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 1185.770552] mlx4_core 0000:21:00.0: PCIe link width is x8, device supports x8
[ 1185.771870] ------------[ cut here ]------------
[ 1185.771878] WARNING: CPU: 6 PID: 5947 at /build/buildd/linux-3.19.0/fs/sysfs/dir.c:31 sysfs_warn_dup+0x68/0x80()
[ 1185.771880] sysfs: cannot create duplicate filename '/devices/pci0000:20/0000:20:03.0/0000:21:00.0/msi_irqs/57'
[ 1185.771881] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
[ 1185.771920] nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
[ 1185.771931] CPU: 6 PID: 5947 Comm: modprobe Tainted: G OE 3.19.0-16-generic #16-Ubuntu
[ 1185.771932] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
[ 1185.771934] ffffffff81abb6d8 ffff88086cdb37c8 ffffffff817c2235 0000000000000007
[ 1185.771936] ffff88086cdb3818 ffff88086cdb3808 ffffffff8107595a 0000000000000292
[ 1185.771938] ffff88084d1ea000 ffff88086d1c1648 ffff8807b3df62d0 ffff880867ab85a0
[ 1185.771941] Call Trace:
[ 1185.771949] [<ffffffff817c2235>] dump_stack+0x45/0x57
[ 1185.771953] [<ffffffff8107595a>] warn_slowpath_common+0x8a/0xc0
[ 1185.771955] [<ffffffff810759d6>] warn_slowpath_fmt+0x46/0x50
[ 1185.771958] [<ffffffff8126ab58>] ? kernfs_path+0x48/0x60
[ 1185.771961] [<ffffffff8126e508>] sysfs_warn_dup+0x68/0x80
[ 1185.771963] [<ffffffff8126e1ff>] sysfs_add_file_mode_ns+0x14f/0x1c0
[ 1185.771966] [<ffffffff8126c050>] ? kernfs_create_dir_ns+0x50/0x80
[ 1185.771969] [<ffffffff8126edf9>] internal_create_group+0xd9/0x280
[ 1185.771971] [<ffffffff8126f0d9>] sysfs_create_groups+0x49/0xa0
[ 1185.771976] [<ffffffff8141bfad>] populate_msi_sysfs+0x1bd/0x200
[ 1185.771978] [<ffffffff8141c4c8>] pci_enable_msix+0x158/0x3c0
[ 1185.771980] [<ffffffff8141c75d>] pci_enable_msix_range+0x2d/0x70
[ 1185.771991] [<ffffffffc0900245>] mlx4_load_one+0xea5/0x1410 [mlx4_core]
[ 1185.771999] [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
[ 1185.772003] [<ffffffff81401155>] local_pci_probe+0x45/0xa0
[ 1185.772005] [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
[ 1185.772007] [<ffffffff81402489>] pci_device_probe+0xd9/0x130
[ 1185.772012] [<ffffffff81506523>] driver_probe_device+0xa3/0x410
[ 1185.772014] [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
[ 1185.772016] [<ffffffff815068d0>] ? __device_attach+0x40/0x40
[ 1185.772020] [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
[ 1185.772022] [<ffffffff81505f8e>] driver_attach+0x1e/0x20
[ 1185.772024] [<ffffffff81505b60>] bus_add_driver+0x180/0x250
[ 1185.772027] [<ffffffffc0344000>] ? 0xffffffffc0344000
[ 1185.772030] [<ffffffff81507164>] driver_register+0x64/0xf0
[ 1185.772034] [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
[ 1185.772042] [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
[ 1185.772047] [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[ 1185.772053] [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
[ 1185.772058] [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
[ 1185.772061] [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
[ 1185.772063] [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
[ 1185.772067] [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
[ 1185.772072] [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
[ 1185.772074] ---[ end trace 9d9c0896e72e5312 ]---
[ 1185.873139] mlx4_core 0000:21:00.0: command 0x31 timed out (go bit not cleared)
[ 1185.873147] mlx4_core 0000:21:00.0: device is going to be reset
[ 1186.881239] mlx4_core 0000:21:00.0: device was reset successfully
[ 1186.888006] mlx4_core 0000:21:00.0: NOP command failed to generate interrupt (IRQ 53), aborting
[ 1186.897831] mlx4_core 0000:21:00.0: BIOS or ACPI interrupt routing problem?
[ 1186.907762] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 1186.916462] IP: [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1186.922560] PGD 0
[ 1186.924814] Oops: 0002 [#1] SMP
[ 1186.928423] Modules linked in: mlx4_core(+) vxlan ip6_udp_tunnel udp_tunnel mst_pciconf(OE) mst_pci(OE) nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper scsi_dh ablk_helper cryptd joydev lpc_ich serio_raw ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter ioatdma dca hpilo mac_hid wmi sb_edac edac_core shpchp nfsd auth_rpcgss
[ 1187.008078] nfs_acl lockd grace sunrpc autofs4 hid_generic usbhid tg3 pata_acpi ptp hid psmouse hpsa pps_core [last unloaded: ib_addr]
[ 1187.020643] CPU: 8 PID: 5947 Comm: modprobe Tainted: G W OE 3.19.0-16-generic #16-Ubuntu
[ 1187.030455] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 03/01/2013
[ 1187.037778] task: ffff88079d6cb110 ti: ffff88086cdb0000 task.ti: ffff88086cdb0000
[ 1187.046064] RIP: 0010:[<ffffffff81181185>] [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1187.054859] RSP: 0018:ffff88086cdb39a0 EFLAGS: 00010206
[ 1187.060730] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 0000000000000000
[ 1187.068610] RDX: 00000000000ffff8 RSI: 0000000000000014 RDI: 0000000000000000
[ 1187.076492] RBP: ffff88086cdb39e8 R08: 0000000000000040 R09: 0000000000000000
[ 1187.084374] R10: 0000000000000040 R11: ffff88079bbf6000 R12: ffff8807b3e20000
[ 1187.092253] R13: ffff88086921a420 R14: ffff88086921a400 R15: 0000000000000001
[ 1187.100139] FS: 00007fadaa1b9700(0000) GS:ffff88087f840000(0000) knlGS:0000000000000000
[ 1187.109092] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1187.115445] CR2: 000000000000001c CR3: 0000000823f6f000 CR4: 00000000000407e0
[ 1187.123336] Stack:
[ 1187.125570] ffffffffc08f9d9f 0000000000000099 ffff88086921a3e0 ffff88086cdb39e8
[ 1187.133802] 0000000000000099 ffff8807b3e20000 ffff8807b3e23268 0000000000000099
[ 1187.142030] ffff8807b3e20000 ffff88086cdb3a18 ffffffffc08fab7c ffff8807b3e20000
[ 1187.150264] Call Trace:
[ 1187.153003] [<ffffffffc08f9d9f>] ? mlx4_free_icm+0x17f/0x1d0 [mlx4_core]
[ 1187.160526] [<ffffffffc08fab7c>] mlx4_cleanup_icm_table+0x5c/0x80 [mlx4_core]
[ 1187.168537] [<ffffffffc08fb5bd>] mlx4_free_icms+0x1d/0x100 [mlx4_core]
[ 1187.175849] [<ffffffffc08fba8b>] mlx4_close_hca+0x4b/0x70 [mlx4_core]
[ 1187.183072] [<ffffffffc08ff943>] mlx4_load_one+0x5a3/0x1410 [mlx4_core]
[ 1187.190480] [<ffffffffc0900c9b>] mlx4_init_one+0x4eb/0x600 [mlx4_core]
[ 1187.197786] [<ffffffff81401155>] local_pci_probe+0x45/0xa0
[ 1187.203944] [<ffffffff81402345>] ? pci_match_device+0xe5/0x110
[ 1187.210485] [<ffffffff81402489>] pci_device_probe+0xd9/0x130
[ 1187.216842] [<ffffffff81506523>] driver_probe_device+0xa3/0x410
[ 1187.223478] [<ffffffff8150696b>] __driver_attach+0x9b/0xa0
[ 1187.229643] [<ffffffff815068d0>] ? __device_attach+0x40/0x40
[ 1187.236002] [<ffffffff815042eb>] bus_for_each_dev+0x6b/0xb0
[ 1187.242256] [<ffffffff81505f8e>] driver_attach+0x1e/0x20
[ 1187.248222] [<ffffffff81505b60>] bus_add_driver+0x180/0x250
[ 1187.254479] [<ffffffffc0344000>] ? 0xffffffffc0344000
[ 1187.260158] [<ffffffff81507164>] driver_register+0x64/0xf0
[ 1187.266334] [<ffffffff8140098c>] __pci_register_driver+0x4c/0x50
[ 1187.273077] [<ffffffffc0344126>] mlx4_init+0x126/0x1000 [mlx4_core]
[ 1187.280112] [<ffffffff81002148>] do_one_initcall+0xd8/0x210
[ 1187.286383] [<ffffffff811d5b49>] ? kmem_cache_alloc_trace+0x189/0x200
[ 1187.293753] [<ffffffff810f99c4>] ? load_module+0x15a4/0x1ce0
[ 1187.300109] [<ffffffff810f99fe>] load_module+0x15de/0x1ce0
[ 1187.306271] [<ffffffff810f51d0>] ? store_uevent+0x40/0x40
[ 1187.312333] [<ffffffff810fa276>] SyS_finit_module+0x86/0xb0
[ 1187.318595] [<ffffffff817c934d>] system_call_fastpath+0x16/0x1b
[ 1187.325233] Code: 74 1c 48 8b 03 90 48 8b 7b 08 48 83 c3 10 44 89 ea 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 e8 eb a6 66 0f 1f 44 00 00 66 66 66 66 90 <f0> ff 4f 1c 74 05 c3 0f 1f 40 00 55 85 f6 48 89 e5 74 08 e8 d3
[ 1187.346856] RIP [<ffffffff81181185>] __free_pages+0x5/0x30
[ 1187.353034] RSP <ffff88086cdb39a0>
[ 1187.356900] CR2: 000000000000001c
[ 1187.361080] ---[ end trace 9d9c0896e72e5313 ]---
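
When reading a log like the one above, it can help to list only the frames attributed to the driver. The following is a minimal sketch that assumes the `symbol+0xoff/0xlen [module]` frame format seen in this trace; `mlx4_frames` is a hypothetical helper name.

```shell
#!/bin/sh
# mlx4_frames is a hypothetical helper: it reads kernel log text on stdin
# and prints the symbol name of every frame tagged [mlx4_core].
mlx4_frames() {
    grep -oE '[A-Za-z0-9_]+\+0x[0-9a-f]+/0x[0-9a-f]+ \[mlx4_core\]' |
        sed 's/+0x.*//'           # keep only the symbol name
}

# Three lines copied from the oops above; the last one is not a driver frame.
mlx4_frames <<'EOF'
[ 1187.153003] [<ffffffffc08f9d9f>] ? mlx4_free_icm+0x17f/0x1d0 [mlx4_core]
[ 1187.160526] [<ffffffffc08fab7c>] mlx4_cleanup_icm_table+0x5c/0x80 [mlx4_core]
[ 1187.197786] [<ffffffff81401155>] local_pci_probe+0x45/0xa0
EOF
```

For the sample input this prints `mlx4_free_icm` and `mlx4_cleanup_icm_table`. On a live system the same function could be fed from `dmesg | mlx4_frames`.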

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1473883

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Chris J Arges (arges)
Changed in linux (Ubuntu Vivid):
assignee: nobody → Chris J Arges (arges)
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Vivid):
status: New → In Progress
importance: Undecided → Medium
description: updated
Chris J Arges (arges) wrote :

Sent patches to k-team ML for Vivid.
Since these are in 4.1, they should be picked up in Wily when we rebase.

Brad Figg (brad-figg)
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Triaged → Fix Committed
Chris J Arges (arges) wrote :

4.1 is in Wily, so this is Fix Released in Wily. Still awaiting verification for Vivid.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-26.28

---------------
linux (3.19.0-26.28) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1483630

  [ Upstream Kernel Changes ]

  * Revert "Bluetooth: ath3k: Add support of 04ca:300d AR3012 device"

linux (3.19.0-26.27) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1479055
  * [Config] updateconfigs for 3.19.8-ckt4 stable update

  [ Chris J Arges ]

  * [Config] Add MTD_POWERNV_FLASH and OPAL_PRD
    - LP: #1464560

  [ Mika Kuoppala ]

  * SAUCE: i915_bpo: drm/i915: Fix divide by zero on watermark update
    - LP: #1473175

  [ Tim Gardner ]

  * [Config] ACORN_PARTITION=n
    - LP: #1453117
  * [Config] Add i40e[vf] to d-i
    - LP: #1476393

  [ Timo Aaltonen ]

  * SAUCE: i915_bpo: Rebase to v4.2-rc3
    - LP: #1473175
  * SAUCE: i915_bpo: Revert "mm/fault, drm/i915: Use pagefault_disabled()
    to check for disabled pagefaults"
    - LP: #1473175
  * SAUCE: i915_bpo: Revert "drm: i915: Port to new backlight interface
    selection API"
    - LP: #1473175

  [ Upstream Kernel Changes ]

  * Revert "tools/vm: fix page-flags build"
    - LP: #1473547
  * Revert "ALSA: hda - Add mute-LED mode control to Thinkpad"
    - LP: #1473547
  * Revert "drm/radeon: adjust pll when audio is not enabled"
    - LP: #1473547
  * Revert "crypto: talitos - convert to use be16_add_cpu()"
    - LP: #1479048
  * module: Call module notifier on failure after complete_formation()
    - LP: #1473547
  * gpio: gpio-kempld: Fix get_direction return value
    - LP: #1473547
  * ARM: dts: imx27: only map 4 Kbyte for fec registers
    - LP: #1473547
  * ARM: 8356/1: mm: handle non-pmd-aligned end of RAM
    - LP: #1473547
  * x86/mce: Fix MCE severity messages
    - LP: #1473547
  * mac80211: don't use napi_gro_receive() outside NAPI context
    - LP: #1473547
  * iwlwifi: mvm: Free fw_status after use to avoid memory leak
    - LP: #1473547
  * iwlwifi: mvm: clean net-detect info if device was reset during suspend
    - LP: #1473547
  * drm/plane-helper: Adapt cursor hack to transitional helpers
    - LP: #1473547
  * ARM: dts: set display clock correctly for exynos4412-trats2
    - LP: #1473547
  * hwmon: (ntc_thermistor) Ensure iio channel is of type IIO_VOLTAGE
    - LP: #1473547
  * mfd: da9052: Fix broken regulator probe
    - LP: #1473547
  * ALSA: hda - Fix noise on AMD radeon 290x controller
    - LP: #1473547
  * lguest: fix out-by-one error in address checking.
    - LP: #1473547
  * xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
    - LP: #1473547
  * xfs: xfs_iozero can return positive errno
    - LP: #1473547
  * fs, omfs: add NULL terminator in the end up the token list
    - LP: #1473547
  * omfs: fix sign confusion for bitmap loop counter
    - LP: #1473547
  * d_walk() might skip too much
    - LP: #1473547
  * dm: fix casting bug in dm_merge_bvec()
    - LP: #1473547
  * hwmon: (nct6775) Add missing sysfs attribute initialization
    - LP: #1473547
  * hwmon: (nct6683) Add missing sysfs attribute initialization
    - LP: #1473547
  * target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
    - LP: #1473547
  * net...

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
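
The Launchpad Janitor message above names 3.19.0-26.28 as the fixed Vivid package. A minimal sketch for comparing an installed kernel package version against it follows. It assumes GNU `sort -V` ordering is close enough to Debian version ordering for these strings; `version_ge` is a hypothetical helper, and 3.19.0-16.16 is only a presumed package version for the 3.19.0-16-generic #16-Ubuntu kernel seen in the trace.

```shell
#!/bin/sh
FIXED=3.19.0-26.28    # from the Launchpad Janitor message

# version_ge A B: succeeds when A >= B under `sort -V` ordering.
# Hypothetical helper; dpkg --compare-versions is the stricter alternative.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On a Vivid host the installed version could be read with something like:
#   dpkg-query -W -f='${Version}' "linux-image-$(uname -r)"
for v in 3.19.0-16.16 3.19.0-26.28; do
    if version_ge "$v" "$FIXED"; then
        echo "$v: at or after the fixed upload"
    else
        echo "$v: predates the fixed upload"
    fi
done
```

A version at or after the fixed upload only indicates that the package should carry the two mlx4_core patches; the test case is still the way to verify the behaviour.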