Kernel panic when using KVM and mlx4_en driver (when bonding and sriov enabled)

Bug #1755268 reported by kvaps on 2018-03-12
This bug affects 1 person
Affects          Importance  Assigned to
linux (Ubuntu)   High        Unassigned
Xenial           High        Unassigned
Artful           High        Unassigned
Bionic           High        Unassigned

Bug Description

##### System information #####

    # uname -a
    Linux m5c37 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

    # cat /etc/os-release
    NAME="Ubuntu"
    VERSION="16.04.4 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.4 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.3-1.0.1
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

    # ethtool -i bond0
    driver: bonding
    version: 3.7.1
    firmware-version: 2
    expansion-rom-version:
    bus-info:
    supports-statistics: no
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no

    # ethtool -i vmbr0
    driver: bridge
    version: 2.3
    firmware-version: N/A
    expansion-rom-version:
    bus-info: N/A
    supports-statistics: no
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no

Mellanox driver was installed from
http://content.mellanox.com/ofed/MLNX_OFED-4.3-1.0.1.0/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz

    ./mlnxofedinstall --kernel 4.13.0-36-generic --without-dkms --add-kernel-support

##### Steps to reproduce #####

This is my /etc/network/interfaces file:

    auto lo
    iface lo inet loopback

    auto openibd
    iface openibd inet manual
            pre-up /etc/init.d/openibd start
            pre-down /etc/init.d/openibd force-stop

    auto bond0
    iface bond0 inet manual
            pre-up ip link add bond0 type bond || true
            pre-up ip link set bond0 down
            pre-up ip link set bond0 type bond mode active-backup arp_interval 2000 arp_ip_target 10.36.0.1 arp_validate 3 primary eno1
            pre-up ip link set eno1 down
            pre-up ip link set eno1d1 down
            pre-up ip link set eno1 master bond0
            pre-up ip link set eno1d1 master bond0
            pre-up ip link set bond0 up
            pre-down ip link del bond0

    auto vmbr0
    iface vmbr0 inet static
            address 10.36.128.217
            netmask 255.255.0.0
            gateway 10.36.0.1
            bridge_ports bond0
            bridge_stp off
            bridge_fd 0
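
Before starting the VM it can help to confirm which slave the active-backup bond is actually using. A minimal sketch, assuming the bonding driver exposes its usual `Currently Active Slave:` line in /proc/net/bonding/bond0; the function name `active_slave` is hypothetical and not part of the original report:

```shell
# Hypothetical helper: print the currently active slave of a bond.
# Reads bonding status text on stdin (for example /proc/net/bonding/bond0)
# and prints only the interface name from the "Currently Active Slave" line.
active_slave() {
    sed -n 's/^Currently Active Slave: //p'
}

# Usage (on the affected host):
#   active_slave < /proc/net/bonding/bond0
```

With the configuration above, `eno1` would be expected while the primary link is healthy.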

I execute these commands:

    wget http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-virt-3.7.0-x86_64.iso -O alpine.iso
    qemu-system-x86_64 -machine pc-i440fx-xenial,accel=kvm,usb=off -boot d -cdrom alpine.iso -m 512 -nographic -device e1000,netdev=net0 -netdev tap,id=net0

After a few moments the kernel hangs, with these messages on the console:

    [74390.187908] mlx4_core 0000:11:00.0: bond for multifunction failed
    [74390.486476] mlx4_en: eno1d1: Fail to bond device
    [74390.750758] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [74391.152326] general protection fault: 0000 [#1] SMP PTI
    [74391.410424] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192

The kernel trace log is attached.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.13.0-36-generic 4.13.0-36.40~16.04.1
ProcVersionSignature: Ubuntu 4.13.0-36.40~16.04.1-generic 4.13.13
Uname: Linux 4.13.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
Date: Mon Mar 12 19:59:16 2018
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C
 SHELL=/bin/bash
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

kvaps (kvapss) wrote :

I tried loading the modules directly instead of using the /etc/init.d/openibd script:

    rmmod mlx4_en mlx4_core
    modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1

The result is the same:

    [ 193.469331] mlx4_core 0000:11:00.0: bond for multifunction failed
    [ 193.770126] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [ 194.170171] mlx4_en: eno1d1: Fail to bond device
    [ 194.170178] nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_set ip_set_list_set ip_set_hash_net veth beegfs(OE) dummy nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc fscache xt_comment xt_mark netconsole mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q garp mrp ib_core(OE) mlx_compat(OE) bridge stp llc bonding ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 hpilo crypto_simd
    [ 194.170482] glue_helper cryptd intel_cstate mei_me ipmi_si ipmi_devintf ipmi_msghandler intel_rapl_perf shpchp mei acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd ptp pps_core i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvme ahci nvme_core libahci devlink [last unloaded: mlx4_core]
    [ 194.170550] CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G W OE 4.13.0-36-generic #40~16.04.1-Ubuntu

kvaps (kvapss) wrote :

I have just tested this, and the problem also occurs on the stock kernel.

    # uname -a
    Linux m5c43 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The bug seems to be tied to specific hardware; I'll provide more information about this.
I use an HP Moonshot m710x cartridge:

    # lspci
    00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 0a)
    00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16) (rev 0a)
    00:02.0 VGA compatible controller: Intel Corporation Device 193a (rev 09)
    00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
    00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
    00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
    00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #17 (rev f1)
    00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
    00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #4 (rev f1)
    00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
    00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
    00:1d.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #14 (rev f1)
    00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
    00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
    00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
    01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 06)
    01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH (rev 01)
    01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 06)
    01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out Standard Virtual USB Controller (rev 03)
    0e:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03)
    11:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
    11:00.1 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.3-1.0.1
    firmware-version: 2.40.5540
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

    # mst status
    MST modules:
    ------------
        MST PCI module loaded
        MST PCI configuration module loaded

    MST devices:
    ------------
    /dev/mst/mt4103_pci_cr0 - PCI direct access.
                                       domain:bus:dev.fn=0000:11:00.0 bar=0x7f100000 size=0x100000
                                       Chip revision is: 00
    /dev/mst/mt4103_pciconf0 - PCI configuration cycles access.
                              ...


kvaps (kvapss) wrote :
affects: linux-hwe (Ubuntu) → linux (Ubuntu)

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16-rc6

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → High
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
importance: Undecided → High
kvaps (kvapss) wrote :

Hi Joseph, thanks for your answer.

Currently I have two different hardware configurations:

- HPE ProLiant m710p Server Cartridge (does not have this problem)
- HPE ProLiant m710x Server Cartridge (has this problem)

> Did this issue start happening after an update/upgrade?
> Was there a prior kernel version where you were not having this particular problem?

Well, I use a debootstrap script to install all the needed software automatically and build an image with the base system.
After that I use this image to boot my nodes via PXE, so on each boot I have a system installed from scratch.
I've tested the following kernels:
- Ubuntu 16.04 with stock kernel: 4.4.0-116-generic
- Ubuntu 16.04 with hwe kernel: 4.13.0-36-generic
- Ubuntu 16.04 with pve kernel: 4.13.13-6-pve
- Debian 9 with pve kernel: 4.13.13-6-pve
- Debian 9 with stock kernel: 4.9.0-6-amd64

All of them have this problem, although with the stock kernels the crash may only happen after some time.
(The only case where I did not see the error was Debian with 4.9.0-6-amd64, but I presume it exists there too, because I have not tested it properly.)

Another thing: if I run these steps AFTER the system has booted:

    rmmod mlx4_en mlx4_ib mlx4_core
    modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1
    systemctl restart networking

Everything starts working fine.
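
The post-boot workaround can be wrapped in a small script so it runs the same way on every node. A minimal sketch: the commands and module parameters are the ones quoted in this comment, while the `run` and `reload_mlx4` function names and the `DRY_RUN` switch are hypothetical additions for illustration. It has to run as root, and it briefly takes the network down:

```shell
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reload the mlx4 modules with the SR-IOV parameters from this report,
# then restart networking. Stops at the first step that fails.
reload_mlx4() {
    run rmmod mlx4_en mlx4_ib mlx4_core &&
    run modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1 &&
    run systemctl restart networking
}

# Usage:
#   DRY_RUN=1 reload_mlx4   # print the commands without running them
#   reload_mlx4             # actually reload the modules (as root)
```

This only automates the steps reported to help; it does not address the underlying crash.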

> Please test the latest v4.16 kernel[0].

Ok, I'll do this

kvaps (kvapss) wrote :

Hi, I couldn't build the new Mellanox drivers for the new kernel, so I used the default one:

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.0-0
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

After creating a virtual machine, my kernel got stuck with these messages:

    [ 2170.511433] kvm [28464]: vcpu0, guest rIP: 0xffffffff810644d8 disabled perfctr wrmsr: 0xc2 data 0xffff
    [ 2170.963166] ------------[ cut here ]------------

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Xenial):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Artful):
status: Incomplete → Confirmed
kvaps (kvapss) wrote :

Additionally, I've tested the latest hwe kernel (4.13.0-37-generic) with the built-in driver; the same problem occurs:

    [ 1011.070739] kvm [16361]: vcpu0, guest rIP: 0xffffffff810644d8 disabled perfctr wrmsr: 0xc2 data 0xffff
    [ 1011.528347] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [ 1011.927642] general protection fault: 0000 [#1] SMP PTI
    [ 1012.185439] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192

driver version:

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.0-0
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu Bionic):
status: Confirmed → Triaged
Changed in linux (Ubuntu Artful):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: Confirmed → Triaged
kvaps (kvapss) on 2018-03-20
summary: - Kernel panic when using KVM and Mellanox OFED driver (bonding and sriov
+ Kernel panic when using KVM and mlx4_en driver (when bonding and sriov
enabled)
description: updated
kvaps (kvapss) wrote :

OK, this is the message that I wrote to the kernel's netdev list:
https://<email address hidden>/msg223827.html

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status: Triaged → Won't Fix