Kernel panic when using KVM and mlx4_en driver (when bonding and sriov enabled)

Bug #1755268 reported by kvaps on 2018-03-12
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Unassigned
Xenial           High         Unassigned
Artful           High         Unassigned
Bionic           High         Unassigned

Bug Description

##### System information #####

    # uname -a
    Linux m5c37 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

    # cat /etc/os-release
    NAME="Ubuntu"
    VERSION="16.04.4 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.4 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.3-1.0.1
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

    # ethtool -i bond0
    driver: bonding
    version: 3.7.1
    firmware-version: 2
    expansion-rom-version:
    bus-info:
    supports-statistics: no
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no

    # ethtool -i vmbr0
    driver: bridge
    version: 2.3
    firmware-version: N/A
    expansion-rom-version:
    bus-info: N/A
    supports-statistics: no
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no

Mellanox driver was installed from
http://content.mellanox.com/ofed/MLNX_OFED-4.3-1.0.1.0/MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz

    ./mlnxofedinstall --kernel 4.13.0-36-generic --without-dkms --add-kernel-support

##### Steps to reproduce #####

This is my /etc/network/interfaces file:

    auto lo
    iface lo inet loopback

    auto openibd
    iface openibd inet manual
            pre-up /etc/init.d/openibd start
            pre-down /etc/init.d/openibd force-stop

    auto bond0
    iface bond0 inet manual
            pre-up ip link add bond0 type bond || true
            pre-up ip link set bond0 down
            pre-up ip link set bond0 type bond mode active-backup arp_interval 2000 arp_ip_target 10.36.0.1 arp_validate 3 primary eno1
            pre-up ip link set eno1 down
            pre-up ip link set eno1d1 down
            pre-up ip link set eno1 master bond0
            pre-up ip link set eno1d1 master bond0
            pre-up ip link set bond0 up
            pre-down ip link del bond0

    auto vmbr0
    iface vmbr0 inet static
            address 10.36.128.217
            netmask 255.255.0.0
            gateway 10.36.0.1
            bridge_ports bond0
            bridge_stp off
            bridge_fd 0
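Before starting a VM on top of this setup, it can be useful to confirm that bond0 really came up in active-backup mode with both ports enslaved. The following is a minimal sketch, not part of the original setup: `bond_summary` is a hypothetical helper name, and the field labels it matches ("Bonding Mode", "Currently Active Slave", "Slave Interface") are assumed to be the ones the bonding driver prints in /proc/net/bonding/bond0.

```shell
#!/bin/sh
# Hypothetical helper: summarise a bonding status file.
# $1 is the status file; the default is the usual procfs path for bond0.
bond_summary() {
    file=${1:-/proc/net/bonding/bond0}
    # Split each line at ": " and print only the fields of interest.
    awk -F': ' '
        /^Bonding Mode/           { print "mode=" $2 }
        /^Currently Active Slave/ { print "active=" $2 }
        /^Slave Interface/        { print "slave=" $2 }
    ' "$file"
}
```

Calling `bond_summary` with no argument reads the live state of bond0; passing a file name allows checking a saved copy from another node.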

I execute these commands:

    wget http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-virt-3.7.0-x86_64.iso -O alpine.iso
    qemu-system-x86_64 -machine pc-i440fx-xenial,accel=kvm,usb=off -boot d -cdrom alpine.iso -m 512 -nographic -device e1000,netdev=net0 -netdev tap,id=net0

After a few moments the kernel hangs, and these messages appear on the console:

    [74390.187908] mlx4_core 0000:11:00.0: bond for multifunction failed
    [74390.486476] mlx4_en: eno1d1: Fail to bond device
    [74390.750758] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [74391.152326] general protection fault: 0000 [#1] SMP PTI
    [74391.410424] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
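To check whether another node shows the same symptoms, the kernel log can be filtered for the messages quoted above. This is a minimal sketch: `scan_crash_signatures` is a hypothetical helper name, and the patterns are simply copied from the console output in this report, so they may not cover every variant of the failure.

```shell
#!/bin/sh
# Hypothetical helper: print only the kernel log lines that match the
# symptoms quoted in this report. Reads log text on stdin, so it works
# with dmesg, a netconsole capture, or a saved log file.
scan_crash_signatures() {
    grep -E 'bond for multifunction failed|Fail to bond device|cache_from_obj: Wrong slab cache|general protection fault'
}
```

Typical use would be `dmesg | scan_crash_signatures`; an empty result only means that none of these four messages were logged.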

The kernel trace log is attached.

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.13.0-36-generic 4.13.0-36.40~16.04.1
ProcVersionSignature: Ubuntu 4.13.0-36.40~16.04.1-generic 4.13.13
Uname: Linux 4.13.0-36-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
Date: Mon Mar 12 19:59:16 2018
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C
 SHELL=/bin/bash
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

kvaps (kvapss) wrote :

I tried loading the modules directly instead of using the /etc/init.d/openibd script:

    rmmod mlx4_en mlx4_core
    modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1

The result is the same:

    [ 193.469331] mlx4_core 0000:11:00.0: bond for multifunction failed
    [ 193.770126] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [ 194.170171] mlx4_en: eno1d1: Fail to bond device
    [ 194.170178] nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables xt_set ip_set_list_set ip_set_hash_net veth beegfs(OE) dummy nf_conntrack_netlink xt_nat xt_tcpudp xt_recent ip_set nfnetlink ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc fscache xt_comment xt_mark netconsole mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack libcrc32c br_netfilter 8021q garp mrp ib_core(OE) mlx_compat(OE) bridge stp llc bonding ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 hpilo crypto_simd
    [ 194.170482] glue_helper cryptd intel_cstate mei_me ipmi_si ipmi_devintf ipmi_msghandler intel_rapl_perf shpchp mei acpi_power_meter mac_hid ie31200_edac knem(OE) autofs4 overlay nbd ptp pps_core i915 mgag200 video ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm nvme ahci nvme_core libahci devlink [last unloaded: mlx4_core]
    [ 194.170550] CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G W OE 4.13.0-36-generic #40~16.04.1-Ubuntu
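To rule out that the options on the modprobe line above were silently ignored, the values the driver ended up with can be read back. This is a minimal sketch under assumptions: `show_module_params` is a hypothetical helper name, and it assumes this mlx4_core build exposes `num_vfs`, `port_type_array` and `probe_vf` as readable files under /sys/module/mlx4_core/parameters, which may differ between the in-tree driver and MLNX_OFED.

```shell
#!/bin/sh
# Hypothetical helper: print selected mlx4_core parameters as name=value.
# $1 is the parameters directory; the default is the usual sysfs location
# (assumed to be exposed by the loaded driver build).
show_module_params() {
    dir=${1:-/sys/module/mlx4_core/parameters}
    for name in num_vfs port_type_array probe_vf; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            # The parameter is missing or not readable on this build.
            printf '%s=<not readable>\n' "$name"
        fi
    done
}
```

Running `show_module_params` right after modprobe and comparing the output with the options that were passed shows whether they were applied.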

kvaps (kvapss) wrote :

I have just tested this: the problem also occurs on the stock kernel.

    # uname -a
    Linux m5c43 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The bug seems to be connected only with specific hardware; I'll provide more information about this.
I use an HP Moonshot m710x cartridge:

    # lspci
    00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 0a)
    00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16) (rev 0a)
    00:02.0 VGA compatible controller: Intel Corporation Device 193a (rev 09)
    00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
    00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
    00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
    00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #17 (rev f1)
    00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #1 (rev f1)
    00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #4 (rev f1)
    00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
    00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #9 (rev f1)
    00:1d.5 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #14 (rev f1)
    00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
    00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
    00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
    01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 06)
    01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH (rev 01)
    01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 06)
    01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out Standard Virtual USB Controller (rev 03)
    0e:00.0 Non-Volatile memory controller: Intel Corporation Device f1a5 (rev 03)
    11:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
    11:00.1 Ethernet controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.3-1.0.1
    firmware-version: 2.40.5540
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

    # mst status
    MST modules:
    ------------
        MST PCI module loaded
        MST PCI configuration module loaded

    MST devices:
    ------------
    /dev/mst/mt4103_pci_cr0 - PCI direct access.
                                       domain:bus:dev.fn=0000:11:00.0 bar=0x7f100000 size=0x100000
                                       Chip revision is: 00
    /dev/mst/mt4103_pciconf0 - PCI configuration cycles access.
                              ...


kvaps (kvapss) wrote :
affects: linux-hwe (Ubuntu) → linux (Ubuntu)

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16-rc6

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → High
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
importance: Undecided → High
kvaps (kvapss) wrote :

Hi Joseph, thanks for your answer.

Currently I have two different hardware configurations:

- HPE ProLiant m710p Server Cartridge (does not have this problem)
- HPE ProLiant m710x Server Cartridge (has this problem)

> Did this issue start happening after an update/upgrade?
> Was there a prior kernel version where you were not having this particular problem?

Well, I use a debootstrap script to install all the needed software automatically and build an image with the base system.
I then use this image to boot my nodes via PXE, so on each boot I have a system that was installed from scratch.
I've tested the following kernels:
- Ubuntu 16.04 with stock kernel: 4.4.0-116-generic
- Ubuntu 16.04 with hwe kernel: 4.13.0-36-generic
- Ubuntu 16.04 with pve kernel: 4.13.13-6-pve
- Debian 9 with pve kernel: 4.13.13-6-pve
- Debian 9 with stock kernel: 4.9.0-6-amd64

All of them have this problem, although with the stock kernels the crash may only occur after some time.
(The only one where I did not see the error was Debian with 4.9.0-6-amd64, but I presume it exists there too, because I have not tested it properly.)

Another thing: if I run these steps AFTER the system has booted:

    rmmod mlx4_en mlx4_ib mlx4_core
    modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1
    systemctl restart networking

Everything starts working fine.
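For repeating this workaround on several nodes, the three commands above can be wrapped in a small script with a dry-run switch. This is a minimal sketch: the names `run`, `reload_mlx4` and the `DRY_RUN` variable are hypothetical, while the commands and module options are exactly the ones quoted above.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 the command is only printed,
# otherwise it is executed as given.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Reload the mlx4 modules with the SR-IOV options and restart networking.
# Each step runs only if the previous one succeeded.
reload_mlx4() {
    run rmmod mlx4_en mlx4_ib mlx4_core &&
    run modprobe mlx4_core num_vfs=1 port_type_array=2,2 probe_vf=1 &&
    run systemctl restart networking
}
```

Setting `DRY_RUN=1` before calling `reload_mlx4` prints the commands without touching the system, which is a safe first check on a node that is reached over the bonded interface.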

> Please test the latest v4.16 kernel[0].

Ok, I'll do this

kvaps (kvapss) wrote :

Hi, I couldn't build the new Mellanox drivers for the new kernel, so I used the default one:

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.0-0
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

After I created a virtual machine, the kernel got stuck with these messages:

    [ 2170.511433] kvm [28464]: vcpu0, guest rIP: 0xffffffff810644d8 disabled perfctr wrmsr: 0xc2 data 0xffff
    [ 2170.963166] ------------[ cut here ]------------

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Xenial):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Artful):
status: Incomplete → Confirmed
kvaps (kvapss) wrote :

Additionally, I've tested the latest hwe kernel (4.13.0-37-generic) with the built-in driver; the same problem occurs:

    [ 1011.070739] kvm [16361]: vcpu0, guest rIP: 0xffffffff810644d8 disabled perfctr wrmsr: 0xc2 data 0xffff
    [ 1011.528347] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192
    [ 1011.927642] general protection fault: 0000 [#1] SMP PTI
    [ 1012.185439] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmalloc-192

Driver version:

    # ethtool -i eno1
    driver: mlx4_en
    version: 4.0-0
    firmware-version: 2.42.5004
    expansion-rom-version:
    bus-info: 0000:11:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu Bionic):
status: Confirmed → Triaged
Changed in linux (Ubuntu Artful):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: Confirmed → Triaged
kvaps (kvapss) on 2018-03-20
summary: - Kernel panic when using KVM and Mellanox OFED driver (bonding and sriov
+ Kernel panic when using KVM and mlx4_en driver (when bonding and sriov
enabled)
description: updated
kvaps (kvapss) wrote :

OK, this is the message that I wrote to the kernel's netdev list:
https://<email address hidden>/msg223827.html
