[hns-1106] vm crash for hns vf function

Bug #1903267 reported by Fred Kimmy
This bug affects 1 person
Affects           Status        Importance  Assigned to  Milestone
kunpeng920        Fix Released  Critical    Unassigned
Ubuntu-18.04      Won't Fix     Critical    Unassigned
Ubuntu-18.04-hwe  Invalid       Critical    Unassigned
Ubuntu-20.04      Invalid       Critical    Unassigned
Ubuntu-20.04-hwe  Fix Released  Critical    Unassigned
Upstream-kernel   Fix Released  Critical    Unassigned

Bug Description

[Bug Description]

[Steps to Reproduce]
1) install VM
2) apt install -y bridge-utils
3) echo 1 > /sys/class/net/eth0/device/sriov_numvfs
ethtool -i eth0v0
brctl addbr br00
ifconfig br00 5.5.5.1/24
brctl addif br00 eth0
brctl addif br00 eth01
virsh start vm1
virsh start vm2

vim sl1.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
             <address domain='0x0000' bus='0x7d' slot='0x01' function='0x7'/>
     </source>
</hostdev>
vim sl2.xml
virsh attach-device vm1 sl1.xml
virsh attach-device vm2 sl2.xml

TCP: netperf -H <Server IP> -t TCP_STREAM -l 60 -- -m 1472

UDP: netperf -H <Server IP> -t UDP_STREAM -l 60 -- -m 1472
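
The hostdev XML used in the steps above follows a fixed pattern, so it can be generated from a PCI address instead of edited by hand. This is a minimal sketch, not part of the original report: the helper name `hostdev_xml` is hypothetical, and the only address used is the one quoted in sl1.xml. The content of sl2.xml is not given in the report, so no second address is assumed here.

```python
import re

# Hypothetical helper: build a libvirt <hostdev> element from a PCI address
# written as domain:bus:slot.function, e.g. "0000:7d:01.7".
def hostdev_xml(bdf: str) -> str:
    m = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])", bdf
    )
    if m is None:
        raise ValueError(f"not a PCI address: {bdf!r}")
    domain, bus, slot, function = m.groups()
    # Same layout as the sl1.xml shown in the reproduction steps.
    return (
        "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
        "     <source>\n"
        f"             <address domain='0x{domain}' bus='0x{bus}' "
        f"slot='0x{slot}' function='0x{function}'/>\n"
        "     </source>\n"
        "</hostdev>\n"
    )

if __name__ == "__main__":
    # Address taken from sl1.xml in the report.
    print(hostdev_xml("0000:7d:01.7"))
```

The output can be written to a file and passed to `virsh attach-device` as in the steps above.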

[Actual Results]
The VM crashes.

[Expected Results]
The VM keeps running normally.

[Reproducibility]
100%

[Additional information]
(Firmware version, kernel version, affected hardware, etc. if required):
[ 1273.099819] pci 0000:7c:00.0: AER: Device recovery successful
[ 1273.099829] hns3 0000:7d:00.2: AER: aer_status: 0x00000000, aer_mask: 0x00000000
[ 1273.107205] hns3 0000:7d:00.2: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 1273.112484] hns3 0000:7d:00.0: cleaned 0, need to clean 1
[ 1273.115095] hns3 0000:7d:00.0: get link status cmd failed -52
[ 1273.115100] hns3 0000:7d:00.2: AER: aer_uncor_severity: 0x00000000
[ 1273.127000] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[ 1273.127027] hns3 0000:7d:00.1: PCI error detected, state(=1)!!
[ 1273.127036] hns3 0000:7d:00.2: PCI error detected, state(=1)!!
[ 1273.127045] hns3 0000:7d:00.3: PCI error detected, state(=1)!!
[ 1273.127114] pci 0000:7c:00.0: AER: Device recovery successful
[ 1273.127123] hns3 0000:7d:00.3: AER: aer_status: 0x00000000, aer_mask: 0x00000000
[ 1273.127411] br00: port 1(eno3) entered disabled state
[ 1273.134507] hns3 0000:7d:00.3: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
[ 1273.142403] hns3 0000:7d:00.3: AER: aer_uncor_severity: 0x00000000
[ 1273.148576] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[ 1273.148610] hns3 0000:7d:00.1: PCI error detected, state(=1)!!
[ 1273.148626] hns3 0000:7d:00.2: PCI error detected, state(=1)!!
[ 1273.148637] hns3 0000:7d:00.3: PCI error detected, state(=1)!!
[ 1273.148712] pci 0000:7c:00.0: AER: Device recovery successful
[ 1273.149370] br00: port 2(eno4) entered disabled state
[ 1273.149584] hns3 0000:7d:00.0 eno1: link down
[ 1273.198236] hns3 0000:7d:00.2: prepare wait ok
[ 1273.206287] hns3 0000:7d:00.3: prepare wait ok
[ 1273.206303] hns3 0000:7d:00.1: prepare wait ok
[ 1273.254285] hns3 0000:7d:00.0: prepare wait ok
[ 1310.998220] hns3 0000:7d:00.2: Wait for reset timeout: 6
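
To make a log like the one above easier to compare between runs, the per-device events can be counted. The following is a rough sketch only: the function names and the regular expression are assumptions made for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel log lines such as
# "[ 1273.127000] hns3 0000:7d:00.0: PCI error detected, state(=1)!!"
# An optional interface name (e.g. "eno1") may follow the PCI address.
LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<driver>\S+)\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"(?:\s+\S+)?:\s+"
    r"(?P<msg>.*)$"
)

def pci_error_counts(log: str) -> Counter:
    """Count 'PCI error detected' messages per PCI device."""
    counts = Counter()
    for line in log.splitlines():
        m = LINE.match(line)
        if m and m.group("msg").startswith("PCI error detected"):
            counts[m.group("dev")] += 1
    return counts

def reset_timeouts(log: str) -> list:
    """Return devices that logged 'Wait for reset timeout', in log order."""
    devices = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m and m.group("msg").startswith("Wait for reset timeout"):
            devices.append(m.group("dev"))
    return devices

# A few lines copied from the log in this report.
SAMPLE = """\
[ 1273.127000] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[ 1273.127027] hns3 0000:7d:00.1: PCI error detected, state(=1)!!
[ 1273.127114] pci 0000:7c:00.0: AER: Device recovery successful
[ 1273.148576] hns3 0000:7d:00.0: PCI error detected, state(=1)!!
[ 1310.998220] hns3 0000:7d:00.2: Wait for reset timeout: 6
"""

if __name__ == "__main__":
    print(pci_error_counts(SAMPLE))
    print(reset_timeouts(SAMPLE))
```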

[Resolution]
74ef402 net: hns3: fix for fraglist SKB headlen not handling correctly
0ec3b6a net: hns3: fix for not unmapping TX buffer correctly
48ae74c net: hns3: fix for not calculating TX BD send size correctly
8ceca59 net: hns3: fix error handling for desc filling
cfdaeba net: hns3: fix desc filling bug when skb is expanded or lineared

Tags: tairadar
Fred Kimmy (kongzizaixian) wrote :
Changed in kunpeng920:
importance: Undecided → Critical
Taihsiang Ho (tai271828) wrote :

All commits will be in focal hwe 5.8 kernel tree:

Ubuntu-hwe-5.8-5.8.0-25.26_20.04.1 74ef402e134b net: hns3: fix for fraglist SKB headlen not handling correctly
Ubuntu-hwe-5.8-5.8.0-25.26_20.04.1 0ec3b6a7c026 net: hns3: fix for not unmapping TX buffer correctly
Ubuntu-hwe-5.8-5.8.0-25.26_20.04.1 48ae74c9d89f net: hns3: fix for not calculating TX BD send size correctly
Ubuntu-hwe-5.8-5.8.0-25.26_20.04.1 8ceca59fb3ed net: hns3: fix error handling for desc filling
Ubuntu-hwe-5.8-5.8.0-25.26_20.04.1 cfdaeba5ddc9 net: hns3: fix desc filling bug when skb is expanded or lineared

Taihsiang Ho (tai271828)
description: updated
Taihsiang Ho (tai271828)
description: updated
Taihsiang Ho (tai271828) wrote :

Hi @Fred

Could you please provide the following information:

- image and kernel version of your ubuntu version (e.g focal?) and kernel(5.4?)?
- image version of the vm/guest system (e.g. ubuntu cloud image for arm64 or amd64? or server image?)
- vm configuration information if possible[1]
- (nice to have) the tool you launch the VMs (qemu, virsh, virt-manager, virt-inst, uvtool, or ...?)

[1] You may dump vm information by

    virsh dumpxml vm1 > vm1.xml

It will be very helpful if the corresponding vm1.xml is attached.

Fred Kimmy (kongzizaixian) wrote :

May you help to provide the following information:

- image and kernel version of your ubuntu version (e.g focal?) and kernel(5.4?)?
5.4.0-42-generic
- image version of the vm/guest system (e.g. ubuntu cloud image for arm64 or amd64? or server image?)
The VM/guest system uses the Ubuntu 20.04.1 ISO for testing.
- vm configuration information if possible[1]
- (nice to have) the tool you launch the VMs (qemu, virsh, virt-manager, virt-inst, uvtool, or ...?)
We are also reproducing it now; the customer's machine has already reproduced it.
apt install qemu-kvm virtinst

[1] You may dump vm information by

    virsh dumpxml vm1 > vm1.xml

It will be very helpful if the corresponding vm1.xml is attached.

Taihsiang Ho (tai271828) wrote :

Regarding focal 5.4, there is only

    aa9d22dd456e net: hns3: fix error handling for desc filling

in the kernel src tree (since Ubuntu-5.4.0-43.47)

This means 5.4.0-42 has none of the proposed fixes.
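
The reasoning above is a comparison of Ubuntu ABI numbers within the 5.4.0 series. A minimal sketch of that check follows, assuming the ABI number is the integer after the first hyphen; the function names are hypothetical. It only covers the single commit "fix error handling for desc filling", which the comment above places in Ubuntu-5.4.0-43.47, and says nothing about the other four patches.

```python
import re

def abi_tuple(version: str) -> tuple:
    """Turn '5.4.0-42-generic' or '5.4.0-43.47' into (5, 4, 0, 42)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    return tuple(int(part) for part in m.groups())

def has_desc_fill_fix(running: str, first_fixed: str = "5.4.0-43.47") -> bool:
    """True if `running` is at or after the first release carrying the commit.

    The default for first_fixed comes from the comment above.
    """
    run, fixed = abi_tuple(running), abi_tuple(first_fixed)
    if run[:3] != fixed[:3]:
        # ABI numbers are only comparable inside one kernel series.
        raise ValueError("different kernel series; cannot compare ABI numbers")
    return run[3] >= fixed[3]

if __name__ == "__main__":
    for release in ("5.4.0-42-generic", "5.4.0-53-generic"):
        print(release, has_desc_fill_fix(release))
```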

Taihsiang Ho (tai271828) wrote :

Currently still trying to reproduce the issue. The hostdev passthrough is working on focal (5.4.0-53-generic). However, uvtool is not recommended in this case because the corresponding cloud image has no hns3 module loaded when the passthrough device is attached.

Taihsiang Ho (tai271828) wrote :

Hi @Fred

Thank you for the feedback. If possible, I would like to know more about the following:
 - Are eth0 and/or eth1 physically connected with cables?
 - Could you provide the output of "route -n" in your environment? It looks like the subnet 5.5.5.0/24 is pre-defined, and there is presumably a router taking care of this subnet. Is there such a router?
 - Is 5.5.5.12 in your crash.png a pre-configured netperf server? 5.5.5.12 is not vm1 or vm2, right? Or is 5.5.5.12 vm2?
 - Is "eth01" in "brctl addif br00 eth01" a typo?

Taihsiang Ho (tai271828)
tags: added: tairadar
Taihsiang Ho (tai271828)
Changed in kunpeng920:
status: New → In Progress
Taihsiang Ho (tai271828) wrote :

Regarding bionic hwe-5.4, there is only

    56ac7541c8bf net: hns3: fix error handling for desc filling

in the kernel src tree.

(Comment #3 is incorrect because of the identical git log message. Let me hide comment #3 to avoid confusion.)

Taihsiang Ho (tai271828) wrote :

Summary of the current back-porting status:

bionic - does not cherry-pick cleanly
bionic-hwe (5.4) - cherry-picks cleanly
focal - cherry-picks cleanly
focal-hwe (5.8) - fix already committed
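
Because the same patch carries a different hash in each Ubuntu tree (for example 8ceca59fb3ed, aa9d22dd456e and 56ac7541c8bf for "fix error handling for desc filling"), a status like the one above has to be derived from commit subjects rather than hashes. The sketch below assumes input in `git log --oneline` format; the function name is hypothetical, and the subjects are the five listed under [Resolution]. A matching subject does not prove the patch content is identical, so treat the result as a first pass only.

```python
# Subjects of the five fixes listed under [Resolution].
FIX_SUBJECTS = [
    "net: hns3: fix for fraglist SKB headlen not handling correctly",
    "net: hns3: fix for not unmapping TX buffer correctly",
    "net: hns3: fix for not calculating TX BD send size correctly",
    "net: hns3: fix error handling for desc filling",
    "net: hns3: fix desc filling bug when skb is expanded or lineared",
]

def missing_fixes(oneline_log: str) -> list:
    """Return the fix subjects absent from `git log --oneline` style text."""
    present = set()
    for line in oneline_log.splitlines():
        # Each line is "<abbreviated hash> <subject>".
        _commit, _, subject = line.strip().partition(" ")
        present.add(subject)
    return [subject for subject in FIX_SUBJECTS if subject not in present]

# The only matching commit reported for the focal 5.4 tree in this thread.
FOCAL_54 = "aa9d22dd456e net: hns3: fix error handling for desc filling"

if __name__ == "__main__":
    for subject in missing_fixes(FOCAL_54):
        print("missing:", subject)
```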

Taihsiang Ho (tai271828) wrote :

@Fred

I did not manage to reproduce this issue, and I need a more detailed description of the environment setup and/or more elaborate reproduction steps. Until I have that information, I will try to provide a kernel deb for you to test in your environment. Is that OK for you?

Ike Panhc (ikepanhc) wrote :

The patches do not cherry-pick cleanly to 18.04. Set to Won't Fix.

Ike Panhc (ikepanhc) wrote :

The patches cherry-pick cleanly to 20.04. Git branch pushed to

https://kernel.ubuntu.com/git/ikepanhc/public.git/log/?h=lp1903267.1

I am also trying to reproduce before sending them to kernel team.

Ike Panhc (ikepanhc) wrote :

Test kernel debs built at https://kernel.ubuntu.com/~ikepanhc/lp1903267.1/
Still working on reproducing the issue.

Ike Panhc (ikepanhc) wrote :

In #7 Tai says there is no hns3 module loaded in the guest. I believe this is because linux-modules-extra is not installed in the Ubuntu cloud image. We need to install it manually.

ubuntu@vm1:~$ dpkg --list | grep linux-mod
ii linux-modules-5.4.0-54-generic 5.4.0-54.60 arm64 Linux kernel extra modules for version 5.4.0 on ARMv8 SMP
ubuntu@vm1:~$
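
The check above can be scripted when preparing several guests. This is a minimal sketch that parses `dpkg --list` text; the function names are hypothetical, and the package name pattern `linux-modules-extra-<kernel release>` is the usual Ubuntu naming, assumed here rather than taken from this thread.

```python
def installed_packages(dpkg_list_output: str) -> set:
    """Collect names of fully installed packages ('ii' rows) from dpkg --list."""
    names = set()
    for line in dpkg_list_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii":
            # Drop an architecture qualifier such as ":arm64" if present.
            names.add(fields[1].split(":")[0])
    return names

def has_modules_extra(dpkg_list_output: str, release: str) -> bool:
    """True if linux-modules-extra for the given kernel release is installed."""
    return f"linux-modules-extra-{release}" in installed_packages(dpkg_list_output)

# Line copied from the guest output above.
GUEST = (
    "ii linux-modules-5.4.0-54-generic 5.4.0-54.60 arm64 "
    "Linux kernel extra modules for version 5.4.0 on ARMv8 SMP"
)

if __name__ == "__main__":
    print(has_modules_extra(GUEST, "5.4.0-54-generic"))
```

For the guest shown above the check reports that the extra modules package is absent, which matches the missing hns3 module.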

Ike Panhc (ikepanhc) wrote :

I cannot reproduce the issue with the 5.4.0-54.60 kernel; please let me know if my steps are wrong.

The attachment is the script I use to build VM guests on a kunpeng920 machine.
The net1.xml and net2.xml used in the script are:

$ cat net1.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
             <address domain='0x0000' bus='0xbd' slot='0x01' function='0x0'/>
     </source>
</hostdev>

$ cat net2.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
             <address domain='0x0000' bus='0xbd' slot='0x01' function='0x1'/>
     </source>
</hostdev>

The netperf runs over TCP and UDP both pass.

ubuntu@vm1:~$ netperf -H 192.168.100.10 -t TCP_STREAM -l 60 -- -m 1472
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.10 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384   1472    60.02     506.88
ubuntu@vm1:~$ netperf -H 192.168.100.10 -t UDP_STREAM -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.10 () port 0 AF_INET : demo
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   60.00     2853676      0     560.08
212992           60.00     2853534            560.05

Changed in kunpeng920:
status: In Progress → Incomplete
Ike Panhc (ikepanhc) wrote :

Tried 5.4.0-33.37 but still cannot reproduce. Both the KVM guest and the host run kernel 5.4.0-33.37.

ubuntu@vm1:~$ uname -a;netperf -H 192.168.100.10 -t TCP_STREAM -l 60 -- -m 1472
Linux vm1 5.4.0-33-generic #37-Ubuntu SMP Thu May 21 12:55:12 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.10 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384   1472    60.04     498.85

$ uname -a;dmesg | tail -20
Linux kreiken 5.4.0-33-generic #37-Ubuntu SMP Thu May 21 12:55:12 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
[ 761.715382] br00: port 1(enp189s0f0v0) entered disabled state
[ 761.715648] device enp189s0f0v0 entered promiscuous mode
[ 761.728784] br00: port 2(enp189s0f0v1) entered blocking state
[ 761.728787] br00: port 2(enp189s0f0v1) entered disabled state
[ 761.728943] device enp189s0f0v1 entered promiscuous mode
[ 761.742500] br00: port 3(enp189s0f0) entered blocking state
[ 761.742503] br00: port 3(enp189s0f0) entered disabled state
[ 761.742781] device enp189s0f0 entered promiscuous mode
[ 761.742936] hns3 0000:bd:00.0 enp189s0f0: disable vlan filter
[ 761.743034] br00: port 3(enp189s0f0) entered blocking state
[ 761.743036] br00: port 3(enp189s0f0) entered forwarding state
[ 761.864861] VFIO - User Level meta-driver version: 0.3
[ 761.898013] device enp189s0f0v0 left promiscuous mode
[ 761.898044] br00: port 1(enp189s0f0v0) entered disabled state
[ 762.240434] audit: type=1400 audit(1606889769.899:43): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-ee75f599-dbef-4703-b13b-6f3c0bab2c66" pid=3834 comm="apparmor_parser"
[ 762.311350] vfio-pci 0000:bd:01.0: enabling device (0000 -> 0002)
[ 762.608755] device enp189s0f0v1 left promiscuous mode
[ 762.608768] br00: port 2(enp189s0f0v1) entered disabled state
[ 762.955569] audit: type=1400 audit(1606889770.615:44): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-6733ce35-cf1e-4497-b8d3-6816cc66405d" pid=3848 comm="apparmor_parser"
[ 763.019507] vfio-pci 0000:bd:01.1: enabling device (0000 -> 0002)

Ike Panhc (ikepanhc) wrote :

All patches have landed in the 20.04.2 HWE kernel.

Changed in kunpeng920:
status: Incomplete → Fix Released