[hns-1126]net: hns3: fix __QUEUE_STATE_STACK_XOFF not cleared issue

Bug #1853939 reported by Fred Kimmy on 2019-11-26
This bug affects 1 person
Affects            Importance  Assigned to
kunpeng920         Undecided   Unassigned
Ubuntu-18.04       Undecided   Unassigned
Ubuntu-18.04-hwe   Undecided   Unassigned
Ubuntu-19.04       Undecided   Unassigned
Ubuntu-19.10       Undecided   Unassigned
Upstream-kernel    Undecided   Unassigned

Bug Description

"[Bug Description]
When the MTU is changed, or during other operations that only call
.reset_notify to do HNAE3_DOWN_CLIENT and HNAE3_UP_CLIENT, the
netdev_tx_reset_queue() call in hns3_clear_all_ring() is skipped.
As a result, dev_watchdog() may misdiagnose a TX timeout.
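
To make the failure mode concrete, here is a minimal user-space sketch in Python of the check that dev_watchdog() applies to each TX queue: a queue is reported as timed out when it is stopped and its last transmit is older than the watchdog timeout. The class TxQueue, the helper names, and the tick values are hypothetical and exist only for illustration. Only the two flag names and the role of netdev_tx_reset_queue() come from the kernel.

```python
from dataclasses import dataclass

# Toy stand-ins for the kernel's per-queue state bits (not kernel code).
DRV_XOFF = 1 << 0    # queue stopped by the driver
STACK_XOFF = 1 << 1  # queue stopped by the stack (__QUEUE_STATE_STACK_XOFF)
ANY_XOFF = DRV_XOFF | STACK_XOFF


@dataclass
class TxQueue:
    state: int = 0
    trans_start: int = 0  # tick of the last transmit on this queue


def tx_reset_queue(q: TxQueue) -> None:
    """Model of netdev_tx_reset_queue(): clear the stack-side stop flag."""
    q.state &= ~STACK_XOFF


def watchdog_fires(q: TxQueue, now: int, timeout: int) -> bool:
    """Model of the dev_watchdog() test: stopped and idle for too long."""
    stopped = bool(q.state & ANY_XOFF)
    return stopped and now > q.trans_start + timeout


q = TxQueue()
q.state |= STACK_XOFF  # the stack stopped the queue while traffic was running
q.trans_start = 100
# A down/up cycle happens here without tx_reset_queue(): the flag survives.
print(watchdog_fires(q, now=106, timeout=5))  # True: spurious TX timeout
tx_reset_queue(q)
print(watchdog_fires(q, now=106, timeout=5))  # False: queue no longer stopped
```

In this model the hardware ring is idle and healthy, yet the stale flag alone is enough to trigger the timeout, which is consistent with the log below showing SW_NTU, SW_NTC, HW_HEAD and HW_TAIL all at 0x0.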

[Steps to Reproduce]
1. Load the PF and VF drivers.
2. Run iperf.
3. Change the MTU.

[Actual Results]
A TX timeout is reported after the MTU is changed.

[ 5613.947272] hns3 0000:bd:00.2 eth6: already using mac address 34:f1:27:e3:4b:2a
[ 5629.938362] hns3 0000:bd:00.2 eth6: link down
[ 5634.034364] hns3 0000:bd:00.2 eth6: link up
[ 5650.720675] hns3 0000:bd:00.2 eth6: link down
[ 5650.740080] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[ 5651.762529] hns3 0000:bd:00.2 eth6: link up
[ 5652.790522] hns3 0000:bd:00.2 eth6: link down
[ 5655.868111] hns3 0000:bd:00.2 eth6: link up
[ 5655.876462] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[ 5661.334349] ------------[ cut here ]------------
[ 5661.343550] NETDEV WATCHDOG: eth6 (hns3): transmit queue 47 timed out
[ 5661.356405] WARNING: CPU: 20 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2a8/0x2b0
[ 5661.373048] Modules linked in: vfio_iommu_type1(E) vfio_pci(E) vfio_virqfd(E) vfio(E) tun(OE) vxlan(OE) ip6_udp_tunnel(E) udp_tunnel(E) ip_tunnel(E) 8021q(E) garp(E) mrp(E) bonding(OE) hns3_dfx(OE) hns3(OE) hclge(OE) hnae3(OE)
[ 5661.413089] CPU: 20 PID: 0 Comm: swapper/20 Tainted: G W OE 4.19.30-vhulk1903.5.1.h163.eulerosv3r1.aarch64 #2
[ 5661.434930] Hardware name: Huawei Technologies Co., Ltd. EVBCS/EVBCS, BIOS SPC100B011 TB 06/13/2019
[ 5661.452960] pstate: 40400009 (nZcv daif +PAN -UAO)
[ 5661.462504] pc : dev_watchdog+0x2a8/0x2b0
[ 5661.470489] lr : dev_watchdog+0x2a8/0x2b0
[ 5661.478472] sp : ffff8027dfba0aa0
[ 5661.485070] x29: ffff8027dfba0aa0 x28: 0000000000000002
[ 5661.495652] x27: 0000000000000002 x26: ffff0000093060c8
[ 5661.506235] x25: 0000000000000140 x24: 00000000ffffffff
[ 5661.516817] x23: 0000000000000000 x22: ffff80279b3c4480
[ 5661.527400] x21: ffff000009307000 x20: ffff80279b3c4000
[ 5661.537983] x19: 000000000000002f x18: 0000000000000010
[ 5661.548564] x17: 0000000000000000 x16: 0000000000000000
[ 5661.559145] x15: ffff0000895099df x14: 0720072007200720
[ 5661.569728] x13: 0720072007200720 x12: ffff00000930b838
[ 5661.580310] x11: ffff0000086bfe70 x10: 0774072007370734
[ 5661.590892] x9 : 00000000000006bf x8 : 0775077107200774
[ 5661.601474] x7 : 0769076d0773076e x6 : ffff8027dfb91270
[ 5661.612055] x5 : ffff8027dfb91270 x4 : 0000000000000000
[ 5661.622637] x3 : ffff8027dfb99848 x2 : ffff8027dfb91270
[ 5661.633220] x1 : f3b9b3a056e8ac00 x0 : 0000000000000000
[ 5661.643803] Call trace:
[ 5661.648669] dev_watchdog+0x2a8/0x2b0
[ 5661.655961] call_timer_fn+0x34/0x178
[ 5661.663252] expire_timers+0xec/0x158
[ 5661.670542] run_timer_softirq+0xc0/0x1f8
[ 5661.678525] __do_softirq+0x11c/0x31c
[ 5661.685816] irq_exit+0x104/0x138
[ 5661.692414] __handle_domain_irq+0x6c/0xb8
[ 5661.700571] gic_handle_irq+0xe4/0x1c8
[ 5661.708035] el1_irq+0xf0/0x1c0
[ 5661.714286] arch_cpu_idle+0x34/0x1c0
[ 5661.721578] default_idle_call+0x24/0x44
[ 5661.729389] do_idle+0x1ec/0x2d0
[ 5661.735814] cpu_startup_entry+0x2c/0x30
[ 5661.743625] secondary_start_kernel+0x1bc/0x248
[ 5661.752647] ---[ end trace 6e5b9286c0279339 ]---
[ 5661.761848] hns3 0000:bd:00.2 eth6: tx_timeout count: 1, queue id: 47, SW_NTU: 0x0, SW_NTC: 0x0, napi state: 16
[ 5661.781958] hns3 0000:bd:00.2 eth6: tx_pkts: 230943, tx_bytes: 349641994, io_err_cnt: 0, sw_err_cnt: 0
[ 5661.800511] hns3 0000:bd:00.2 eth6: seg_pkt_cnt: 0, tx_err_cnt: 0, restart_queue: 0, tx_busy: 0
[ 5661.818678] hns3 0000:bd:00.2 eth6: tx_pause_cnt: 0, rx_pause_cnt: 0
[ 5661.831340] hns3 0000:bd:00.2 eth6: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
[ 5661.849200] hns3 0000:bd:00.2 eth6: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
[ 5667.478369] hns3 0000:bd:00.2 eth6: tx_timeout count: 2, queue id: 47, SW_NTU: 0x0, SW_NTC: 0x0, napi state: 16
[ 5667.498485] hns3 0000:bd:00.2 eth6: tx_pkts: 230943, tx_bytes: 349641994, io_err_cnt: 0, sw_err_cnt: 0
[ 5667.517041] hns3 0000:bd:00.2 eth6: seg_pkt_cnt: 0, tx_err_cnt: 0, restart_queue: 0, tx_busy: 0
[ 5667.535200] hns3 0000:bd:00.2 eth6: tx_pause_cnt: 0, rx_pause_cnt: 0
[ 5667.547865] hns3 0000:bd:00.2 eth6: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
[ 5667.565726] hns3 0000:bd:00.2 eth6: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
[ 5667.586359] hns3 0000:bd:00.2: received reset event , reset type is 5
[ 5667.599917] hns3 0000:bd:00.2: PF Reset requested
[ 5667.619979] hns3 0000:bd:00.2 eth6: link down
[ 5667.846349] hns3 0000:bd:00.2: prepare wait ok
[ 5668.045010] hns3 0000:bd:00.2: The firmware version is 01080100

[Expected Results]
No TX timeout occurs after the MTU is changed.

[Reproducibility]
Always

[Additional information]
Hardware: D06
Firmware: NA
Kernel: NA

[Resolution]
This patch separates netdev_tx_reset_queue() from
hns3_clear_all_ring(), and unifies hns3_clear_all_ring() and
hns3_force_clear_all_ring() into one function, since they do
similar things."
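
The restructuring described in the resolution can be sketched as a toy Python model. This is not the driver code: the Queue class, the function names, and the assumption that the queue-state reset is invoked from the down path are all hypothetical. The sketch only illustrates the idea that ring cleanup may be deferred while a reset is in progress, whereas clearing the stop flag happens unconditionally. The differences between the normal and forced cleanup variants are not modelled.

```python
from dataclasses import dataclass, field

STACK_XOFF = 1 << 1  # toy stand-in for __QUEUE_STATE_STACK_XOFF


@dataclass
class Queue:
    state: int = 0
    pending: list = field(default_factory=list)  # buffers not yet completed


def clear_all_ring(queues):
    """Single ring cleanup routine; it no longer touches queue state."""
    for q in queues:
        q.pending.clear()


def reset_tx_queues(queues):
    """Separate step, modelling netdev_tx_reset_queue() on every queue."""
    for q in queues:
        q.state &= ~STACK_XOFF


def net_stop(queues, resetting):
    """Down path: ring cleanup may be deferred, the flag reset is not."""
    if not resetting:
        clear_all_ring(queues)
    reset_tx_queues(queues)


queues = [Queue(state=STACK_XOFF, pending=["skb"]) for _ in range(2)]
net_stop(queues, resetting=True)
print([q.state for q in queues])         # [0, 0]: stop flags cleared
print([len(q.pending) for q in queues])  # [1, 1]: buffers left for later cleanup
```

Keeping the two steps apart means a down/up cycle that skips ring cleanup still leaves every TX queue in a state the watchdog treats as running.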

Ike Panhc (ikepanhc) wrote :

Is this patch the fix?

commit f96315f2f17e7b2580d2fec7c4d6a706a131d904
Author: Huazhong Tan <email address hidden>
Date: Fri Jun 28 19:50:07 2019 +0800

    net: hns3: fix __QUEUE_STATE_STACK_XOFF not cleared issue

    When change MTU or other operations, which just calling .reset_notify
    to do HNAE3_DOWN_CLIENT and HNAE3_UP_CLIENT, then
    the netdev_tx_reset_queue() in the hns3_clear_all_ring() will be
    ignored. So the dev_watchdog() may misdiagnose a TX timeout.

    This patch separates netdev_tx_reset_queue() from
    hns3_clear_all_ring(), and unifies hns3_clear_all_ring() and
    hns3_force_clear_all_ring into one, since they are doing
    similar things.

    Fixes: 3a30964a2eef ("net: hns3: delay ring buffer clearing during reset")
    Signed-off-by: Huazhong Tan <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Changed in kunpeng920:
status: New → Incomplete

dann frazier (dannf) wrote :

The patch Ike found matches the bug title, so I assume it is the correct fix.
This patch, and the patch it fixes, were both introduced in v5.3, so marking older kernels "Invalid".

no longer affects: kunpeng920/ubuntu-20.04
no longer affects: kunpeng920/ubuntu-18.10
Changed in kunpeng920:
status: Incomplete → Fix Committed
Changed in kunpeng920:
status: Fix Committed → Fix Released