[hns-1126]net: hns3: fix __QUEUE_STATE_STACK_XOFF not cleared issue

Bug #1853939 reported by Fred Kimmy
This bug affects 1 person
Affects            Status         Importance   Assigned to   Milestone
kunpeng920         Fix Released   Undecided    Unassigned
Ubuntu-18.04       Invalid        Undecided    Unassigned
Ubuntu-18.04-hwe   Fix Released   Undecided    Unassigned
Ubuntu-19.04       Invalid        Undecided    Unassigned
Ubuntu-19.10       Fix Released   Undecided    Unassigned
Upstream-kernel    Fix Released   Undecided    Unassigned

Bug Description

"[Bug Description]
When the MTU is changed, or during other operations that only call
.reset_notify with HNAE3_DOWN_CLIENT and HNAE3_UP_CLIENT, the
netdev_tx_reset_queue() call in hns3_clear_all_ring() is skipped.
__QUEUE_STATE_STACK_XOFF can then stay set, so dev_watchdog() may
misdiagnose a TX timeout.

[Steps to Reproduce]
1. Load the PF and VF drivers.
2. Run iperf.
3. Change the MTU.

[Actual Results]
A TX timeout is reported after the MTU is changed.

[ 5613.947272] hns3 0000:bd:00.2 eth6: already using mac address 34:f1:27:e3:4b:2a
[ 5629.938362] hns3 0000:bd:00.2 eth6: link down
[ 5634.034364] hns3 0000:bd:00.2 eth6: link up
[ 5650.720675] hns3 0000:bd:00.2 eth6: link down
[ 5650.740080] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[ 5651.762529] hns3 0000:bd:00.2 eth6: link up
[ 5652.790522] hns3 0000:bd:00.2 eth6: link down
[ 5655.868111] hns3 0000:bd:00.2 eth6: link up
[ 5655.876462] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[ 5661.334349] ------------[ cut here ]------------
[ 5661.343550] NETDEV WATCHDOG: eth6 (hns3): transmit queue 47 timed out
[ 5661.356405] WARNING: CPU: 20 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2a8/0x2b0
[ 5661.373048] Modules linked in: vfio_iommu_type1(E) vfio_pci(E) vfio_virqfd(E) vfio(E) tun(OE) vxlan(OE) ip6_udp_tunnel(E) udp_tunnel(E) ip_tunnel(E) 8021q(E) garp(E) mrp(E) bonding(OE) hns3_dfx(OE) hns3(OE) hclge(OE) hnae3(OE)
[ 5661.413089] CPU: 20 PID: 0 Comm: swapper/20 Tainted: G W OE 4.19.30-vhulk1903.5.1.h163.eulerosv3r1.aarch64 #2
[ 5661.434930] Hardware name: Huawei Technologies Co., Ltd. EVBCS/EVBCS, BIOS SPC100B011 TB 06/13/2019
[ 5661.452960] pstate: 40400009 (nZcv daif +PAN -UAO)
[ 5661.462504] pc : dev_watchdog+0x2a8/0x2b0
[ 5661.470489] lr : dev_watchdog+0x2a8/0x2b0
[ 5661.478472] sp : ffff8027dfba0aa0
[ 5661.485070] x29: ffff8027dfba0aa0 x28: 0000000000000002
[ 5661.495652] x27: 0000000000000002 x26: ffff0000093060c8
[ 5661.506235] x25: 0000000000000140 x24: 00000000ffffffff
[ 5661.516817] x23: 0000000000000000 x22: ffff80279b3c4480
[ 5661.527400] x21: ffff000009307000 x20: ffff80279b3c4000
[ 5661.537983] x19: 000000000000002f x18: 0000000000000010
[ 5661.548564] x17: 0000000000000000 x16: 0000000000000000
[ 5661.559145] x15: ffff0000895099df x14: 0720072007200720
[ 5661.569728] x13: 0720072007200720 x12: ffff00000930b838
[ 5661.580310] x11: ffff0000086bfe70 x10: 0774072007370734
[ 5661.590892] x9 : 00000000000006bf x8 : 0775077107200774
[ 5661.601474] x7 : 0769076d0773076e x6 : ffff8027dfb91270
[ 5661.612055] x5 : ffff8027dfb91270 x4 : 0000000000000000
[ 5661.622637] x3 : ffff8027dfb99848 x2 : ffff8027dfb91270
[ 5661.633220] x1 : f3b9b3a056e8ac00 x0 : 0000000000000000
[ 5661.643803] Call trace:
[ 5661.648669] dev_watchdog+0x2a8/0x2b0
[ 5661.655961] call_timer_fn+0x34/0x178
[ 5661.663252] expire_timers+0xec/0x158
[ 5661.670542] run_timer_softirq+0xc0/0x1f8
[ 5661.678525] __do_softirq+0x11c/0x31c
[ 5661.685816] irq_exit+0x104/0x138
[ 5661.692414] __handle_domain_irq+0x6c/0xb8
[ 5661.700571] gic_handle_irq+0xe4/0x1c8
[ 5661.708035] el1_irq+0xf0/0x1c0
[ 5661.714286] arch_cpu_idle+0x34/0x1c0
[ 5661.721578] default_idle_call+0x24/0x44
[ 5661.729389] do_idle+0x1ec/0x2d0
[ 5661.735814] cpu_startup_entry+0x2c/0x30
[ 5661.743625] secondary_start_kernel+0x1bc/0x248
[ 5661.752647] ---[ end trace 6e5b9286c0279339 ]---
[ 5661.761848] hns3 0000:bd:00.2 eth6: tx_timeout count: 1, queue id: 47, SW_NTU: 0x0, SW_NTC: 0x0, napi state: 16
[ 5661.781958] hns3 0000:bd:00.2 eth6: tx_pkts: 230943, tx_bytes: 349641994, io_err_cnt: 0, sw_err_cnt: 0
[ 5661.800511] hns3 0000:bd:00.2 eth6: seg_pkt_cnt: 0, tx_err_cnt: 0, restart_queue: 0, tx_busy: 0
[ 5661.818678] hns3 0000:bd:00.2 eth6: tx_pause_cnt: 0, rx_pause_cnt: 0
[ 5661.831340] hns3 0000:bd:00.2 eth6: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
[ 5661.849200] hns3 0000:bd:00.2 eth6: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
[ 5667.478369] hns3 0000:bd:00.2 eth6: tx_timeout count: 2, queue id: 47, SW_NTU: 0x0, SW_NTC: 0x0, napi state: 16
[ 5667.498485] hns3 0000:bd:00.2 eth6: tx_pkts: 230943, tx_bytes: 349641994, io_err_cnt: 0, sw_err_cnt: 0
[ 5667.517041] hns3 0000:bd:00.2 eth6: seg_pkt_cnt: 0, tx_err_cnt: 0, restart_queue: 0, tx_busy: 0
[ 5667.535200] hns3 0000:bd:00.2 eth6: tx_pause_cnt: 0, rx_pause_cnt: 0
[ 5667.547865] hns3 0000:bd:00.2 eth6: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
[ 5667.565726] hns3 0000:bd:00.2 eth6: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
[ 5667.586359] hns3 0000:bd:00.2: received reset event , reset type is 5
[ 5667.599917] hns3 0000:bd:00.2: PF Reset requested
[ 5667.619979] hns3 0000:bd:00.2 eth6: link down
[ 5667.846349] hns3 0000:bd:00.2: prepare wait ok
[ 5668.045010] hns3 0000:bd:00.2: The firmware version is 01080100

[Expected Results]
No TX timeout is reported after the MTU change.

[Reproducibility]
Always (reproduces every time)

[Additional information]
Hardware: D06
Firmware: NA
Kernel: NA

[Resolution]
This patch separates netdev_tx_reset_queue() from
hns3_clear_all_ring(), and unifies hns3_clear_all_ring() and
hns3_force_clear_all_ring() into one, since they do
similar things."
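
To make the failure mode above easier to follow, here is a small user-space sketch in Python. It is only a toy model, not the hns3 driver code: the classes TxQueue and ToyDriver, the byte limit, and the helper stalled_queues_after_mtu_change() are all hypothetical names invented for illustration. The model assumes that the down/up path drops pending TX work without completing it and, before the fix, never reaches the ring-clearing routine that contains the queue reset.

```python
STACK_XOFF = 0x1  # stand-in for the kernel's __QUEUE_STATE_STACK_XOFF bit


class TxQueue:
    """Toy model of one TX queue with byte-limit accounting (hypothetical)."""

    def __init__(self, limit=3000):
        self.limit = limit      # arbitrary illustrative byte limit
        self.inflight = 0       # bytes the stack believes are still queued
        self.state = 0

    def sent(self, nbytes):
        self.inflight += nbytes
        if self.inflight >= self.limit:
            self.state |= STACK_XOFF    # stack stops feeding this queue

    def reset(self):
        """Plays the role of netdev_tx_reset_queue() in this model."""
        self.inflight = 0
        self.state &= ~STACK_XOFF

    @property
    def stalled(self):
        return bool(self.state & STACK_XOFF)


class ToyDriver:
    """Toy driver; `fixed` selects behaviour before or after the patch."""

    def __init__(self, fixed, nqueues=4):
        self.fixed = fixed
        self.queues = [TxQueue() for _ in range(nqueues)]
        self.rings = [[] for _ in range(nqueues)]

    def xmit(self, qid, nbytes):
        self.rings[qid].append(nbytes)
        self.queues[qid].sent(nbytes)

    def reset_tx_queues(self):
        for q in self.queues:
            q.reset()

    def clear_all_ring(self):
        for ring in self.rings:
            ring.clear()
        if not self.fixed:
            # Before the fix the queue reset lives inside the ring clearing,
            # so it only runs on paths that reach this function.
            self.reset_tx_queues()

    def down_up(self):
        """Models DOWN_CLIENT followed by UP_CLIENT (e.g. an MTU change)."""
        # Assumption: pending descriptors are dropped without completions,
        # and this path does not call clear_all_ring().
        for ring in self.rings:
            ring.clear()
        if self.fixed:
            # After the fix the queue reset is a separate step taken here.
            self.reset_tx_queues()


def stalled_queues_after_mtu_change(fixed):
    drv = ToyDriver(fixed)
    drv.xmit(0, 4000)           # exceed the limit so STACK_XOFF gets set
    drv.down_up()
    return [i for i, q in enumerate(drv.queues) if q.stalled]


if __name__ == "__main__":
    print("before fix:", stalled_queues_after_mtu_change(fixed=False))  # [0]
    print("after fix: ", stalled_queues_after_mtu_change(fixed=True))   # []
```

In the unfixed model, queue 0 keeps its XOFF bit after the down/up cycle even though nothing is left to transmit, which is the condition a watchdog would eventually report as a timeout. In the fixed model the reset is a separate step, so the bit is cleared regardless of whether the rings were cleared on that path.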

Ike Panhc (ikepanhc) wrote :

Is this patch the fix?

commit f96315f2f17e7b2580d2fec7c4d6a706a131d904
Author: Huazhong Tan <email address hidden>
Date: Fri Jun 28 19:50:07 2019 +0800

    net: hns3: fix __QUEUE_STATE_STACK_XOFF not cleared issue

    When change MTU or other operations, which just calling .reset_notify
    to do HNAE3_DOWN_CLIENT and HNAE3_UP_CLIENT, then
    the netdev_tx_reset_queue() in the hns3_clear_all_ring() will be
    ignored. So the dev_watchdog() may misdiagnose a TX timeout.

    This patch separates netdev_tx_reset_queue() from
    hns3_clear_all_ring(), and unifies hns3_clear_all_ring() and
    hns3_force_clear_all_ring into one, since they are doing
    similar things.

    Fixes: 3a30964a2eef ("net: hns3: delay ring buffer clearing during reset")
    Signed-off-by: Huazhong Tan <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Changed in kunpeng920:
status: New → Incomplete
dann frazier (dannf) wrote :

The patch Ike found matches the bug title, so I assume it is the correct fix.
This patch, and the patch it fixes, were both introduced in v5.3, so marking older kernels "Invalid".

no longer affects: kunpeng920/ubuntu-20.04
no longer affects: kunpeng920/ubuntu-18.10
Changed in kunpeng920:
status: Incomplete → Fix Committed
Changed in kunpeng920:
status: Fix Committed → Fix Released