[hns-1126] net: hns3: fix race conditions between reset and module loading & unloading

Bug #1853932 reported by Fred Kimmy
This bug affects 1 person
Affects           Status        Importance  Assigned to  Milestone
kunpeng920        Fix Released  Undecided   Unassigned
Ubuntu-18.04      Won't Fix     Undecided   Unassigned
Ubuntu-18.04-hwe  Fix Released  Undecided   Unassigned
Ubuntu-19.04      Won't Fix     Undecided   Unassigned
Ubuntu-19.10      Fix Released  Undecided   Unassigned
Upstream-kernel   Fix Released  Undecided   Unassigned

Bug Description

[Bug Description]
When a reset and a driver unload happen at the same time, problems occur,
such as a NULL pointer panic or a hardware error.

[Steps to Reproduce]
1. Load the PF and VF drivers.
2. Run iperf while repeatedly resetting, binding, and unbinding the device.

[Actual Results]
panic or hardware error.
[ 4392.974255] hns3 0000:7d:00.0: fail to instantiate client, ret = -16
[ 4392.976412] hns3 0000:7d:00.2: Reset done, hclge driver initialization finished.
[ 4392.980600] hns3 0000:7d:00.0: match and instantiation failed for port, ret = -16
[ 4392.995450] hns3 0000:7d:00.1: Device is busy in resetting state.
[ 4392.995450] please retry later.
[ 4393.004658] hns3 0000:7d:00.1: fail to instantiate client, ret = -16
[ 4393.011005] hns3 0000:7d:00.1: match and instantiation failed for port, ret = -16
[ 4393.018477] hns3 0000:7d:00.2: Device is busy in resetting state.
[ 4393.018477] please retry later.
[ 4393.019768] hns3 0000:7d:00.2: In reset process RoCE client reinit.
[ 4393.027686] hns3 0000:7d:00.2: fail to instantiate client, ret = -16
[ 4393.027689] hns3 0000:7d:00.2: match and instantiation failed for port, ret = -16
[ 4393.033972] Unable to handle kernel NULL pointer dereference at virtual address 000000000000043e
[ 4393.040285] hns3 0000:7d:00.3: Device is busy in resetting state.
[ 4393.040285] please retry later.
[ 4393.040286] hns3 0000:7d:00.3: fail to instantiate client, ret = -16
[ 4393.040287] hns3 0000:7d:00.3: match and instantiation failed for port, ret = -16
[ 4393.079540] Mem abort info:
[ 4393.082322] ESR = 0x96000004
[ 4393.085366] Exception class = DABT (current EL), IL = 32 bits
[ 4393.091274] SET = 0, FnV = 0
[ 4393.094317] EA = 0, S1PTW = 0
[ 4393.097447] Data abort info:
[ 4393.100314] ISV = 0, ISS = 0x00000004
[ 4393.104137] CM = 0, WnR = 0
[ 4393.107095] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000c87f115d
[ 4393.113698] [000000000000043e] pgd=0000000000000000
[ 4393.118566] Internal error: Oops: 96000004 [#1] SMP
[ 4393.123432] CPU: 11 PID: 30404 Comm: kworker/11:0 Tainted: G W OE 4.19.30-vhulk1903.5.1.h163.eulerosv3r1.aarch64 #2
[ 4393.134892] Hardware name: Huawei Technologies Co., Ltd. EVBCS/EVBCS, BIOS CS B078 1P TA 05/25/2019

Message from syslogd@localhost at Feb 15 12:16:06 ...
 kernel:[ 4393.118566] Internal error: Oops: 96000004 [#1] SMP
[ 4393.143928] Workqueue: events hclge_reset_service_task [hclge]
[ 4393.160589] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 4393.165368] pc : hclge_ae_dev_reset_cnt+0x3c/0x60 [hclge]
[ 4393.170753] lr : 0xffff000000c38f7c
[ 4393.174227] sp : ffff8023801ebc40
[ 4393.177527] x29: ffff8023801ebc40 x28: 0000000000000000
[ 4393.182825] x27: 0000000000000000 x26: ffff8023ac29a698
[ 4393.188123] x25: 0000000000000001 x24: ffff80232a2f8500
[ 4393.193421] x23: 0000000000000041 x22: ffff802356a856ac
[ 4393.198719] x21: ffff8023ac29a5b0 x20: 0000000000000041
[ 4393.204016] x19: ffff802356a84000 x18: 0000000000000010
[ 4393.209314] x17: 000000002edd842f x16: 000000000a36b2a4
[ 4393.214611] x15: ffff0000895099df x14: 0000000000000004
[ 4393.219909] x13: ffff0000095099ed x12: ffff00000930b838
[ 4393.225207] x11: ffff8023801ebc40 x10: ffff8023801ebc40
[ 4393.230505] x9 : 00000000ffffffd8 x8 : fffffffffffffffe
[ 4393.235802] x7 : 0000000000000004 x6 : 0000000000000000
[ 4393.241100] x5 : 0000000000000004 x4 : ffff802356a8400f
[ 4393.246397] x3 : ffff0a00ffffff04 x2 : 584dcc94a82bab00
[ 4393.251695] x1 : ffff000000ae0a48 x0 : 0000000000000036
[ 4393.256993] Process kworker/11:0 (pid: 30404, stack limit = 0x000000004925a8db)
[ 4393.264286] Call trace:
[ 4393.266722] hclge_ae_dev_reset_cnt+0x3c/0x60 [hclge]
[ 4393.271758] 0xffff000000c430bc
[ 4393.274887] hclge_notify_roce_client+0x7c/0xe8 [hclge]
[ 4393.280099] hclge_reset+0x78c/0x9d8 [hclge]
[ 4393.284356] hclge_reset_service_task+0x124/0x2f8 [hclge]
[ 4393.289743] process_one_work+0x1b4/0x3f8
[ 4393.293738] worker_thread+0x54/0x470
[ 4393.297386] kthread+0x134/0x138
[ 4393.300601] ret_from_fork+0x10/0x18
[ 4393.304163] Code: d1120273 f9423e60 f9400bf3 a8c27bfd (b9440800)
[ 4393.310242] Modules linked in: hns_roce(OE) hns3_dfx(OE) rdma_ucm(E) rdma_cm(E) ib_cm(E) iw_cm(E) ib_uverbs(E) ib_core(OE) hns3(OE) hclge(OE) hnae3(OE) mem_drv(OE) [last unloaded: hns_roce_pci]
[ 4393.327442] ---[ end trace 525e14504e091414 ]---
[ 4393.332045] Kernel panic - not syncing: Fatal exception
[ 4393.337256] kernel fault(0x5) notification starting on CPU 11
[ 4393.342987] kernel fault(0x5) notification finished on CPU 11
[ 4393.348718] SMP: stopping secondary CPUs
[ 4393.352635] Kernel Offset: disabled
[ 4393.356110] CPU features: 0x2,a2a00a38
[ 4393.359844] Memory Limit: none
[ 4393.362896] kernel reboot(0x2) notification starting on CPU 11
[ 4393.368714] kernel reboot(0x2) notification finished on CPU 11
[ 4393.374533] ---[ end Kernel panic - not syncing: Fatal exception ]---

[Expected Results]
Reset, bind, and unbind complete without a panic or hardware error.

[Reproducibility]
Inevitably

[Additional information]
Hardware: D06
Firmware: NA
Kernel: NA

[Resolution]
The fix adds flags that indicate whether each client is registered, stops
scheduling the reset task while the driver is unloading, and fixes several
related bugs.

net: hns3: fix race conditions between reset and module loading & unloading
net: hns3: fix a memory leak issue for hclge_map_unmap_ring_to_vf_vector
net: hns3: adjust hns3_uninit_phy()'s location in the hns3_client_uninit()
net: hns3: stop schedule reset service while unloading driver
net: hns3: add handshake with hardware while doing reset
net: hns3: use HCLGEVF_STATE_NIC_REGISTERED to indicate VF NIC client has registered
net: hns3: use HCLGE_STATE_ROCE_REGISTERED to indicate PF ROCE client has registered
net: hns3: use HCLGE_STATE_NIC_REGISTERED to indicate PF NIC client has registered
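
The two ideas in the resolution (a "registered" flag checked before the reset path touches a client, and a check that refuses to schedule a reset once unloading has begun) can be illustrated with a small userspace sketch. This is not the driver code: every name below (demo_dev, DEMO_STATE_*, demo_notify_roce, and so on) is hypothetical and only loosely modelled on the flag names in the commit titles, and C11 atomics stand in for the kernel's bit operations. The sketch shows only the flag checks; it makes no claim about the additional synchronization the real driver needs to close the race completely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state bits; the real driver uses its own flag definitions. */
enum demo_state_bit {
    DEMO_STATE_NIC_REGISTERED  = 1u << 0,
    DEMO_STATE_ROCE_REGISTERED = 1u << 1,
    DEMO_STATE_REMOVING        = 1u << 2,
};

struct demo_client {
    int notify_count;               /* how often the reset path notified us */
};

struct demo_dev {
    atomic_uint state;              /* bitmask of demo_state_bit values */
    struct demo_client *roce_client;
    int reset_scheduled;            /* number of accepted reset requests */
};

static bool demo_test_bit(struct demo_dev *dev, unsigned int bit)
{
    return (atomic_load(&dev->state) & bit) != 0;
}

void demo_init(struct demo_dev *dev)
{
    assert(dev != NULL);
    atomic_init(&dev->state, 0u);
    dev->roce_client = NULL;
    dev->reset_scheduled = 0;
}

/* Publish the client pointer first, then set the flag that makes it visible. */
void demo_register_roce(struct demo_dev *dev, struct demo_client *client)
{
    assert(dev != NULL && client != NULL);
    dev->roce_client = client;
    atomic_fetch_or(&dev->state, (unsigned int)DEMO_STATE_ROCE_REGISTERED);
}

/* Clear the flag first, then drop the pointer. */
void demo_unregister_roce(struct demo_dev *dev)
{
    atomic_fetch_and(&dev->state, ~(unsigned int)DEMO_STATE_ROCE_REGISTERED);
    dev->roce_client = NULL;
}

/* Reset path: skip the client unless it is marked as registered.
 * Returns 1 if the client was notified, 0 if the notification was skipped. */
int demo_notify_roce(struct demo_dev *dev)
{
    if (!demo_test_bit(dev, DEMO_STATE_ROCE_REGISTERED))
        return 0;
    dev->roce_client->notify_count++;
    return 1;
}

/* Mark the device as being unloaded. */
void demo_begin_remove(struct demo_dev *dev)
{
    atomic_fetch_or(&dev->state, (unsigned int)DEMO_STATE_REMOVING);
}

/* Refuse new reset work once unloading has started. */
bool demo_schedule_reset(struct demo_dev *dev)
{
    if (demo_test_bit(dev, DEMO_STATE_REMOVING))
        return false;
    dev->reset_scheduled++;
    return true;
}
```

In this sketch, calling demo_notify_roce() before registration or after demo_unregister_roce() skips the client instead of dereferencing a NULL pointer, and demo_schedule_reset() returns false after demo_begin_remove().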

dann frazier (dannf)
description: updated
dann frazier (dannf) wrote :

Each of these commits was introduced upstream in v5.3. v5.3 will be the new HWE base kernel for 18.04.4.

Note that the current SRU cycle is targeted for 18.04.4:
  https://lists.ubuntu.com/archives/kernel-sru-announce/2019-October/000158.html

The "last-commit" date for this cycle was 11-Nov. Since 18.04.4 will switch the HWE kernel from 5.0 to 5.3, backporting these changes to the 5.0 branch would be of no benefit to Ubuntu LTS. Therefore, marking 19.04 "Won't Fix" and targeting Ubuntu-18.04-hwe to Ubuntu-18.04.4.

Changed in kunpeng920:
status: New → Fix Committed
dann frazier (dannf)
no longer affects: kunpeng920/ubuntu-20.04
dann frazier (dannf) wrote :

Marking "Won't Fix" for Ubuntu-18.04, as the reproduction case sounds more like a stress test than something that would occur on a production system, and the fix is neither simple nor obvious.

summary: - []hns-1126net: hns3: fix race conditions between reset and module
+ [hns-1126] net: hns3: fix race conditions between reset and module
loading & unloading
Changed in kunpeng920:
status: Fix Committed → Fix Released