Call traces occur when unregistering a device while its channels are still referenced

Bug #1982456 reported by Fred Kimmy
This bug affects 1 person
Affects            Status      Importance   Assigned to   Milestone
kunpeng920         Won't Fix   Undecided    Unassigned
Ubuntu-20.04-hwe   Won't Fix   Undecided    Unassigned

Bug Description

Summary: dangling pointers are left behind when a device is unregistered while its channels are still referenced
Further information:
kunpeng920 bug reporting guidelines:
Please use the following bug template:

[Bug Description]
Currently, if dma_async_device_unregister is invoked while some clients
still hold references to some channels, the device is prevented from being
released. This leaves dangling pointers in dma_device_list and causes crashes
in functions that try to use it.

[Steps to Reproduce]
1) insmod async_tx.ko and hisi_dma.ko.
2) unbind the DMA devices that are bound to hisi_dma.
3) bind the DMA devices to hisi_dma again.
4) repeat 2) and 3) several times.
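
Steps 2) to 4) can be scripted roughly as follows. This is a minimal sketch, not the reporter's actual procedure: the function name rebind_cycle and the SYSFS_ROOT variable are hypothetical, and it assumes the devices are bound through the standard PCI driver bind/unbind files in sysfs. SYSFS_ROOT exists only so the loop can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Assumption: SYSFS_ROOT is a hypothetical override; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# rebind_cycle DRIVER PCI_ADDR COUNT
# Unbind and re-bind one PCI device COUNT times (steps 2 to 4).
rebind_cycle() {
    driver=$1
    addr=$2
    count=$3
    drv_dir="$SYSFS_ROOT/bus/pci/drivers/$driver"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Step 2: detach the device from the driver.
        printf '%s' "$addr" > "$drv_dir/unbind" || return 1
        # Step 3: attach the device to the driver again.
        printf '%s' "$addr" > "$drv_dir/bind" || return 1
        i=$((i + 1))
    done
    echo "completed $count unbind/bind cycles for $addr"
}
```

On an affected machine this would be run as root, for example `rebind_cycle hisi_dma 0000:7b:00.0 10`, where the PCI address is only an example and has to be replaced with the address of the local DMA device.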

[Actual Results]
1) the refcount of hisi_dma is not zero after step 1).
2) After all DMA devices are unbound, the refcount of hisi_dma is not zero.
3) On the first unbind, a warning reports that __dma_async_device_channel_unregister was called while some clients still hold references.
4) after unbinding the device, the following call trace may be reported:
[ 1594.902108] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 1594.910871] Mem abort info:
[ 1594.913653] ESR = 0x96000004
[ 1594.916713] EC = 0x25: DABT (current EL), IL = 32 bits
[ 1594.922019] SET = 0, FnV = 0
[ 1594.925069] EA = 0, S1PTW = 0
[ 1594.928207] Data abort info:
[ 1594.931084] ISV = 0, ISS = 0x00000004
[ 1594.934916] CM = 0, WnR = 0
[ 1594.937870] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020e8b2a000
[ 1594.944294] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 1594.951077] Internal error: Oops: 96000004 [#1] SMP
[ 1595.098816] pstate: a0400009 (NzCv daif +PAN -UAO -TCO BTYPE=--)
[ 1595.104797] pc : dma_channel_rebalance+0xf8/0x338
[ 1595.109491] lr : dma_channel_rebalance+0xa0/0x338
[ 1595.114173] sp : ffff8000b759bb00
[ 1595.117472] x29: ffff8000b759bb00 x28: ffff002085903d00
[ 1595.122761] x27: 0000000000000000 x26: 0000000000000000
[ 1595.128048] x25: 0000000000000000 x24: ffffa8dfb33bd3b8
[ 1595.133337] x23: 000000000000000f x22: ffffa8dfb312e440
[ 1595.138624] x21: 0000000000000010 x20: ffffa8dfb312e660
[ 1595.143913] x19: ffffa8dfb3593e80 x18: 0000000000000000
[ 1595.149201] x17: 0000000000000000 x16: ffffa8dfb1ec19e8
[ 1595.154488] x15: 0000000000000040 x14: ffffa8dfb34210a0
[ 1595.159777] x13: 0000000000000228 x12: 0000000000000000
[ 1595.165064] x11: 0000000000000000 x10: 0000000000000000
[ 1595.170352] x9 : ffffa8dfb1b48f58 x8 : ffff0020e81336e0
[ 1595.175640] x7 : 0000000000000000 x6 : 0000000000000003
[ 1595.180929] x5 : 0000000000000000 x4 : 0000000000000000
[ 1595.186218] x3 : ffff20200f9180a0 x2 : ffff20200f918090
[ 1595.191505] x1 : 0000000000000000 x0 : ffffffffffffffc8
[ 1595.196794] Call trace:
[ 1595.199230] dma_channel_rebalance+0xf8/0x338
[ 1595.203566] dma_async_device_unregister+0x90/0x148
[ 1595.208424] dmam_device_release+0x1c/0x28
[ 1595.212502] release_nodes+0x1c0/0x240
[ 1595.216243] devres_release_all+0x68/0x2c0
[ 1595.220322] device_release_driver_internal+0x138/0x1e8
[ 1595.225524] device_driver_detach+0x20/0x30
[ 1595.229689] unbind_store+0xe8/0x110
[ 1595.233247] drv_attr_store+0x2c/0x40
[ 1595.236893] sysfs_kf_write+0x4c/0x60
[ 1595.240544] kernfs_fop_write_iter+0x130/0x1c0
[ 1595.244969] new_sync_write+0xf0/0x198
[ 1595.248706] vfs_write+0x1ec/0x2c0
[ 1595.252094] ksys_write+0x74/0x108
[ 1595.255482] __arm64_sys_write+0x24/0x30
[ 1595.259387] el0_svc_common.constprop.0+0x84/0x218
[ 1595.264163] do_el0_svc+0x2c/0x98
[ 1595.267464] el0_svc+0x20/0x30
[ 1595.270512] el0_sync_handler+0xb0/0xb8
[ 1595.274331] el0_sync+0x184/0x1c0
[ 1595.277634] Code: eb00007f d100e000 54fffec0 d503201f (f9401c01)
[ 1595.283701] ---[ end trace 2476858dc1c23bcf ]---

5) A similar call trace in dma_channel_rebalance may be reported
 from dma_async_device_register during binding.

6) Result 4) or 5) always occurs during step 4).

[Expected Results]
1. The refcounts of hisi_dma and of the DMA channels decrease correctly when devices are unbound, even if some clients still hold references to DMA channels.
2. After all DMA devices are unbound, the refcount of hisi_dma is zero.
3. There are no dangling pointers in dma_device_list.
4. The call trace in step 3) never occurs.
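
The module reference count mentioned in these expectations can be read from sysfs. The sketch below is illustrative only: module_refcnt, check_released and SYSFS_ROOT are hypothetical names, and it assumes the kernel exposes /sys/module/<name>/refcnt, which is only present for loaded modules on kernels built with module unloading support.

```shell
#!/bin/sh
# Assumption: SYSFS_ROOT is a hypothetical override; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# module_refcnt MODULE
# Print the module's reference count, or "unavailable" if sysfs has no refcnt file.
module_refcnt() {
    refcnt_file="$SYSFS_ROOT/module/$1/refcnt"
    if [ -r "$refcnt_file" ]; then
        cat "$refcnt_file"
    else
        echo "unavailable"
    fi
}

# check_released MODULE
# Succeed only when the module's reference count is exactly 0.
check_released() {
    count=$(module_refcnt "$1")
    echo "$1: refcount is $count"
    [ "$count" = "0" ]
}
```

After unbinding every DMA device, `check_released hisi_dma` should succeed on a fixed kernel; a non-zero count at that point matches the failure described in this report.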

[Reproducibility]
The problem reproduces every time the steps above are followed.

[Additional information]
(Firmware version, kernel version, affected hardware, etc. if required022041209162):
kernel version: Linux tx 5.11.0-27-generic #29~20.04.1-Ubuntu SMP Wed Aug 11 15:58:08 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

[Resolution]

description: updated
Ike Panhc (ikepanhc) wrote :

Setting to Incomplete while we wait for a fix.

Changed in kunpeng920:
status: New → Incomplete
Ike Panhc (ikepanhc) wrote :

In bug 1936771, CONFIG_HISI_DMA was disabled because of a problem with soft RAID 5. I believe we should solve that problem first.

Ike Panhc (ikepanhc) wrote :

I applied all patches for hisi_dma.c to the Ubuntu 5.15.0-53.59 kernel, but the system still crashes when RAID 5 is set up. For now I do not see any reason to re-enable CONFIG_HISI_DMA.

The patches applied are:

fa8e8c4e6892 <email address hidden> 2022-10-24 07:12:44 +0000 dmaengine: hisilicon: Dump regs to debugfs
ece07f953395 <email address hidden> 2022-10-24 07:12:34 +0000 dmaengine: hisilicon: Adapt DMA driver to HiSilicon IP09
c64bc7326a68 <email address hidden> 2022-10-24 07:12:22 +0000 dmaengine: hisilicon: Use macros instead of magic number
a0b1e6bb9569 <email address hidden> 2022-10-24 07:12:03 +0000 dmaengine: hisilicon: Add multi-thread support for a DMA channel
ca8c98693ff9 <email address hidden> 2022-10-24 07:11:51 +0000 dmaengine: hisilicon: Fix CQ head update
d5a2fdcd7c69 <email address hidden> 2022-10-24 07:11:40 +0000 dmaengine: hisilicon: Disable channels when unregister hisi_dma
d4b1e65bb7a7 <email address hidden> 2022-10-24 07:11:24 +0000 dmaengine: hisi_dma: switch from 'pci_' to 'dma_' API

ubuntu@saenger:~$ sudo mdadm -Cv -l5 -n3 /dev/md0 /dev/sdb1 /dev/sdb2 /dev/sdb3
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: size set to 104791040K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
ubuntu@saenger:~$ [ 317.762217] hisi_dma 0000:7b:00.0: dma_sync_wait: timeout!
[ 317.767703] Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
[ 317.776042] CPU: 86 PID: 2515 Comm: md0_raid5 Not tainted 5.15.0-53-generic #59~20.04.1+hisidma.1
[ 317.784901] Hardware name: Huawei XA320 V2 /BC82HPNB, BIOS 0.95 08/15/2019
[ 317.791762] Call trace:
[ 317.794195] dump_backtrace+0x0/0x200
[ 317.797850] show_stack+0x20/0x30
[ 317.801152] dump_stack_lvl+0x68/0x84
[ 317.804804] dump_stack+0x18/0x34
[ 317.808106] panic+0x18c/0x39c
[ 317.811150] async_tx_submit+0x0/0x610 [async_tx]
[ 317.815846] async_trigger_callback+0x94/0x15c [async_tx]
[ 317.821232] raid_run_ops+0x960/0x1288 [raid456]
[ 317.825851] handle_stripe+0x79c/0x1218 [raid456]
[ 317.830546] handle_active_stripes.isra.0+0x3f8/0x5f8 [raid456]
[ 317.836455] raid5d+0x378/0x6e0 [raid456]
[ 317.840455] md_thread+0xc8/0x1a8
[ 317.843760] kthread+0x114/0x120
[ 317.846978] ret_from_fork+0x10/0x20
[ 317.850544] SMP: stopping secondary CPUs
[ 317.854476] Kernel Offset: 0x50000 from 0xffff800008000000
[ 317.859948] PHYS_OFFSET: 0x0
[ 317.862815] CPU features: 0x00000441,a3202c40
[ 317.867159] Memory Limit: none
[ 318.031605] ---[ end Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction ]---

Fred Kimmy (kongzizaixian) wrote :

Can you review this link:
https://<email address hidden>/

Ike Panhc (ikepanhc) wrote :

Hi Xinwei,

The patch in comment #4 does not fix the issue in bug 1936771, which crashes the system when soft RAID 5 is in use.

Until we have a fix for bug 1936771, it is pointless to apply any patch to hisi_dma.c, because CONFIG_HISI_DMA is not set.
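
Whether CONFIG_HISI_DMA is set on a given kernel can be checked against its config file. This is a small sketch with a hypothetical helper name (kconfig_state); it assumes the usual kernel config format, in which a disabled option appears as a "# CONFIG_... is not set" comment, and that Ubuntu ships the config as /boot/config-$(uname -r).

```shell
#!/bin/sh
# kconfig_state OPTION CONFIG_FILE
# Print the option's value (for example "y" or "m"), or "not set" if it is absent or disabled.
kconfig_state() {
    option=$1
    file=$2
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    if grep -q "^$option=" "$file"; then
        # Enabled options are stored as OPTION=value.
        grep "^$option=" "$file" | head -n 1 | cut -d= -f2
    else
        # Disabled options only appear as a "# OPTION is not set" comment, or not at all.
        echo "not set"
    fi
}
```

For example, `kconfig_state CONFIG_HISI_DMA "/boot/config-$(uname -r)"` prints the state for the running kernel.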

Ike Panhc (ikepanhc) wrote :

As discussed, patches for hisi_dma are irrelevant because CONFIG_HISI_DMA is disabled.

I am going to close this issue; we can re-open it once we have a proper fix for bug 1936771.

Changed in kunpeng920:
status: Incomplete → Won't Fix