[sas-1126]scsi: hisi_sas: Fix the conflict between device gone and host reset

Bug #1853997 reported by Fred Kimmy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
kunpeng920
Fix Released
Undecided
Unassigned
Ubuntu-18.04
Fix Released
Undecided
Ike Panhc
Ubuntu-18.04-hwe
Fix Released
Undecided
Ike Panhc
Ubuntu-19.04
Fix Released
Undecided
Ike Panhc
Ubuntu-19.10
Fix Released
Undecided
Ike Panhc
Ubuntu-20.04
Fix Released
Undecided
Unassigned
Upstream-kernel
Fix Released
Undecided
Unassigned
linux (Ubuntu)
Fix Released
Undecided
Unassigned
Bionic
Fix Released
Undecided
Ike Panhc
Disco
Fix Released
Undecided
Ike Panhc
Eoan
Fix Released
Undecided
Ike Panhc
Focal
Fix Released
Undecided
Unassigned

Bug Description

[Impact]
Some SAS devices is gone when recovering

[Test Case]
No known test case, stress test on SAS deivce.

[Fix]
e74006edd0d4 scsi: hisi_sas: Fix the conflict between device gone and host reset

[Regression Risk]
Patch restricted to hisi_sas driver.

"[Steps to Reproduce]
1. Close all the PHYS;
2. Inject error;
3. Open one PHY;

[Actual Results]
Some disk will be lost

[Expected Results]
No disk will be lost

[Reproducibility]
occasionally

[Additional information]
Hardware: D06 CS
Firmware: NA+I59
Kernel: NA

[Resolution]
When init device for SAS disks, it will send TMF IO to clear disks. At that
time TMF IO is broken by some operations such as injecting controller reset
from HW RAs event, the TMF IO will be timeout, and at last device will be
gone. Print is as followed:

hisi_sas_v3_hw 0000:74:02.0: dev[240:1] found
...
hisi_sas_v3_hw 0000:74:02.0: controller resetting...
hisi_sas_v3_hw 0000:74:02.0: phyup: phy7 link_rate=10(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy0 link_rate=9(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy1 link_rate=9(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy2 link_rate=9(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy3 link_rate=9(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy6 link_rate=10(sata)
hisi_sas_v3_hw 0000:74:02.0: phyup: phy5 link_rate=11
hisi_sas_v3_hw 0000:74:02.0: phyup: phy4 link_rate=11
hisi_sas_v3_hw 0000:74:02.0: controller reset complete
hisi_sas_v3_hw 0000:74:02.0: abort tmf: TMF task timeout and not done
hisi_sas_v3_hw 0000:74:02.0: dev[240:1] is gone
sas: driver on host 0000:74:02.0 cannot handle device 5000c500a75a860d,
error:5

To improve the reliability, retry TMF IO max of 3 times for SAS disks which
is the same as softreset does."

scsi: hisi_sas: Fix the conflict between device gone and host reset

Revision history for this message
Ike Panhc (ikepanhc) wrote :

Could you provide how to inject error into SAS driver please?

Changed in kunpeng920:
status: New → Incomplete
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Is this bug duplicate of bug 1853999?

dann frazier (dannf)
Changed in kunpeng920:
status: Incomplete → New
Revision history for this message
dann frazier (dannf) wrote :

Potential NULL pointer dereference.

dann frazier (dannf)
Changed in kunpeng920:
status: New → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
Changed in linux (Ubuntu Disco):
status: New → Triaged
Changed in linux (Ubuntu Eoan):
status: New → Triaged
Changed in linux (Ubuntu Focal):
status: New → Fix Released
Revision history for this message
Fred Kimmy (kongzizaixian) wrote :

=>Is this bug duplicate of bug 1853999?
No, it should inject diff error type.

Ike Panhc (ikepanhc)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Ike Panhc (ikepanhc)
status: Triaged → In Progress
Changed in linux (Ubuntu Disco):
assignee: nobody → Ike Panhc (ikepanhc)
status: Triaged → In Progress
Changed in linux (Ubuntu Eoan):
assignee: nobody → Ike Panhc (ikepanhc)
status: Triaged → In Progress
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Patch has been sent for review

Changed in kunpeng920:
status: Triaged → In Progress
Revision history for this message
Ike Panhc (ikepanhc) wrote :
Ike Panhc (ikepanhc)
description: updated
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Ike Panhc (ikepanhc)
Changed in kunpeng920:
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Thanks. Ubuntu-5.0.0-40.44 works for me.

tags: added: verification-done-disco
removed: verification-needed-disco
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (22.6 KiB)

This bug was fixed in the package linux - 5.0.0-40.44

---------------
linux (5.0.0-40.44) disco; urgency=medium

  * disco/linux: 5.0.0-40.44 -proposed tracker (LP: #1859724)

  * use-after-free in i915_ppgtt_close (LP: #1859522) // CVE-2020-7053
    - SAUCE: drm/i915: Fix use-after-free when destroying GEM context

  * CVE-2019-14615
    - drm/i915/gen9: Clear residual context state on context switch

  * System hang with kernel traces while entering reboot process on a Disco
    ARM64 moonshot node (LP: #1859582)
    - Revert "RDMA/cm: Fix memory leak in cm_add/remove_one"

linux (5.0.0-39.43) disco; urgency=medium

  * disco/linux: 5.0.0-39.43 -proposed tracker (LP: #1858547)

  * [Regression] usb usb2-port2: Cannot enable. Maybe the USB cable is bad?
    (LP: #1856608)
    - SAUCE: Revert "usb: handle warm-reset port requests on hub resume"

  * PAN is broken for execute-only user mappings on ARMv8 (LP: #1858815)
    - arm64: Revert support for execute-only user mappings

  * Fix unusable USB hub on Dell TB16 after S3 (LP: #1855312)
    - SAUCE: USB: core: Make port power cycle a seperate helper function
    - SAUCE: USB: core: Attempt power cycle port when it's in eSS.Disabled state

  * [sas-1126]scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()
    (LP: #1853992)
    - scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()

  * [sas-1126]scsi: hisi_sas: Assign NCQ tag for all NCQ commands (LP: #1853995)
    - scsi: hisi_sas: Assign NCQ tag for all NCQ commands

  * [sas-1126]scsi: hisi_sas: Fix the conflict between device gone and host
    reset (LP: #1853997)
    - scsi: hisi_sas: Fix the conflict between device gone and host reset

  * scsi: hisi_sas: Check sas_port before using it (LP: #1855952)
    - scsi: hisi_sas: Check sas_port before using it

  * CVE-2019-18885
    - btrfs: refactor btrfs_find_device() take fs_devices as argument
    - btrfs: merge btrfs_find_device and find_device

  * Integrate Intel SGX driver into linux-azure (LP: #1844245)
    - [Packaging] Add systemd service to load intel_sgx

  * [SRU][B/OEM-B/OEM-OSP1/D/E/F] Add LG I2C touchscreen multitouch support
    (LP: #1857541)
    - SAUCE: HID: multitouch: Add LG MELF0410 I2C touchscreen support

  * cifs: DFS Caching feature causing problems traversing multi-tier DFS setups
    (LP: #1854887)
    - cifs: Fix retrieval of DFS referrals in cifs_mount()

  * qede driver causes 100% CPU load (LP: #1855409)
    - qede: Handle infinite driver spinning for Tx timestamp.

  * [roce-1126]RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver
    (LP: #1853989)
    - RDMA/hns: Bugfix for slab-out-of-bounds when unloading hip08 driver
    - RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver

  * [roce-1126]RDMA/hns: Fixs hw access invalid dma memory error (LP: #1853990)
    - RDMA/hns: Fixs hw access invalid dma memory error

  * [hns-1126]net: hns3: revert to old channel when setting new channel num fail
    (LP: #1853983)
    - net: hns3: revert to old channel when setting new channel num fail

  * [hns-1126]net: hns3: fix port setting handle for fibre port
    (LP: #1853984)
    - net: hns3: fix port setting handle for fibre...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Both 4.15.0-87.87 and 5.3.0-40.32 work fine for me. Thanks.

tags: added: verification-done-bionic verification-done-eoan
removed: verification-needed-bionic verification-needed-eoan
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (78.1 KiB)

This bug was fixed in the package linux - 5.3.0-40.32

---------------
linux (5.3.0-40.32) eoan; urgency=medium

  * eoan/linux: 5.3.0-40.32 -proposed tracker (LP: #1861214)

  * No sof soundcard for 'ASoC: CODEC DAI intel-hdmi-hifi1 not registered' after
    modprobe sof (LP: #1860248)
    - ASoC: SOF: Intel: fix HDA codec driver probe with multiple controllers

  * ocfs2-tools is causing kernel panics in Ubuntu Focal (Ubuntu-5.4.0-9.12)
    (LP: #1852122)
    - ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less

  * QAT drivers for C3XXX and C62X not included as modules (LP: #1845959)
    - [Config] CRYPTO_DEV_QAT_C3XXX=m, CRYPTO_DEV_QAT_C62X=m and
      CRYPTO_DEV_QAT_DH895xCC=m

  * Eoan update: upstream stable patchset 2020-01-24 (LP: #1860816)
    - scsi: lpfc: Fix discovery failures when target device connectivity bounces
    - scsi: mpt3sas: Fix clear pending bit in ioctl status
    - scsi: lpfc: Fix locking on mailbox command completion
    - Input: atmel_mxt_ts - disable IRQ across suspend
    - f2fs: fix to update time in lazytime mode
    - iommu: rockchip: Free domain on .domain_free
    - iommu/tegra-smmu: Fix page tables in > 4 GiB memory
    - dmaengine: xilinx_dma: Clear desc_pendingcount in xilinx_dma_reset
    - scsi: target: compare full CHAP_A Algorithm strings
    - scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
    - scsi: csiostor: Don't enable IRQs too early
    - scsi: hisi_sas: Replace in_softirq() check in hisi_sas_task_exec()
    - powerpc/pseries: Mark accumulate_stolen_time() as notrace
    - powerpc/pseries: Don't fail hash page table insert for bolted mapping
    - powerpc/tools: Don't quote $objdump in scripts
    - dma-debug: add a schedule point in debug_dma_dump_mappings()
    - leds: lm3692x: Handle failure to probe the regulator
    - clocksource/drivers/asm9260: Add a check for of_clk_get
    - clocksource/drivers/timer-of: Use unique device name instead of timer
    - powerpc/security/book3s64: Report L1TF status in sysfs
    - powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
    - ext4: update direct I/O read lock pattern for IOCB_NOWAIT
    - ext4: iomap that extends beyond EOF should be marked dirty
    - jbd2: Fix statistics for the number of logged blocks
    - scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
    - scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
    - f2fs: fix to update dir's i_pino during cross_rename
    - clk: qcom: Allow constant ratio freq tables for rcg
    - clk: clk-gpio: propagate rate change to parent
    - irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
    - irqchip: ingenic: Error out if IRQ domain creation failed
    - fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    - scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
    - PCI: rpaphp: Fix up pointer to first drc-info entry
    - scsi: ufs: fix potential bug which ends in system hang
    - powerpc/pseries/cmm: Implement release() function for sysfs device
    - PCI: rpaphp: Don't rely on firmware feature to imply drc-info support
    - PCI: rpaphp: Annotate and corr...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (79.8 KiB)

This bug was fixed in the package linux - 4.15.0-88.88

---------------
linux (4.15.0-88.88) bionic; urgency=medium

  * bionic/linux: 4.15.0-88.88 -proposed tracker (LP: #1862824)

  * Segmentation fault (kernel oops) with memory-hotplug in
    ubuntu_kernel_selftests on Bionic kernel (LP: #1862312)
    - Revert "mm/memory_hotplug: fix online/offline_pages called w.o.
      mem_hotplug_lock"
    - mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

linux (4.15.0-87.87) bionic; urgency=medium

  * bionic/linux: 4.15.0-87.87 -proposed tracker (LP: #1861165)

  * Bionic update: upstream stable patchset 2020-01-22 (LP: #1860602)
    - scsi: lpfc: Fix discovery failures when target device connectivity bounces
    - scsi: mpt3sas: Fix clear pending bit in ioctl status
    - scsi: lpfc: Fix locking on mailbox command completion
    - Input: atmel_mxt_ts - disable IRQ across suspend
    - iommu/tegra-smmu: Fix page tables in > 4 GiB memory
    - scsi: target: compare full CHAP_A Algorithm strings
    - scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
    - scsi: csiostor: Don't enable IRQs too early
    - powerpc/pseries: Mark accumulate_stolen_time() as notrace
    - powerpc/pseries: Don't fail hash page table insert for bolted mapping
    - powerpc/tools: Don't quote $objdump in scripts
    - dma-debug: add a schedule point in debug_dma_dump_mappings()
    - clocksource/drivers/asm9260: Add a check for of_clk_get
    - powerpc/security/book3s64: Report L1TF status in sysfs
    - powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
    - ext4: update direct I/O read lock pattern for IOCB_NOWAIT
    - jbd2: Fix statistics for the number of logged blocks
    - scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
    - scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
    - f2fs: fix to update dir's i_pino during cross_rename
    - clk: qcom: Allow constant ratio freq tables for rcg
    - irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
    - irqchip: ingenic: Error out if IRQ domain creation failed
    - fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    - scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
    - scsi: ufs: fix potential bug which ends in system hang
    - powerpc/pseries/cmm: Implement release() function for sysfs device
    - powerpc/security: Fix wrong message when RFI Flush is disable
    - scsi: atari_scsi: sun3_scsi: Set sg_tablesize to 1 instead of SG_NONE
    - clk: pxa: fix one of the pxa RTC clocks
    - bcache: at least try to shrink 1 node in bch_mca_scan()
    - HID: logitech-hidpp: Silence intermittent get_battery_capacity errors
    - libnvdimm/btt: fix variable 'rc' set but not used
    - HID: Improve Windows Precision Touchpad detection.
    - scsi: pm80xx: Fix for SATA device discovery
    - scsi: ufs: Fix error handing during hibern8 enter
    - scsi: scsi_debug: num_tgts must be >= 0
    - scsi: NCR5380: Add disconnect_mask module parameter
    - scsi: iscsi: Don't send data to unbound connection
    - scsi: target: iscsi: Wait for all commands to finish before freeing a
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Updating kunpeng920 18.04 series to match linux bionic series.

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Released for Eoan.

Ike Panhc (ikepanhc)
Changed in kunpeng920:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.