fstrim on nvme / AMD CPU fails and produces kernel error messages

Bug #1856603 reported by Seth Bromberger
This bug affects 1 person
Affects          Status         Importance   Assigned to    Milestone
linux (Ubuntu)   Invalid        Undecided    Unassigned
  Bionic         Fix Released   Medium       Connor Kuehl
  Disco          Won't Fix      Medium       Connor Kuehl
  Eoan           Fix Released   Medium       Connor Kuehl

Bug Description

[Impact]

Discard requests can fail on a non-compliant NVMe device. As a result, routine maintenance runs of fstrim fail and unused blocks are never discarded.

[Test case]

Run fstrim as root (from the bug report: fstrim -v /)

Expected result: "/: 758.3 GiB (814159003648 bytes) trimmed" -- will vary depending on the blocks that are unused for your system

Unpatched actual result: "fstrim: /: FITRIM ioctl failed: Input/output error"
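
The two results above can be told apart mechanically. The following is a minimal sketch, assuming the output strings match the ones quoted in this test case; classify_fstrim_output is a hypothetical helper, not part of util-linux.

```shell
#!/bin/sh
# Classify one line of `fstrim -v` output.
# classify_fstrim_output is a hypothetical helper name used for illustration.
classify_fstrim_output() {
    case "$1" in
        *"FITRIM ioctl failed"*) echo "FAIL" ;;     # unpatched result
        *"bytes) trimmed"*)      echo "PASS" ;;     # expected result
        *)                       echo "UNKNOWN" ;;  # anything else
    esac
}

# Example invocation (needs root; not executed by this snippet):
#   classify_fstrim_output "$(fstrim -v / 2>&1)"
```

The function only inspects text, so it can be tried against saved output without touching the disk.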

[Regression Potential]

This patch only increases the size of a memory allocation and does not add any changes in logic for error handling or normal flow of control. This routine already handles the case where the memory allocation fails. Because of this, it is a low risk change.

Original bug description below:
--------------------------------------

/dev/nvme0n1 Sabrent Rocket 4.0 1TB firmware RKT401.1

on Ubuntu 19.10 with an ASRock 300 Deskmini motherboard and a Ryzen 3400G CPU. The filesystem is ext4:

Linux elemental 5.3.0-24-generic #26-Ubuntu SMP Thu Nov 14 01:33:18 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

UUID=c1812230-91be-4a18-8055-c3b7c82fbbd8 / ext4 defaults 0 0
/dev/nvme0n1p2 on / type ext4 (rw,relatime)

When I run fstrim -v / as root, I get the following error message at the command line:

seth@elemental:~$ sudo fstrim -v /
fstrim: /: FITRIM ioctl failed: Input/output error

and the following kernel messages are logged:

[ 136.309115] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x0 flags=0x0000]
[ 136.309129] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x700 flags=0x0000]
[ 136.309139] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x680 flags=0x0000]
[ 136.309150] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x300 flags=0x0000]
[ 136.309162] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x200 flags=0x0000]
[ 136.309171] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x580 flags=0x0000]
[ 136.309180] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x100 flags=0x0000]
[ 136.309189] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x400 flags=0x0000]
[ 136.309198] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x380 flags=0x0000]
[ 136.309207] nvme 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x780 flags=0x0000]
[ 136.309216] amd_iommu_report_page_fault: 1 callbacks suppressed
[ 136.309218] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x180 flags=0x0000]
[ 136.309228] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x500 flags=0x0000]
[ 136.309238] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x480 flags=0x0000]
[ 136.309250] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x280 flags=0x0000]
[ 136.309259] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x600 flags=0x0000]
[ 136.309269] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x80 flags=0x0000]
[ 136.309279] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x980 flags=0x0000]
[ 136.309291] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x900 flags=0x0000]
[ 136.309301] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0x880 flags=0x0000]
[ 136.309311] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0000 address=0xa00 flags=0x0000]
[ 136.309762] blk_update_request: I/O error, dev nvme0n1, sector 1141976 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
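
A log like the one above can be summarised before and after a fix to confirm the errors are gone. This is a rough sketch that only counts lines; summarize_discard_log is a hypothetical helper and the match strings are taken from the messages quoted in this report.

```shell
#!/bin/sh
# Count IOMMU page faults and failed DISCARD requests in a kernel log
# read from standard input. summarize_discard_log is a hypothetical name.
summarize_discard_log() {
    awk '
        /IO_PAGE_FAULT/            { faults++ }    # AMD-Vi event lines
        /I\/O error/ && /DISCARD/  { discards++ }  # block layer discard failures
        END { printf "page_faults=%d discard_errors=%d\n", faults, discards }
    '
}

# Example invocation (not executed by this snippet):
#   dmesg | summarize_discard_log
```

For the excerpt above this reports 20 page faults and 1 discard error; a patched kernel should report zero for both after an fstrim run.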

I have tried setting iommu passthrough on boot but this doesn’t seem to help:

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=pt avic=1"
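
When testing boot parameters like these, it can help to confirm that the running kernel actually received them. A minimal sketch, assuming the parameters appear space-separated on the kernel command line; has_cmdline_param is a hypothetical helper.

```shell
#!/bin/sh
# Return success if a parameter appears as a whole word in a kernel
# command line string. has_cmdline_param is a hypothetical helper name.
has_cmdline_param() {
    # $1 = parameter to look for, $2 = full command line
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example invocation (not executed by this snippet):
#   has_cmdline_param "amd_iommu=pt" "$(cat /proc/cmdline)" && echo present
```

The padding with spaces makes the match exact, so a search for one parameter does not succeed on a longer parameter that merely contains it.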

This is possibly related to:

https://bugzilla.kernel.org/show_bug.cgi?id=202665
http://git.infradead.org/nvme.git/commitdiff/530436c45ef2e446c12538a400e465929a0b3ade?hp=400b6a7b13a3fd71cff087139ce45dd1e5fff444

Seth Bromberger (sbromberger) wrote :

Update: I built a 5.3.10 kernel with the patch suggested in the infradead link (patch located at http://git.infradead.org/nvme.git/patch/530436c45ef2e446c12538a400e465929a0b3ade?hp=400b6a7b13a3fd71cff087139ce45dd1e5fff444) and the errors went away; fstrim is working properly:

seth@elemental:~$ sudo fstrim -v /
/: 758.3 GiB (814159003648 bytes) trimmed

and no kernel errors.

Could this be merged soon? Thank you.

Connor Kuehl (connork)
Changed in linux (Ubuntu):
status: New → Invalid
Changed in linux (Ubuntu Eoan):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Connor Kuehl (connork)
no longer affects: fstrim (Ubuntu Eoan)
Connor Kuehl (connork)
no longer affects: fstrim (Ubuntu)
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Disco):
status: New → In Progress
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
assignee: nobody → Connor Kuehl (connork)
Changed in linux (Ubuntu Disco):
assignee: nobody → Connor Kuehl (connork)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Connor Kuehl (connork)
no longer affects: fstrim (Ubuntu Xenial)
no longer affects: fstrim (Ubuntu Bionic)
no longer affects: fstrim (Ubuntu Disco)
Connor Kuehl (connork) wrote :

It looks like the version of this routine in Xenial is not impacted, since it hasn't received the refactor patch that constrains the allocation to 16 bytes: 03b5929ebb20 ("nvme: rewrite discard support"). I will remove the Xenial nomination.
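
To put the 16-byte figure in context, here is a small arithmetic sketch. The sizes are assumptions taken from the NVMe Dataset Management range format (16 bytes per range, at most 256 ranges per command) and are not stated in this report; dsm_buffer_bytes is a hypothetical helper.

```shell
#!/bin/sh
# Assumed sizes (from the NVMe Dataset Management format, not from this report).
RANGE_BYTES=16     # size of one range descriptor
MAX_RANGES=256     # maximum number of ranges in one command

# Print the buffer size needed for a given number of ranges.
# dsm_buffer_bytes is a hypothetical helper name.
dsm_buffer_bytes() {
    # $1 = number of ranges the buffer must hold
    echo $(( $1 * RANGE_BYTES ))
}

dsm_buffer_bytes 1              # single-range buffer
dsm_buffer_bytes "$MAX_RANGES"  # buffer sized for the maximum
```

Under these assumptions a single-range buffer is 16 bytes and a full-size one is 4096 bytes. Every faulting address in the kernel log from the original description (0x0 through 0xa00) is below 4096, which would be consistent with a device reading past a 16-byte buffer.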

description: updated
no longer affects: linux (Ubuntu Xenial)
Connor Kuehl (connork) wrote :
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Seth Bromberger (sbromberger) wrote :

I'm not quite sure how to go about testing this proposed kernel. The instructions assume a GUI which I don't have. Could you point me to steps to install the proposed kernel via CLI?

Unless the request to test this is intended for someone else, in which case please disregard.

Khaled El Mously (kmously) wrote :

Hi Seth,

The second half of the instructions explains how to enable the proposed pocket. Here is the relevant text for convenience:

-------------------------------
Or you can modify the software sources manually by adding the following line to /etc/apt/sources.list:

deb http://archive.ubuntu.com/ubuntu/ xenial-proposed restricted main multiverse universe

If you are using a port arch such as armhf/arm64/ppc64el/s390x you need to add the following line instead :

deb http://ports.ubuntu.com/ubuntu-ports xenial-proposed restricted main multiverse universe

Replace "xenial" with "trusty", "vivid", "utopic", "precise", or "lucid" depending on which release you are on.
-------------------------------

Thanks in advance!
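
For a CLI-only system, the quoted line can be generated for whichever release is installed. This is a sketch under the assumption that the same line format applies to newer releases such as eoan or bionic; proposed_source_line is a hypothetical helper, and the port-architecture variant quoted above is not covered.

```shell
#!/bin/sh
# Print the sources.list line for a release's -proposed pocket.
# proposed_source_line is a hypothetical helper name.
proposed_source_line() {
    # $1 = release codename, for example eoan or bionic
    echo "deb http://archive.ubuntu.com/ubuntu/ $1-proposed restricted main multiverse universe"
}

# Example invocation (needs root; not executed by this snippet):
#   proposed_source_line "$(lsb_release -cs)" | sudo tee -a /etc/apt/sources.list
#   sudo apt update
```

Printing the line first, rather than appending it directly, leaves a chance to review it before changing a production system.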

Seth Bromberger (sbromberger) wrote :

Thanks. I read that but I'm 1) on eoan, and 2) already running a custom kernel 5.3.10+ that has this fix. 1) should be an easy substitute, but I'm really nervous about trying to shoehorn instructions from the section following the one you quoted into a system that's already pretty customized (and running in production).

I'm not expert enough in switching among different kernels to be remotely comfortable with the instructions as they're currently laid out, so I guess I have to wait for someone who knows what they're doing to test this.

If it helps, I can tell you that I applied the diff that started this whole thing and rebuilt the kernel and things have been working well ever since.

Khaled El Mously (kmously) wrote :

@Seth. Fair enough. I'm happy to consider the bug verified if you have no further issues.

Thanks for reporting the bug!

tags: added: verification-done-bionic verification-done-eoan
removed: verification-needed-bionic verification-needed-eoan
Seth Bromberger (sbromberger) wrote :

Thank you. I've been using the patch since late December with no issues.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-40.32

---------------
linux (5.3.0-40.32) eoan; urgency=medium

  * eoan/linux: 5.3.0-40.32 -proposed tracker (LP: #1861214)

  * No sof soundcard for 'ASoC: CODEC DAI intel-hdmi-hifi1 not registered' after
    modprobe sof (LP: #1860248)
    - ASoC: SOF: Intel: fix HDA codec driver probe with multiple controllers

  * ocfs2-tools is causing kernel panics in Ubuntu Focal (Ubuntu-5.4.0-9.12)
    (LP: #1852122)
    - ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less

  * QAT drivers for C3XXX and C62X not included as modules (LP: #1845959)
    - [Config] CRYPTO_DEV_QAT_C3XXX=m, CRYPTO_DEV_QAT_C62X=m and
      CRYPTO_DEV_QAT_DH895xCC=m

  * Eoan update: upstream stable patchset 2020-01-24 (LP: #1860816)
    - scsi: lpfc: Fix discovery failures when target device connectivity bounces
    - scsi: mpt3sas: Fix clear pending bit in ioctl status
    - scsi: lpfc: Fix locking on mailbox command completion
    - Input: atmel_mxt_ts - disable IRQ across suspend
    - f2fs: fix to update time in lazytime mode
    - iommu: rockchip: Free domain on .domain_free
    - iommu/tegra-smmu: Fix page tables in > 4 GiB memory
    - dmaengine: xilinx_dma: Clear desc_pendingcount in xilinx_dma_reset
    - scsi: target: compare full CHAP_A Algorithm strings
    - scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
    - scsi: csiostor: Don't enable IRQs too early
    - scsi: hisi_sas: Replace in_softirq() check in hisi_sas_task_exec()
    - powerpc/pseries: Mark accumulate_stolen_time() as notrace
    - powerpc/pseries: Don't fail hash page table insert for bolted mapping
    - powerpc/tools: Don't quote $objdump in scripts
    - dma-debug: add a schedule point in debug_dma_dump_mappings()
    - leds: lm3692x: Handle failure to probe the regulator
    - clocksource/drivers/asm9260: Add a check for of_clk_get
    - clocksource/drivers/timer-of: Use unique device name instead of timer
    - powerpc/security/book3s64: Report L1TF status in sysfs
    - powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
    - ext4: update direct I/O read lock pattern for IOCB_NOWAIT
    - ext4: iomap that extends beyond EOF should be marked dirty
    - jbd2: Fix statistics for the number of logged blocks
    - scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
    - scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
    - f2fs: fix to update dir's i_pino during cross_rename
    - clk: qcom: Allow constant ratio freq tables for rcg
    - clk: clk-gpio: propagate rate change to parent
    - irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
    - irqchip: ingenic: Error out if IRQ domain creation failed
    - fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    - scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
    - PCI: rpaphp: Fix up pointer to first drc-info entry
    - scsi: ufs: fix potential bug which ends in system hang
    - powerpc/pseries/cmm: Implement release() function for sysfs device
    - PCI: rpaphp: Don't rely on firmware feature to imply drc-info support
    - PCI: rpaphp: Annotate and corr...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-88.88

---------------
linux (4.15.0-88.88) bionic; urgency=medium

  * bionic/linux: 4.15.0-88.88 -proposed tracker (LP: #1862824)

  * Segmentation fault (kernel oops) with memory-hotplug in
    ubuntu_kernel_selftests on Bionic kernel (LP: #1862312)
    - Revert "mm/memory_hotplug: fix online/offline_pages called w.o.
      mem_hotplug_lock"
    - mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock

linux (4.15.0-87.87) bionic; urgency=medium

  * bionic/linux: 4.15.0-87.87 -proposed tracker (LP: #1861165)

  * Bionic update: upstream stable patchset 2020-01-22 (LP: #1860602)
    - scsi: lpfc: Fix discovery failures when target device connectivity bounces
    - scsi: mpt3sas: Fix clear pending bit in ioctl status
    - scsi: lpfc: Fix locking on mailbox command completion
    - Input: atmel_mxt_ts - disable IRQ across suspend
    - iommu/tegra-smmu: Fix page tables in > 4 GiB memory
    - scsi: target: compare full CHAP_A Algorithm strings
    - scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
    - scsi: csiostor: Don't enable IRQs too early
    - powerpc/pseries: Mark accumulate_stolen_time() as notrace
    - powerpc/pseries: Don't fail hash page table insert for bolted mapping
    - powerpc/tools: Don't quote $objdump in scripts
    - dma-debug: add a schedule point in debug_dma_dump_mappings()
    - clocksource/drivers/asm9260: Add a check for of_clk_get
    - powerpc/security/book3s64: Report L1TF status in sysfs
    - powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
    - ext4: update direct I/O read lock pattern for IOCB_NOWAIT
    - jbd2: Fix statistics for the number of logged blocks
    - scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
    - scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
    - f2fs: fix to update dir's i_pino during cross_rename
    - clk: qcom: Allow constant ratio freq tables for rcg
    - irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
    - irqchip: ingenic: Error out if IRQ domain creation failed
    - fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    - scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
    - scsi: ufs: fix potential bug which ends in system hang
    - powerpc/pseries/cmm: Implement release() function for sysfs device
    - powerpc/security: Fix wrong message when RFI Flush is disable
    - scsi: atari_scsi: sun3_scsi: Set sg_tablesize to 1 instead of SG_NONE
    - clk: pxa: fix one of the pxa RTC clocks
    - bcache: at least try to shrink 1 node in bch_mca_scan()
    - HID: logitech-hidpp: Silence intermittent get_battery_capacity errors
    - libnvdimm/btt: fix variable 'rc' set but not used
    - HID: Improve Windows Precision Touchpad detection.
    - scsi: pm80xx: Fix for SATA device discovery
    - scsi: ufs: Fix error handing during hibern8 enter
    - scsi: scsi_debug: num_tgts must be >= 0
    - scsi: NCR5380: Add disconnect_mask module parameter
    - scsi: iscsi: Don't send data to unbound connection
    - scsi: target: iscsi: Wait for all commands to finish before freeing a
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Seth Bromberger (sbromberger) wrote :

Thank you all for your help, especially @connork, who stepped me through building the custom kernel and shepherded this bug report through its early days.

Seth Bromberger (sbromberger) wrote :

(I can confirm that the new kernel fixes the original problem.)

Steve Langasek (vorlon)
Changed in linux (Ubuntu Disco):
status: Fix Committed → Won't Fix