IO errors when writing large amounts of data to USB storage in eoan on RPI2/3 (armhf kernel)

Bug #1852510 reported by Paul Larson
This bug affects 1 person
Affects                Status        Importance  Assigned to  Milestone
linux-raspi2 (Ubuntu)  Fix Released  Critical    Unassigned
Eoan                   Fix Released  Critical    Hui Wang

Bug Description

[Impact]
On RPI2/3 boards running the eoan armhf kernel, copying large files
to a usb stick makes the usb host driver dwc_otg fail and print many
IO errors in dmesg.

[Fix]
To support the rpi4, we enabled LPAE/HIGHMEM/VMSPLIT_3G in the armhf
kernel, and dwc_otg has a problem when highmem is enabled. If the
urb's buffer is in the highmem region, the enqueue function returns
-EINVAL unconditionally, so it cannot handle any urb whose buffer is
in highmem. The driver itself can handle highmem buffers, though; it
only needs a small change.

[Test Case]
With the patch applied, I tested the armhf and arm64 kernels on rpi2/3/4;
all worked well, with no regressions, and the usb driver works correctly.

[Regression Risk]
Low. Upstream has already reviewed my patch and agreed with the
change, and I tested the patch on rpi2/3/4 with armhf and arm64
kernels; all worked well.

Kernel tested:
Linux ubuntu 5.3.0-1012-raspi2 #14-Ubuntu SMP Mon Nov 11 10:08:39 UTC 2019 armv7l armv7l armv7l GNU/Linux

I've only been able to reproduce this with the armhf kernel and on the following devices:
RPI3B+
RPI3B
RPI2

At the moment, it does not appear that arm64 is affected, nor are RPI3A+ and RPI4 (at least not the 2GB version)

Steps to reproduce:
- Insert and mount a USB storage device
- cp a large file to it (300-600MB recommended - smaller files will sometimes not trigger it)
- sync
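
The copy and sync steps can also be scripted. The following is a minimal sketch, not part of the original report: the function name and chunk size are arbitrary, and it calls fsync on the destination file only, which is narrower than the system-wide sync used above.

```python
import os
import time

CHUNK = 4 * 1024 * 1024  # 4 MiB per write; an arbitrary choice


def copy_and_sync(src_path, dst_path):
    """Copy src_path to dst_path, flush it to the device, return (bytes, seconds)."""
    start = time.monotonic()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
        dst.flush()
        # Ask the kernel to write the file's dirty pages out to the device now.
        os.fsync(dst.fileno())
    return copied, time.monotonic() - start
```

For example, copy_and_sync("big.img", "/mnt/usb/big.img"), where both paths are placeholders for a 300-600MB source file and the mounted USB device.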

After running the sync, a lot of IO errors will show up in dmesg like:
[ 176.129299] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 176.129326] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 2e 24 b0 00 00 f0 00
[ 176.129349] blk_update_request: I/O error, dev sda, sector 3024048 op 0x1:(WRITE) flags 0x4000 phys_seg 15 prio class 0
[ 176.883968] usb 1-1.1.2: reset high-speed USB device number 8 using dwc_otg
[ 177.079960] usb 1-1.1.2: reset high-speed USB device number 8 using dwc_otg

It eventually finishes, and if you unmount/remount the device, the checksum will be different from the original file.
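
Both symptoms (IO errors in the kernel log and a checksum mismatch) can be checked with a small script. This is a minimal sketch, not part of the original report: the function names are made up, the regular expression only matches the blk_update_request line format quoted above, and the checksum comparison is only meaningful after the device has been unmounted and remounted, so that the data is read back from the stick rather than from the page cache.

```python
import hashlib
import re

# Matches lines like the blk_update_request entry quoted in this report.
IO_ERROR_RE = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def find_io_errors(dmesg_text):
    """Return (device, sector) pairs for each block IO error in the log text."""
    return [(m.group(1), int(m.group(2))) for m in IO_ERROR_RE.finditer(dmesg_text)]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path, copy_path):
    """True if the copy on the USB device has the same SHA-256 as the original."""
    return sha256_of(original_path) == sha256_of(copy_path)
```

The log text can come from saved dmesg output; the two paths passed to copies_match are the original file and its copy on the remounted device.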

Paul Larson (pwlars) wrote :
Changed in linux-raspi2 (Ubuntu):
importance: Undecided → Critical
Changed in linux-raspi2 (Ubuntu Eoan):
importance: Undecided → Critical
Chris Wayne (cwayne) wrote :

Note that according to Dave this issue doesn't appear using the Disco kernel

Hui Wang (hui.wang)
Changed in linux-raspi2 (Ubuntu Eoan):
assignee: nobody → Hui Wang (hui.wang)
Hui Wang (hui.wang) wrote :

I reproduced this problem on the rpi3B+ and rpi2Bv1.1 with armhf kernel.

This problem was introduced by enabling two kernel configs for the arm32 kernel: HIGHMEM and VMSPLIT_3G.

We need a single arm32 kernel to support rpi2/3/4. The rpi4 comes with 1G/2G/4G of ram, and to support the 2G and 4G variants we have to enable HIGHMEM and VMSPLIT_3G; otherwise the system can't use the full 2G/4G of memory.

The upstream kernel of https://github.com/raspberrypi/linux.git also has this issue.

Some explanation:
These two configs are specific to the 32-bit kernel, so we can't reproduce this issue on the arm64 kernel.
The usb ports on the rpi4 are routed to the xhci host controller instead of the dwc_otg host controller, so we can't reproduce this issue on the rpi4 even with the armhf kernel.
The rpi3A+ board has only 512M of physical memory, so no ram is mapped to the highmem region and we can't reproduce this problem on the rpi3A+.
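
The memory argument above can be illustrated with some rough arithmetic. This is only a sketch under stated assumptions: the 256 MiB reserved for vmalloc and fixed mappings is an approximation (the real value depends on the kernel version and the vmalloc= boot parameter), GPU memory is ignored, and the 2G/2G split is included only for comparison.

```python
MIB_PER_GIB = 1024

# Assumption: roughly 256 MiB of the kernel's virtual address space is set
# aside for vmalloc and fixed mappings, so it cannot map RAM directly.
RESERVED_MIB = 256

# Kernel share of the 4 GiB 32-bit address space for each split option.
KERNEL_SPACE_MIB = {
    "VMSPLIT_3G": 1 * MIB_PER_GIB,  # 3G user / 1G kernel
    "VMSPLIT_2G": 2 * MIB_PER_GIB,  # 2G user / 2G kernel
}


def highmem_mib(ram_mib, split):
    """Estimate how much RAM (in MiB) ends up in the highmem region."""
    lowmem_limit = KERNEL_SPACE_MIB[split] - RESERVED_MIB
    return max(0, ram_mib - lowmem_limit)


for board, ram in [("rpi3A+", 512), ("rpi2/3B/3B+", 1024)]:
    for split in ("VMSPLIT_3G", "VMSPLIT_2G"):
        print(f"{board:12} {split}: ~{highmem_mib(ram, split)} MiB highmem")
```

Under these assumptions, only the 1G boards with the 3G/1G split have any ram in highmem, which matches the set of boards where the problem was reproduced.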

I guess the problem is in the dwc_otg usb host driver: it depends on dma to work, but when dma gets a highmem buffer it no longer works properly. I will report this issue to the rpi community and try to find a fix asap.

So far, the workaround is to set total_mem=512 in config.txt for rpi2/3 running the 32-bit kernel; then no memory is mapped to the highmem region.

Dave Jones (waveform) wrote :

> The upstream kernel of https://github.com/raspberrypi/linux.git also has this issue.

This may be one reason why Raspbian uses three different kernels (one for 0/B+, one for 2B/3B/3B+, one for 4B).

Changed in linux-raspi2 (Ubuntu Eoan):
status: New → In Progress
Dave Jones (waveform) wrote :

Some additional notes from testing last night (more or less blindly after a load of googling for dwc_otg errors and mitigations):

* Adding dwc_otg.speed=1 (limiting the driver speed to Full Speed USB1.1), fixes the mass-storage issue, but breaks compatibility with most keyboards. So, not terribly useful as a work-around but may be helpful in debugging?

* All other attempts either yielded no difference or just broke things further; did have a quick play with the various DMA options (e.g. dma_enable) on the basis of your notes above but managed no improvement.

Hui Wang (hui.wang) wrote :

In theory, we could enable HIGHMEM and VMSPLIT_2G (not VMSPLIT_3G) for the arm32 kernel to work around this issue: with 2G of user space and 2G of kernel space, boards with only 1G of physical ram (rpi2/3) will not map any ram to the highmem region.

But I remember that setting VMSPLIT_2G breaks the SD card on the rpi4, so the rpi4 can't mount the rootfs on the sd card. I will retest the VMSPLIT_2G setting; maybe with the updated patches the sd controller can work with VMSPLIT_2G.

Hui Wang (hui.wang) wrote :
Łukasz Zemczak (sil2100) wrote :

Thank you Hui! Please continue to communicate with upstream about possible solutions. In the meantime, were you able to re-test the VMSPLIT_2G setting with the current updated patches?

Hui Wang (hui.wang) wrote :

I tested VMSPLIT_2G today: the mmc/sd controller no longer works on rpi4 boards, so enabling VMSPLIT_2G is not a solution so far.

Today I also tested dwc2 and it worked well. Maybe we could enable dwc2 instead of dwc_otg; then we could still use a single kernel to support rpi2/3/4. dwc2 worked well with both VMSPLIT_2G and VMSPLIT_3G.

I am building formal armhf and arm64 kernels with dwc2 enabled and will test all kernels on all boards. I will share the kernels on lp so that anyone can help test them.

thx.

Dave Jones (waveform) wrote : Re: [Bug 1852510] Re: IO errors when writing large amounts of data to USB storage in eoan on RPI

On Fri, 15 Nov 2019 at 09:30, Hui Wang <email address hidden> wrote:
>
> I did test with VMSPLIT_2G today, the mmc/sd controller will not work
> anymore on rpi4 boards, so enable VMSPLIT_2G is not a solution so far.

Oh well, was worth a try.

> And today I also tested dwc2, it worked well, maybe we could enable dwc2
> instead of dwc_otg, then we could use a single kernel to support
> rpi2/3/4. dwc2 worked well both with VMSPLIT_2G and VMSPLIT_3G.

Interesting; I noted upstream talking about the optimization of
dwc_otg (I'm guessing, from a quick skim of each, that's mostly from
the FIQ FSM stuff in dwc_otg which dwc2 seems to lack); we should
probably benchmark the performance differences between dwc_otg and
dwc2, both USB transfer speeds, and load on the ARM during transfers
(my hunch is, if anything is affected, it'll be the latter - I vaguely
recall forum posts about improvements on the ARM load during large USB
transactions).

> I am building a formal armhf and arm64 kernel with dwc2 enabled, will
> test all kernels on all boards. And I will share the kernels to the lp,
> anyone could help test them.

Given arm64 isn't affected, would it be worth sticking with dwc_otg on
that arch? Or is the inconsistency (e.g. different capabilities?) an
issue in and of itself?

I'm more than happy to test out some kernels when they're available!

Thanks,

Dave.

Hui Wang (hui.wang) wrote : Re: IO errors when writing large amounts of data to USB storage in eoan on RPI

These are the kernels with the dwc2 enabled (replaced dwc_otg with dwc2 for both arm64 and armhf kernels).

https://people.canonical.com/~hwang4/dwc2/

I just tested the arm64 and armhf kernels on rpi4 boards; the uart/ethernet/sd card/sound/hdmi display/wifi/usb host all worked well.

I will test them on rpi2/3 boards and compare the usb performance (by copying large files). I didn't test usb performance on the rpi4, since the usb host on the rpi4 is xhci rather than dwc_otg/dwc2.

Hui Wang (hui.wang) wrote :

Tested the new kernels on rpi3B+, rpi3A+ and rpi2Bv1.1; those devices worked as well as the rpi4.

I also compared the usb performance by copying a 700M file to a usb stick: there is no significant difference between dwc_otg and dwc2, each taking around 2:40. I also tested usb performance while 'stress-ng -c 4' was running, with no significant difference either.

Dave Jones (waveform) wrote :

Tested the new armhf kernel on RPi3B+; copied 40Mb and 600Mb files successfully with no issues. Compared performance of the 40Mb copy+sync to the same machine running Disco (which has dwc_otg) and performance on eoan with the new kernel was marginally quicker (6.8-7.0s on eoan with dwc2 vs 8.2s-8.5s on disco with dwc_otg). I'd guess that that's down to kernel improvements generally between disco and eoan rather than anything to do with the driver, but at least it demonstrates there's no performance regression.

Looking good overall!

Brian Murray (brian-murray) wrote :

Dave in comment #10 you said "load on the ARM during transfers (my hunch is, if anything is affected, it'll be the latter - I vaguely recall forum posts about improvements on the ARM load during large USB transactions)." Were the systems checked for load during the transfer tests?

Additionally, would it be worthwhile to test different versions of USB media? I'd really hate to be caught off guard by something.

Hui Wang (hui.wang) wrote :

I have probably found the root cause of this problem.

After enabling HIGHMEM and VMSPLIT_3G, the urb sent to the dwc_otg driver may have its buffer allocated in highmem. If it is in highmem, the usb core does not set urb->transfer_buffer before calling hcd->urb_enqueue().

Other hcd drivers can handle urb->transfer_buffer being NULL, but the dwc_otg hcd driver returns an error unconditionally if urb->transfer_buffer is NULL. I checked the code, and the dwc_otg driver can handle the case where the urb buffer is in highmem (urb->transfer_buffer is NULL).

Maybe we can fix this issue with a simple patch. I will post my findings to the upstream bug https://github.com/raspberrypi/linux/issues/3332

hwang4@hwang4-Vostro-5390:~/work/mainline/raspi/linux-32$ git diff
diff --git a/drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c b/drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c
index 08a3e41038a3..5d04adfc58c0 100644
--- a/drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c
+++ b/drivers/usb/host/dwc_otg/dwc_otg_hcd_linux.c
@@ -821,10 +821,10 @@ static int dwc_otg_urb_enqueue(struct usb_hcd *hcd,
                dump_urb_info(urb, "dwc_otg_urb_enqueue");
        }
 #endif
-
+#if 0
        if (!urb->transfer_buffer && urb->transfer_buffer_length)
                return -EINVAL;
-
+#endif
        if ((usb_pipetype(urb->pipe) == PIPE_ISOCHRONOUS)
            || (usb_pipetype(urb->pipe) == PIPE_INTERRUPT)) {
                if (!dwc_otg_hcd_is_bandwidth_allocated
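
The effect of disabling the early check can be modeled outside the kernel. The following is a toy model in Python, not kernel code: the Urb class is a hypothetical stand-in for struct urb with only three fields, the DMA address is a made-up value, and the function only mirrors the condition shown in the diff above.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class Urb:
    """Toy stand-in for the kernel's struct urb; only three fields are modeled."""
    transfer_buffer: Optional[bytes]  # None models a highmem buffer with no kernel address
    transfer_dma: Optional[int]       # DMA address of the buffer (hypothetical value)
    transfer_buffer_length: int


def enqueue(urb, early_check):
    """Return 0 if the urb is accepted, or a negative errno if it is rejected."""
    if early_check:
        # Same condition as the check wrapped in '#if 0' by the diff.
        if urb.transfer_buffer is None and urb.transfer_buffer_length:
            return -errno.EINVAL
    # In this model, anything past the check is accepted for transfer.
    return 0


highmem_urb = Urb(transfer_buffer=None, transfer_dma=0x3A000000,
                  transfer_buffer_length=4096)
print(enqueue(highmem_urb, early_check=True))   # rejected with -EINVAL
print(enqueue(highmem_urb, early_check=False))  # accepted
```

In this model, a urb whose buffer has no kernel virtual address is rejected only while the early check is active, even though a DMA address for the data is available.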

Hui Wang (hui.wang) wrote :

I tested the patch myself (arm64 and armhf kernels on different Pi boards) and it worked well. With this patch, dwc_otg works even with HIGHMEM and VMSPLIT_3G enabled, so we can still use a single armhf kernel for Pi2/3/4 boards.

The testing kernel is at:
https://people.canonical.com/~hwang4/dwcotg/

And I also posted my patch to comment #22 of https://github.com/raspberrypi/linux/issues/3332

Hui Wang (hui.wang)
summary: IO errors when writing large amounts of data to USB storage in eoan on
- RPI
+ RPI2/3 (armhf kernel)
Hui Wang (hui.wang)
description: updated
Changed in linux-raspi2 (Ubuntu Eoan):
status: In Progress → Fix Committed
Hui Wang (hui.wang) wrote :

We are going to release a new kernel soon; right now it is in the proposed channel. You are welcome to test the new kernel:

Edit $rpi_rootfs/etc/apt/sources.list and add:
deb http://ports.ubuntu.com/ubuntu-ports eoan-proposed main restricted
deb http://ports.ubuntu.com/ubuntu-ports eoan-proposed universe
deb http://ports.ubuntu.com/ubuntu-ports eoan-proposed multiverse

boot the rpi board, then run:
sudo apt-get update
sudo apt install linux-image-5.3.0-1013-raspi2
sudo reboot

Then you can run the test with the proposed kernel.

Compared with the 1012 kernel, the 1013 kernel fixes at least these bugs:
https://bugs.launchpad.net/bugs/1850876
https://bugs.launchpad.net/bugs/1852510

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-raspi2 - 5.3.0-1014.16

---------------
linux-raspi2 (5.3.0-1014.16) eoan; urgency=medium

  * eoan/linux-raspi2: 5.3.0-1014.16 -proposed tracker (LP: #1854006)

  * Need to disable CONFIG_DRM_V3D in the raspi2 eoan kernel (LP: #1853789)
    - [config] raspi2: Revert "UBUNTU: [config] raspi2: CONFIG_DRM_V3D=m"

linux-raspi2 (5.3.0-1013.15) eoan; urgency=medium

  * eoan/linux-raspi2: 5.3.0-1013.15 -proposed tracker (LP: #1852220)

  * Eoan update: 5.3.9 upstream stable release (LP: #1851550)
    - raspi2: [Config] Remove CONFIG_GENERIC_COMPAT_VDSO and
      CONFIG_CROSS_COMPILE_COMPAT_VDSO

  * Eoan update: v5.3.8 upstream stable release (LP: #1850456)
    - raspi2: [Config] CAVIUM_TX2_ERRATUM_219=n

  * IO errors when writing large amounts of data to USB storage in eoan on
    RPI2/3 (armhf kernel) (LP: #1852510)
    - SAUCE: dwc_otg: checking the urb->transfer_buffer too early (#3332)

  * Incorrect raspi2 snapcraft.yaml file (LP: #1851469)
    - [Packaging] raspi2: Fix snapcraft.yaml

  * CONFIG_DRM_V3D is disabled for linux-raspi2 of eoan (LP: #1850876)
    - [config] raspi2: CONFIG_DRM_V3D=m

  [ Ubuntu: 5.3.0-24.26 ]

  * eoan/linux: 5.3.0-24.26 -proposed tracker (LP: #1852232)
  * Eoan update: 5.3.9 upstream stable release (LP: #1851550)
    - io_uring: fix up O_NONBLOCK handling for sockets
    - dm snapshot: introduce account_start_copy() and account_end_copy()
    - dm snapshot: rework COW throttling to fix deadlock
    - Btrfs: fix inode cache block reserve leak on failure to allocate data space
    - btrfs: qgroup: Always free PREALLOC META reserve in
      btrfs_delalloc_release_extents()
    - iio: adc: meson_saradc: Fix memory allocation order
    - iio: fix center temperature of bmc150-accel-core
    - libsubcmd: Make _FORTIFY_SOURCE defines dependent on the feature
    - perf tests: Avoid raising SEGV using an obvious NULL dereference
    - perf map: Fix overlapped map handling
    - perf script brstackinsn: Fix recovery from LBR/binary mismatch
    - perf jevents: Fix period for Intel fixed counters
    - perf tools: Propagate get_cpuid() error
    - perf annotate: Propagate perf_env__arch() error
    - perf annotate: Fix the signedness of failure returns
    - perf annotate: Propagate the symbol__annotate() error return
    - perf annotate: Fix arch specific ->init() failure errors
    - perf annotate: Return appropriate error code for allocation failures
    - perf annotate: Don't return -1 for error when doing BPF disassembly
    - staging: rtl8188eu: fix null dereference when kzalloc fails
    - RDMA/siw: Fix serialization issue in write_space()
    - RDMA/hfi1: Prevent memory leak in sdma_init
    - RDMA/iw_cxgb4: fix SRQ access from dump_qp()
    - RDMA/iwcm: Fix a lock inversion issue
    - HID: hyperv: Use in-place iterator API in the channel callback
    - kselftest: exclude failed TARGETS from runlist
    - selftests/kselftest/runner.sh: Add 45 second timeout per test
    - nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
    - arm64: cpufeature: Effectively expose FRINT capability to userspace
    - arm64: Fix incorrect irqflag restore for priority masking fo...

Changed in linux-raspi2 (Ubuntu Eoan):
status: Fix Committed → Fix Released
Changed in linux-raspi2 (Ubuntu):
status: New → Fix Released