oob_net0 file transfers can crash kernel

Bug #1928852 reported by David Thompson
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
linux-bluefield (Ubuntu)  Invalid       Undecided   Unassigned
  Focal                   Fix Released  Undecided   Unassigned

Bug Description

SRU Justification:

[Impact]
* Certain file transfers over the oob_net0 interface, which is
  managed by the mlxbf_gige driver, can fail in one of these ways:
  1) Transfer fails due to lost connection, e.g. SCP of a large (~1GB)
     file from a server into the BlueField-2 OOB interface can fail
     and return "lost connection" status
  2) Transfer fails due to kernel crash, e.g. issuing SCP on the BlueField-2
     platform to retrieve a file from a server and copy it into an
     NFS-mounted directory on the BlueField-2 platform can crash the
     Linux kernel with a page fault, logging this message:
     "Unable to handle kernel paging request at virtual address XXX"

[Fix]
* This delivery provides a set of changes to add stability to
the mlxbf_gige driver transmit and receive processing:

Changes to mlxbf_gige_rx_packet()
---------------------------------
1) Changed logic to remove the assumption that there's at
   least one packet to process. Instead, at the start of the
   routine, check the RX CQE polarity bit and exit if it is
   not the expected value.

2) Moved call to "dma_unmap_single()" to within the path
   where packet status is OK. Previously, if an errored
   packet was received, the SKB was unmapped but no new SKB
   was allocated to fill that same index.

3) Deferred call to "netif_receive_skb()" to the end of the
   routine since this call can trigger more processing, even
   packet transmissions, in the networking stack.
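
The polarity check in item 1 can be illustrated with a small userspace
model. This is only a sketch, not the mlxbf_gige code: the ring size, the
bit position of the valid flag, and every name below (rx_ring,
hw_post_completion, rx_poll_one) are hypothetical. It assumes the hardware
flips the valid bit it writes on each lap around the completion ring, so a
stale entry from the previous lap never matches the polarity the consumer
expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_ENTRIES     4u                   /* hypothetical ring size */
#define CQE_VALID_MASK (UINT64_C(1) << 11)  /* hypothetical bit position */

struct rx_ring {
    uint64_t cqe[RX_ENTRIES];  /* completion entries written by "hardware" */
    unsigned int ci;           /* free-running consumer index */
    bool valid_polarity;       /* valid-bit value expected on the current lap */
};

static void rx_ring_init(struct rx_ring *r)
{
    assert(r != NULL);
    for (size_t i = 0; i < RX_ENTRIES; i++)
        r->cqe[i] = 0;
    r->ci = 0;
    r->valid_polarity = true;  /* first lap: a set bit means "new entry" */
}

/* Stand-in for the hardware: completes the packet with producer index pi. */
static void hw_post_completion(struct rx_ring *r, unsigned int pi)
{
    unsigned int idx = pi % RX_ENTRIES;
    bool lap_polarity = ((pi / RX_ENTRIES) % 2u) == 0u;

    if (lap_polarity)
        r->cqe[idx] |= CQE_VALID_MASK;
    else
        r->cqe[idx] &= ~CQE_VALID_MASK;
}

/*
 * Consume at most one completion. Returns false without touching any
 * state when the entry at the consumer index does not carry the expected
 * polarity, i.e. there is no packet to process.
 */
static bool rx_poll_one(struct rx_ring *r)
{
    unsigned int idx = r->ci % RX_ENTRIES;
    bool valid = (r->cqe[idx] & CQE_VALID_MASK) != 0;

    if (valid != r->valid_polarity)
        return false;          /* nothing new: exit early */

    r->ci++;
    if (r->ci % RX_ENTRIES == 0)
        r->valid_polarity = !r->valid_polarity;  /* ring wrapped */
    return true;
}
```

Calling rx_poll_one() on a freshly initialised ring returns false, which is
the case the original logic did not guard against.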

Changes to mlxbf_gige_start_xmit()
----------------------------------
1) Added logic to drop oversized packets

2) Added logic to take a spin lock when accessing priv->tx_pi
   since this index is also accessed by the transmit
   completion logic.
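
The two transmit-side changes can be sketched in userspace as follows. This
is an illustration under stated assumptions, not the driver source: the
kernel spin lock is replaced by a C11 atomic_flag spin loop, and the ring
size, the MAX_PKT_LEN limit, and all names (tx_ring, tx_xmit, tx_complete)
are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_ENTRIES  4u     /* hypothetical ring size */
#define MAX_PKT_LEN 2048u  /* hypothetical per-packet size limit */

enum xmit_result { XMIT_OK, XMIT_DROPPED, XMIT_BUSY };

struct tx_ring {
    atomic_flag lock;       /* userspace stand-in for a kernel spin lock */
    unsigned int tx_pi;     /* producer index, shared with completion path */
    unsigned int tx_ci;     /* consumer index */
    unsigned long dropped;  /* count of oversized packets dropped */
};

static void tx_lock(struct tx_ring *t)
{
    while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
        ;  /* spin until the flag is released */
}

static void tx_unlock(struct tx_ring *t)
{
    atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

static void tx_ring_init(struct tx_ring *t)
{
    atomic_flag_clear(&t->lock);
    t->tx_pi = 0;
    t->tx_ci = 0;
    t->dropped = 0;
}

/* Transmit path: drop oversized packets, then claim a slot under the lock. */
static enum xmit_result tx_xmit(struct tx_ring *t, size_t len)
{
    enum xmit_result res = XMIT_OK;

    tx_lock(t);
    if (len > MAX_PKT_LEN) {
        t->dropped++;
        res = XMIT_DROPPED;
    } else if (t->tx_pi - t->tx_ci >= TX_ENTRIES) {
        res = XMIT_BUSY;  /* ring full, caller should retry later */
    } else {
        t->tx_pi++;
    }
    tx_unlock(t);
    return res;
}

/* Completion path: reclaim slots up to hw_ci, reading tx_pi under the lock. */
static unsigned int tx_complete(struct tx_ring *t, unsigned int hw_ci)
{
    unsigned int reclaimed = 0;

    tx_lock(t);
    while (t->tx_ci != hw_ci && t->tx_ci != t->tx_pi) {
        t->tx_ci++;
        reclaimed++;
    }
    tx_unlock(t);
    return reclaimed;
}
```

Because both paths take the same lock before touching tx_pi, the completion
path never sees a half-updated producer index in this model.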

Changes to mlxbf_gige_handle_tx_complete()
------------------------------------------
1) Added call to "mb()" to flush prev_tx_ci updates

[Test Case]
* #1 After booting platform, verify that file transfers of large files (~1GB) from a
  server into the BlueField-2 platform's /tmp directory over the oob_net0 interface succeed
* #2 Configure an NFS-mounted directory on the BlueField-2 platform and transfer
  large files over the oob_net0 interface into this directory. It is important to
  ensure that oob_net0 is used for the NFS mount and that no other active interface
  is involved. In the example below, <peer-ip> is the IP address of the
  server interface that is the peer to the BlueField-2 OOB.
   1) Configure NFS server on a remote server
   2) Configure NFS client on BlueField-2 platform
      a) mkdir /mnt/share
      b) mount -t nfs <peer-ip>:<nfs-server-mount> /mnt/share
   3) Exercise file transfers over oob_net0 interface
      a) cd /mnt/share
      b) scp <user>@<peer-ip>:/tmp/<large-file> <local-file>

[Regression Potential]
* These changes have been well tested, but there's a chance that certain file
  transfers could still experience problems (hung transfer, lost connection)

[Other]
* The mlxbf_gige driver will display v1.22 in modinfo after these changes.

Tim Gardner (timg-tpi)
Changed in linux-bluefield (Ubuntu):
status: New → Invalid
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Meriton Tuli (meriton) wrote :

Using kernel 5.4.0-1013-bluefield, this issue has been fixed.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-bluefield - 5.4.0-1013.16

---------------
linux-bluefield (5.4.0-1013.16) focal; urgency=medium

  * focal/linux-bluefield: 5.4.0-1013.16 -proposed tracker (LP: #1930009)

  * Automate soft reset of BlueField ARM via GPIO7 (LP: #1929736)
    - SAUCE: Automate soft reset of BlueField ARM via GPIO7

  * Remove dependency between module and driver (LP: #1927246)
    - net/sched: act_ct: Make tcf_ct_flow_table_restore_skb inline
    - netfilter: flowtable: Make nf_flow_table_offload_add/del_cb inline

  * Increase flow insertion rate by using rw lock instead of mutex on the flow
    block. (LP: #1927251)
    - netfilter: flowtable: Use rw sem as flow block lock
    - netfilter: flowtable: Free block_cb when being deleted

  * oob_net0 file transfers can crash kernel (LP: #1928852)
    - SAUCE: mlxbf_gige: syncup with v1.23 content

  * CT: Fix CT template allocation for zone 0 (LP: #1929460)
    - SAUCE: net/sched: act_ct: Fix ct template allocation for zone 0

  * CT: Offload connections with commit action (LP: #1929459)
    - SAUCE: net/sched: act_ct: Offload connections with commit action

  * CT: check offload bit on table dump (LP: #1929458)
    - SAUCE: netfilter: conntrack: Check offload bit on table dump

  * Memleak on restore flow when offloading conntrack. (LP: #1929844)
    - SAUCE: skbuff: Release nfct refcount on napi stolen or re-used skbs

  [ Ubuntu: 5.4.0-75.84 ]

  * focal/linux: 5.4.0-75.84 -proposed tracker (LP: #1930032)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * CVE-2021-33200
    - bpf: Wrap aux data inside bpf_sanitize_info container
    - bpf: Fix mask direction swap upon off reg sign change
    - bpf: No need to simulate speculative domain for immediates
  * Realtek USB hubs in Dell WD19SC/DC/TB fail to work after exiting s2idle
    (LP: #1928242)
    - USB: Verify the port status when timeout happens during port suspend
  * CVE-2020-26145
    - ath10k: drop fragments with multicast DA for SDIO
    - ath10k: add CCMP PN replay protection for fragmented frames for PCIe
    - ath10k: drop fragments with multicast DA for PCIe
  * CVE-2020-26141
    - ath10k: Fix TKIP Michael MIC verification for PCIe
  * CVE-2020-24588
    - mac80211: properly handle A-MSDUs that start with an RFC 1042 header
    - cfg80211: mitigate A-MSDU aggregation attacks
    - mac80211: drop A-MSDUs on old ciphers
    - ath10k: drop MPDU which has discard flag set by firmware for SDIO
  * CVE-2020-26139
    - mac80211: do not accept/forward invalid EAPOL frames
  * CVE-2020-24586 // CVE-2020-24587 // CVE-2020-24587 for such cases.
    - mac80211: extend protection against mixed key and fragment cache attacks
  * CVE-2020-24586 // CVE-2020-24587
    - mac80211: prevent mixed key and fragment cache attacks
    - mac80211: add fragment cache to sta_info
    - mac80211: check defrag PN against current frame
    - mac80211: prevent attacks on TKIP/WEP as well
  * CVE-2020-26147
    - mac80211: assure all fragments are encrypted
  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for subm...

Changed in linux-bluefield (Ubuntu Focal):
status: New → Fix Released