aws: network performance regression due to initial TCP receive buffer size change

Bug #1910200 reported by Andrea Righi on 2021-01-05
This bug affects 1 person
Affects              Importance   Assigned to
linux-aws (Ubuntu)   Undecided    Unassigned
  Bionic             High         Unassigned
  Focal              High         Unassigned
  Groovy             High         Unassigned

Bug Description

[Impact]

AWS has received customer reports of degraded networking performance after upgrading their Ubuntu instances. The regression hits customers using MTU=9000 (the default in EC2) particularly hard.

[Test case]

The bug was reproduced internally at AWS (no test case was provided), but it is reportedly very easy to reproduce simply by measuring networking performance.

[Fix]

AWS investigated internally and found that the regression was introduced by:

 a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")

To solve the problem, we need to apply the following upstream commit, which explicitly fixes the problem introduced by the commit above:

 33ae7b5bb841 ("tcp: select sane initial rcvq_space.space for big MSS")
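
To illustrate why a big MSS matters here, the following is a simplified model, not kernel code. It assumes that before the fix the initial rcvq_space.space was the smaller of the advertised receive window and TCP_INIT_CWND segments, and that the fix additionally caps it at the receive slow-start threshold. Both formulas and the byte values used are assumptions for illustration and should be checked against the upstream commit; the function and parameter names are hypothetical.

```python
# Simplified model of the initial rcvq_space.space selection.
# ASSUMPTION: both formulas are a reading of the upstream change, not
# verified kernel code; all byte values below are illustrative only.

TCP_INIT_CWND = 10  # initial congestion window, in segments


def rcvq_space_before_fix(rcv_wnd, advmss):
    """Assumed pre-fix behavior: bounded by the window and 10 segments."""
    return min(rcv_wnd, TCP_INIT_CWND * advmss)


def rcvq_space_after_fix(rcv_wnd, rcv_ssthresh, advmss):
    """Assumed post-fix behavior: also bounded by rcv_ssthresh."""
    return min(rcv_ssthresh, rcv_wnd, TCP_INIT_CWND * advmss)


if __name__ == "__main__":
    rcv_wnd = 62496       # illustrative initial receive window (~64KB)
    rcv_ssthresh = 56596  # illustrative threshold, smaller than rcv_wnd
    for advmss in (1460, 8960):  # MSS for MTU 1500 vs. roughly MTU 9000
        before = rcvq_space_before_fix(rcv_wnd, advmss)
        after = rcvq_space_after_fix(rcv_wnd, rcv_ssthresh, advmss)
        print(f"advmss={advmss}: before={before} after={after} "
              f"exceeds rcv_ssthresh before fix: {before > rcv_ssthresh}")
```

Under these assumptions the two formulas agree for a standard MSS, because ten segments is already the smallest term. With a jumbo-frame MSS, ten segments exceed the window, so the pre-fix estimate ends up above rcv_ssthresh, which is the situation the fix title describes as not "sane".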

[Regression potential]

The fix is an upstream commit that only affects the initial TCP receive buffer space and allows the TCP window size to be increased dynamically again, restoring the previous (correct) behavior, so the regression potential is minimal.

Andrea Righi (arighi) on 2021-01-05
Changed in linux-aws (Ubuntu Bionic):
importance: Undecided → High
Changed in linux-aws (Ubuntu Focal):
importance: Undecided → High
Changed in linux-aws (Ubuntu Groovy):
importance: Undecided → High
Changed in linux-aws (Ubuntu Bionic):
status: New → Fix Committed
Changed in linux-aws (Ubuntu Focal):
status: New → Fix Committed
Changed in linux-aws (Ubuntu Groovy):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 4.15.0-1093.99

---------------
linux-aws (4.15.0-1093.99) bionic; urgency=medium

  * bionic/linux-aws: 4.15.0-1093.99 -proposed tracker (LP: #1911275)

  * aws: network performance regression due to initial TCP receive buffer size
    change (LP: #1910200)
    - tcp: select sane initial rcvq_space.space for big MSS

  * arm64: prevent losing page dirty state (LP: #1908503)
    - arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()

  * Disable Atari partition support for cloud kernels (LP: #1908264)
    - [Config] Disable Atari partition support

  * aws: xen-netfront: prevent potential error on hibernate (LP: #1906850)
    - SAUCE: xen-netfront: prevent unnecessary close on hibernate

  [ Ubuntu: 4.15.0-133.137 ]

  * bionic/linux: 4.15.0-133.137 -proposed tracker (LP: #1911295)
  * [drm:qxl_enc_commit [qxl]] *ERROR* head number too large or missing monitors
    config: (LP: #1908219)
    - qxl: remove qxl_io_log()
    - qxl: move qxl_send_monitors_config()
    - qxl: hook monitors_config updates into crtc, not encoder.
  * Touchpad not detected on ByteSpeed C15B laptop (LP: #1906128)
    - Input: i8042 - add ByteSpeed touchpad to noloop table
  * vmx_nm_test in ubuntu_kvm_unit_tests interrupted on X-oracle-4.15 /
    B-oracle-4.15 / X-KVM / B-KVM (LP: #1872401)
    - KVM: nVMX: Always reflect #NM VM-exits to L1
  * stack trace in kernel (LP: #1903596)
    - net: napi: remove useless stack trace
  * CVE-2020-27777
    - [Config]: Set CONFIG_PPC_RTAS_FILTER
  * Bionic update: upstream stable patchset 2020-12-04 (LP: #1906875)
    - regulator: defer probe when trying to get voltage from unresolved supply
    - ring-buffer: Fix recursion protection transitions between interrupt context
    - time: Prevent undefined behaviour in timespec64_to_ns()
    - nbd: don't update block size after device is started
    - btrfs: sysfs: init devices outside of the chunk_mutex
    - btrfs: reschedule when cloning lots of extents
    - genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
    - hv_balloon: disable warning when floor reached
    - net: xfrm: fix a race condition during allocing spi
    - perf tools: Add missing swap for ino_generation
    - ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link()
    - can: rx-offload: don't call kfree_skb() from IRQ context
    - can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ
      context
    - can: dev: __can_get_echo_skb(): fix real payload length return value for RTR
      frames
    - can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()
    - can: peak_usb: add range checking in decode operations
    - can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
    - can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is
      on
    - xfs: flush new eof page on truncate to avoid post-eof corruption
    - Btrfs: fix missing error return if writeback for extent buffer never started
    - ath9k_htc: Use appropriate rs_datalen type
    - usb: gadget: goku_udc: fix potential crashes in probe
    - gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-free
  ...


Changed in linux-aws (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.4.0-1037.39

---------------
linux-aws (5.4.0-1037.39) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1037.39 -proposed tracker (LP: #1911314)

  * aws: network performance regression due to initial TCP receive buffer size
    change (LP: #1910200)
    - tcp: select sane initial rcvq_space.space for big MSS

  * Disable Atari partition support for linux-aws (LP: #1908264)
    - [Config] Disable Atari partition support

  * aws: xen-netfront: prevent potential error on hibernate (LP: #1906850)
    - SAUCE: xen-netfront: prevent unnecessary close on hibernate

  [ Ubuntu: 5.4.0-63.71 ]

  * focal/linux: 5.4.0-63.71 -proposed tracker (LP: #1911333)
  * overlay: permission regression in 5.4.0-51.56 due to patches related to
    CVE-2020-16120 (LP: #1900141)
    - ovl: do not fail because of O_NOATIME
  * Focal update: v5.4.79 upstream stable release (LP: #1907151)
    - net/mlx5: Use async EQ setup cleanup helpers for multiple EQs
    - net/mlx5: poll cmd EQ in case of command timeout
    - net/mlx5: Fix a race when moving command interface to events mode
    - net/mlx5: Add retry mechanism to the command entry index allocation
  * Kernel 5.4.0-56 Wi-Fi does not connect (LP: #1906770)
    - mt76: fix fix ampdu locking
  * [Ubuntu 21.04 FEAT] mpt3sas: Request to include the patch set which supports
    topology where zoning is enabled in expander (LP: #1899802)
    - scsi: mpt3sas: Define hba_port structure
    - scsi: mpt3sas: Allocate memory for hba_port objects
    - scsi: mpt3sas: Rearrange _scsih_mark_responding_sas_device()
    - scsi: mpt3sas: Update hba_port's sas_address & phy_mask
    - scsi: mpt3sas: Get device objects using sas_address & portID
    - scsi: mpt3sas: Rename transport_del_phy_from_an_existing_port()
    - scsi: mpt3sas: Get sas_device objects using device's rphy
    - scsi: mpt3sas: Update hba_port objects after host reset
    - scsi: mpt3sas: Set valid PhysicalPort in SMPPassThrough
    - scsi: mpt3sas: Handling HBA vSES device
    - scsi: mpt3sas: Add bypass_dirty_port_flag parameter
    - scsi: mpt3sas: Handle vSES vphy object during HBA reset
    - scsi: mpt3sas: Add module parameter multipath_on_hba
    - scsi: mpt3sas: Bump driver version to 35.101.00.00

  [ Ubuntu: 5.4.0-62.70 ]

  * focal/linux: 5.4.0-62.70 -proposed tracker (LP: #1911144)
  * CVE-2020-28374
    - SAUCE: target: fix XCOPY NAA identifier lookup
  * Packaging resync (LP: #1786013)
    - update dkms package versions

 -- Kelsey Skunberg <email address hidden> Wed, 13 Jan 2021 19:01:10 -0700

Changed in linux-aws (Ubuntu Focal):
status: Fix Committed → Fix Released
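
The two release notices above give the first fixed linux-aws versions: 4.15.0-1093.99 for Bionic and 5.4.0-1037.39 for Focal. A minimal sketch for checking whether a running kernel is at or past those versions follows. It assumes the kernel release string has the form `<base>-<abi>-aws` (for example `5.4.0-1037-aws`); the helper names are hypothetical, and any other series or kernel flavor is reported as unknown rather than guessed.

```python
import platform
import re

# First fixed ABI number per base version, taken from the notices above.
FIRST_FIXED_ABI = {
    "4.15.0": 1093,  # bionic: linux-aws 4.15.0-1093.99
    "5.4.0": 1037,   # focal:  linux-aws 5.4.0-1037.39
}


def parse_aws_release(release):
    """Split '5.4.0-1037-aws' into ('5.4.0', 1037); None if not matching."""
    match = re.fullmatch(r"(\d+\.\d+\.\d+)-(\d+)-aws", release)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def has_fix(release):
    """True/False for known linux-aws series, None when unknown."""
    parsed = parse_aws_release(release)
    if parsed is None:
        return None  # not a linux-aws style release string
    base, abi = parsed
    first_fixed = FIRST_FIXED_ABI.get(base)
    if first_fixed is None:
        return None  # series not covered by the notices above
    return abi >= first_fixed


if __name__ == "__main__":
    running = platform.release()
    print(f"{running}: fix present = {has_fix(running)}")
```

The check compares only the ABI number, which is enough to order linux-aws builds within one base version; it does not inspect the upload revision (the `.99` or `.39` suffix).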