aws s3 cp --recursive hangs on the last file on a large file transfer to instance

Bug #1796469 reported by Joshua Edwards on 2018-10-06
34
This bug affects 6 people
Affects Status Importance Assigned to Milestone
linux-aws (Ubuntu)
Undecided
Unassigned

Bug Description

aws s3 cp --recursive hangs on the last file on a large transfer to an instance

I have confirmed that this works on version Linux/4.15.0-1021-aws

aws cli version
aws-cli/1.16.23 Python/2.7.15rc1 Linux/4.15.0-1023-aws botocore/1.12.13

Ubuntu version
Description: Ubuntu 18.04.1 LTS
Release: 18.04
eu-west-1 - ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180912 - ami-00035f41c82244dab

Package version
linux-aws:
  Installed: 4.15.0.1023.23
  Candidate: 4.15.0.1023.23
  Version table:
 *** 4.15.0.1023.23 500
        500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
        100 /var/lib/dpkg/status
     4.15.0.1007.7 500
        500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Joshua Edwards (josh.edwards) wrote :

Does anyone have any more information about this?

Eli S. Denman (elisd) wrote :

Not sure if it's related but I'll leave this here.

We have a bunch of EC2 instances running Ubuntu 18.04 LTS and GitLab Runner, with the latter configured to execute build jobs inside Docker containers. Our EC2 instances are of the t2.medium, t2.xlarge and t3.medium sizes and run in the eu-west-1 region (in different AZs).

After upgrading from kernel 4.15.0-1021-aws to 4.15.0-1023-aws we've seen a lot of build failures across all of our GitLab Runner instances on Ubuntu 18.04 LTS. The problems happen intermittently.

Specifically the build jobs fail during the startup phase where GitLab Runner clones the git repository. The git clone operation fails with:

```
Cloning repository...
Cloning into '/builds/my-group/my-project'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
```

According to the documentation at https://curl.haxx.se/libcurl/c/libcurl-errors.html this means:

> CURLE_PARTIAL_FILE (18)
>
> A file transfer was shorter or larger than expected. This happens when the server first reports an expected transfer size, and then delivers data that doesn't match the previously given size.

When this problem happened it seems that the bandwidth used by the `git clone` operation eventually dropped to a few or zero kByte/s (the reporting wasn't granular enough to determine which). I assume that the GitLab server later closed/timed out the connection before the git client had downloaded all of the data it expected, leaving us with a `CURLE_PARTIAL_FILE` error.

We've also seen larger (10s-100s of MBs) file downloads hang indefinitely (e.g. apt install).

We've rolled back changes made during the last few weeks one by one and eventually found that downgrading to 4.15.0-1021-aws solved the issue on all of our GitLab Runner instances (ranging from 11.2.0, 11.2.1 and 11.3.1).

I couldn't spot anything obvious that seemed to be related to the issue in the changelog here:
https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-aws/linux-aws_4.15.0-1023.23/changelog

FWIW, this is what `docker version` reports:

```
Client:
 Version: 18.06.1-ce
 API version: 1.38
 Go version: go1.10.3
 Git commit: e68fc7a
 Built: Tue Aug 21 17:24:56 2018
 OS/Arch: linux/amd64
 Experimental: false

Server:
 Engine:
  Version: 18.06.1-ce
  API version: 1.38 (minimum version 1.12)
  Go version: go1.10.3
  Git commit: e68fc7a
  Built: Tue Aug 21 17:23:21 2018
  OS/Arch: linux/amd64
  Experimental: false
```

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu):
status: New → Confirmed
Joshua Edwards (josh.edwards) wrote :

For anyone using ansible this is how I rolled back to version 21

 - name: Removing awscli
   pip:
     name: awscli
     state: absent

 - name: Wait for sudo
   become: yes
   shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

 - name: remove aws kernel version
   become: yes
   apt:
     name: "{{ item }}"
     state: absent
   with_items:
     - linux-aws
     - linux-image-aws
     - linux-headers-aws

 - name: Wait for sudo
   become: yes
   shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

 - name: Pin aws kernel
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721541/linux-image-aws_4.15.0.1021.21_amd64.deb

 - name: Wait for sudo
   become: yes
   shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

 - name: Pin aws kernel dependency
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721540/linux-headers-aws_4.15.0.1021.21_amd64.deb

 - name: Wait for sudo
   become: yes
   shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

 - name: Pin aws kernel version
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721538/linux-aws_4.15.0.1021.21_amd64.deb

 - name: Installing awscli
   pip:
     name: awscli

Download full text (5.1 KiB)

The problem affects also transfers from RDS instances - it prevents for example creating an export file using pg_dump from an RDS Postgres database.

The problem seems to be caused by a sudden collapse of downloader TCP receive window to "1" (usually with scale 7, so 1*2^7=128 bytes) after transferring several tens of GB of data over a single connection. The TCP receive window never recovers.

Analysis of a single stalled transfer from RDS PostgreSQL:

tcpdump -nn -vvv -r pcap --dont-verify-checksums; done | xz -1 > pcap.txt.xz

$ xzgrep wscale pcap.txt.xz
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [S], seq 3273208485, win 26883, options [mss 8961,sackOK,TS val 1284903196 ecr 0,nop,wscale 7], length 0
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [S.], seq 2908863056, ack 3273208486, win 28960, options [mss 1460,sackOK,TS val 120076048 ecr 1284903196,nop,wscale 10], length 0

So the window scale for this TCP connection is 7, so each "win N" values have to be multiplied by 2^7=128.

The window size TCP parameter at the beginning of the file is for example 852, which means after scaling 852*2^7=109056 bytes
$ xzgrep '10.16.14.237.33578' pcap.txt.xz | head -100000 | tail -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2256009, ack 201198741, win 852, options [nop,nop,TS val 1284911021 ecr 120078004], length 0

But near the end of the file it is 1, which means after scaling 1*2^7=128 bytes:
$ xzcat pcap.txt.xz | tail -1000 | grep '10.16.14.237.33578' | head -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538664, win 1, options [nop,nop,TS val 1286238401 ecr 120409852], length 0

And indeed the RDS server is sending 128 bytes of data and waiting for confirmation before sending another 128:
13:47:27.170534 IP (tos 0x0, ttl 255, id 11479, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538664:3742538792, ack 2266022, win 174, options [nop,nop,TS val 120409903 ecr 1286238401], length 128
13:47:27.170542 IP (tos 0x0, ttl 64, id 28256, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538792, win 1, options [nop,nop,TS val 1286238605 ecr 120409903], length 0
13:47:27.374539 IP (tos 0x0, ttl 255, id 11480, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538792:3742538920, ack 2266022, win 174, options [nop,nop,TS val 120409954 ecr 1286238605], length 128

The switch from "win 600" (76800 bytes) to "win 1" (128 bytes) is sudden and I have no idea what could have caused it:
13:33:26.230782 IP (tos 0x0, ttl 64, id 24124, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3741933728, win 600, options [nop,nop,TS val 1285397677 ecr 120199669], length 0
13:33:26.230868 IP (tos 0x0, ttl 255, id 7295, offset 0, flags [DF], proto TCP (6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741933728:3741938072, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.230918 IP (tos...

Read more...

Alex Khomenko (akhomenko) wrote :

We ran into the exact same issue with pg_dump after upgrading to Bionic (4.15.0-1023-aws kernel) on r5.xlarge (a next-gen Nitro instance). After 16-80 MB of data is received, tcpdump shows the receive window collapsing to 1 (not quite suddenly, but quickly), and it never recovers. We have engaged with AWS support on this issue, and they referred us to this bug.

Other than downgrading the kernel, has anyone discovered any other workarounds?

Alex Khomenko (akhomenko) wrote :

Trying it again with the newly posted 4.15.0-1025-aws kernel, we can no longer reproduce the issue with pg_dump.

Eli S. Denman (elisd) wrote :

Can also confirm that the 4.15.0-1025-aws kernel appears to have solved the timeout issues we were seeing with `git clone` processes.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers