Bug #1796469 “aws s3 cp --recursive hangs on the last file on a ...” : Bugs : linux-aws package : Ubuntu

Revision history for this message

Joshua Edwards (josh.edwards) wrote on 2018-10-12:

#1

Does anyone have any more information about this?

Revision history for this message

Eli S. Denman (elisd) wrote on 2018-10-12:

#2

Not sure if it's related but I'll leave this here.

We have a bunch of EC2 instances running Ubuntu 18.04 LTS and GitLab Runner, with the latter configured to execute build jobs inside Docker containers. Our EC2 instances are of the t2.medium, t2.xlarge and t3.medium sizes and run in the eu-west-1 region (in different AZs).

After upgrading from kernel 4.15.0-1021-aws to 4.15.0-1023-aws we've seen a lot of build failures across all of our GitLab Runner instances on Ubuntu 18.04 LTS. The problems happen intermittently.

Specifically the build jobs fail during the startup phase where GitLab Runner clones the git repository. The git clone operation fails with:

```
Cloning repository...
Cloning into '/builds/my-group/my-project'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
```

According to the documentation at https://curl.haxx.se/libcurl/c/libcurl-errors.html this means:

> CURLE_PARTIAL_FILE (18)
>
> A file transfer was shorter or larger than expected. This happens when the server first reports an expected transfer size, and then delivers data that doesn't match the previously given size.

When this problem happened it seems that the bandwidth used by the `git clone` operation eventually dropped to a few or zero kByte/s (the reporting wasn't granular enough to determine which). I assume that the GitLab server later closed/timed out the connection before the git client had downloaded all of the data it expected, leaving us with a `CURLE_PARTIAL_FILE` error.

We've also seen larger (10s-100s of MBs) file downloads hang indefinitely (e.g. apt install).

We've rolled back changes made during the last few weeks one by one and eventually found that downgrading to 4.15.0-1021-aws solved the issue on all of our GitLab Runner instances (ranging from 11.2.0, 11.2.1 and 11.3.1).

I couldn't spot anything obvious that seemed to be related to the issue in the changelog here:
https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-aws/linux-aws_4.15.0-1023.23/changelog

FWIW, this is what `docker version` reports:

```
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:56 2018
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
  Version: 18.06.1-ce
  API version: 1.38 (minimum version 1.12)
  Go version: go1.10.3
  Git commit: e68fc7a
  Built: Tue Aug 21 17:23:21 2018
  OS/Arch: linux/amd64
  Experimental: false
```

Not sure if it's related but I'll leave this here.

We have a bunch of EC2 instances running Ubuntu 18.04 LTS and GitLab Runner, with the latter configured to execute build jobs inside Docker containers.  Our EC2 instances are of the t2.medium, t2.xlarge and t3.medium sizes and run in the eu-west-1 region (in different AZs).

After upgrading from kernel 4.15.0-1021-aws to 4.15.0-1023-aws we've seen a lot of build failures across all of our GitLab Runner instances on Ubuntu 18.04 LTS.  The problems happen intermittently.

Specifically the build jobs fail during the startup phase where GitLab Runner clones the git repository.  The git clone operation fails with:

```
Cloning repository...
Cloning into '/builds/my-group/my-project'...
error: RPC failed; curl 18 transfer closed with outstanding read data remaining
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
```

According to the documentation at https://curl.haxx.se/libcurl/c/libcurl-errors.html this means:

> CURLE_PARTIAL_FILE (18)
> 
> A file transfer was shorter or larger than expected. This happens when the server first reports an expected transfer size, and then delivers data that doesn't match the previously given size.

When this problem happened it seems that the bandwidth used by the `git clone` operation eventually dropped to a few or zero kByte/s (the reporting wasn't granular enough to determine which).  I assume that the GitLab server later closed/timed out the connection before the git client had downloaded all of the data it expected, leaving us with a `CURLE_PARTIAL_FILE` error.

We've also seen larger (10s-100s of MBs) file downloads hang indefinitely (e.g. apt install).

We've rolled back changes made during the last few weeks one by one and eventually found that downgrading to 4.15.0-1021-aws solved the issue on all of our GitLab Runner instances (ranging from 11.2.0, 11.2.1 and 11.3.1).

I couldn't spot anything obvious that seemed to be related to the issue in the changelog here:
https://changelogs.ubuntu.com/changelogs/pool/main/l/linux-aws/linux-aws_4.15.0-1023.23/changelog

FWIW, this is what `docker version` reports:

```
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false
```

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-10-15:

#3

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu):
status:	New → Confirmed

Revision history for this message

Joshua Edwards (josh.edwards) wrote on 2018-10-15:

#4

For anyone using ansible this is how I rolled back to version 21

- name: Removing awscli
   pip:
     name: awscli
     state: absent

- name: Wait for sudo
become: yes
shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: remove aws kernel version
   become: yes
   apt:
     name: "{{ item }}"
     state: absent
   with_items:
     - linux-aws
     - linux-image-aws
     - linux-headers-aws

- name: Wait for sudo
become: yes
shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721541/linux-image-aws_4.15.0.1021.21_amd64.deb

- name: Wait for sudo
become: yes
shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel dependency
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721540/linux-headers-aws_4.15.0.1021.21_amd64.deb

- name: Wait for sudo
become: yes
shell: while sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; do sleep 1; done;

- name: Pin aws kernel version
   become: yes
   apt:
     deb: http://launchpadlibrarian.net/385721538/linux-aws_4.15.0.1021.21_amd64.deb

- name: Installing awscli
pip:
name: awscli

Revision history for this message

Tomasz Ostrowski (tometzky+ubuntu) wrote on 2018-10-19:

#5

Download full text (5.1 KiB)

The problem affects also transfers from RDS instances - it prevents for example creating an export file using pg_dump from an RDS Postgres database.

The problem seems to be caused by a sudden collapse of downloader TCP receive window to "1" (usually with scale 7, so 1*2^7=128 bytes) after transferring several tens of GB of data over a single connection. The TCP receive window never recovers.

Analysis of a single stalled transfer from RDS PostgreSQL:

tcpdump -nn -vvv -r pcap --dont-verify-checksums; done | xz -1 > pcap.txt.xz

$ xzgrep wscale pcap.txt.xz
10.16.14.237.33578 > 10.16.10.102.5432: Flags [S], seq 3273208485, win 26883, options [mss 8961,sackOK,TS val 1284903196 ecr 0,nop,wscale 7], length 0
10.16.10.102.5432 > 10.16.14.237.33578: Flags [S.], seq 2908863056, ack 3273208486, win 28960, options [mss 1460,sackOK,TS val 120076048 ecr 1284903196,nop,wscale 10], length 0

So the window scale for this TCP connection is 7, so each "win N" values have to be multiplied by 2^7=128.

The window size TCP parameter at the beginning of the file is for example 852, which means after scaling 852*2^7=109056 bytes
$ xzgrep '10.16.14.237.33578' pcap.txt.xz | head -100000 | tail -1
10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2256009, ack 201198741, win 852, options [nop,nop,TS val 1284911021 ecr 120078004], length 0

But near the end of the file it is 1, which means after scaling 1*2^7=128 bytes:
$ xzcat pcap.txt.xz | tail -1000 | grep '10.16.14.237.33578' | head -1
10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538664, win 1, options [nop,nop,TS val 1286238401 ecr 120409852], length 0

And indeed the RDS server is sending 128 bytes of data and waiting for confirmation before sending another 128:
13:47:27.170534 IP (tos 0x0, ttl 255, id 11479, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538664:3742538792, ack 2266022, win 174, options [nop,nop,TS val 120409903 ecr 1286238401], length 128
13:47:27.170542 IP (tos 0x0, ttl 64, id 28256, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538792, win 1, options [nop,nop,TS val 1286238605 ecr 120409903], length 0
13:47:27.374539 IP (tos 0x0, ttl 255, id 11480, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538792:3742538920, ack 2266022, win 174, options [nop,nop,TS val 120409954 ecr 1286238605], length 128

The switch from "win 600" (76800 bytes) to "win 1" (128 bytes) is sudden and I have no idea what could have caused it:
13:33:26.230782 IP (tos 0x0, ttl 64, id 24124, offset 0, flags [DF], proto TCP (6), length 52)
10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3741933728, win 600, options [nop,nop,TS val 1285397677 ecr 120199669], length 0
13:33:26.230868 IP (tos 0x0, ttl 255, id 7295, offset 0, flags [DF], proto TCP (6), length 4396)
10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741933728:3741938072, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.230918 IP (tos...

The problem affects also transfers from RDS instances - it prevents for example creating an export file using pg_dump from an RDS Postgres database.

The problem seems to be caused by a sudden collapse of downloader TCP receive window to "1" (usually with scale 7, so 1*2^7=128 bytes) after transferring several tens of GB of data over a single connection. The TCP receive window never recovers.

Analysis of a single stalled transfer from RDS PostgreSQL:

tcpdump -nn -vvv -r pcap --dont-verify-checksums; done | xz -1 > pcap.txt.xz

$ xzgrep wscale pcap.txt.xz 
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [S], seq 3273208485, win 26883, options [mss 8961,sackOK,TS val 1284903196 ecr 0,nop,wscale 7], length 0
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [S.], seq 2908863056, ack 3273208486, win 28960, options [mss 1460,sackOK,TS val 120076048 ecr 1284903196,nop,wscale 10], length 0

So the window scale for this TCP connection is 7, so each "win N" values have to be multiplied by 2^7=128.

The window size TCP parameter at the beginning of the file is for example 852, which means after scaling 852*2^7=109056 bytes
$ xzgrep '10.16.14.237.33578' pcap.txt.xz | head -100000 | tail -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2256009, ack 201198741, win 852, options [nop,nop,TS val 1284911021 ecr 120078004], length 0

But near the end of the file it is 1, which means after scaling 1*2^7=128 bytes:
$ xzcat pcap.txt.xz | tail -1000 | grep '10.16.14.237.33578' | head -1
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538664, win 1, options [nop,nop,TS val 1286238401 ecr 120409852], length 0

And indeed the RDS server is sending 128 bytes of data and waiting for confirmation before sending another 128:
13:47:27.170534 IP (tos 0x0, ttl 255, id 11479, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538664:3742538792, ack 2266022, win 174, options [nop,nop,TS val 120409903 ecr 1286238401], length 128
13:47:27.170542 IP (tos 0x0, ttl 64, id 28256, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742538792, win 1, options [nop,nop,TS val 1286238605 ecr 120409903], length 0
13:47:27.374539 IP (tos 0x0, ttl 255, id 11480, offset 0, flags [DF], proto TCP (6), length 180)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3742538792:3742538920, ack 2266022, win 174, options [nop,nop,TS val 120409954 ecr 1286238605], length 128

The switch from "win 600" (76800 bytes) to "win 1" (128 bytes) is sudden and I have no idea what could have caused it:
13:33:26.230782 IP (tos 0x0, ttl 64, id 24124, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3741933728, win 600, options [nop,nop,TS val 1285397677 ecr 120199669], length 0
13:33:26.230868 IP (tos 0x0, ttl 255, id 7295, offset 0, flags [DF], proto TCP (6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741933728:3741938072, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.230918 IP (tos 0x0, ttl 255, id 7298, offset 0, flags [DF], proto TCP (6), length 43492)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741938072:3741981512, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 43440
13:33:26.230932 IP (tos 0x0, ttl 255, id 7328, offset 0, flags [DF], proto TCP (6), length 13708)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3741981512:3741995168, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 13656
13:33:26.230948 IP (tos 0x0, ttl 255, id 7338, offset 0, flags [DF], proto TCP (6), length 2948)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3741995168:3741998064, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 2896
13:33:26.230969 IP (tos 0x0, ttl 255, id 7340, offset 0, flags [DF], proto TCP (6), length 5348)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [P.], seq 3741998064:3742003360, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 5296
13:33:26.231759 IP (tos 0x0, ttl 255, id 7344, offset 0, flags [DF], proto TCP (6), length 4396)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3742003360:3742007704, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 4344
13:33:26.231775 IP (tos 0x0, ttl 255, id 7347, offset 0, flags [DF], proto TCP (6), length 2876)
    10.16.10.102.5432 > 10.16.14.237.33578: Flags [.], seq 3742007704:3742010528, ack 2266022, win 174, options [nop,nop,TS val 120199669 ecr 1285397677], length 2824
13:33:26.233238 IP (tos 0x0, ttl 64, id 24125, offset 0, flags [DF], proto TCP (6), length 52)
    10.16.14.237.33578 > 10.16.10.102.5432: Flags [.], seq 2266022, ack 3742010528, win 1, options [nop,nop,TS val 1285397679 ecr 120199669], length 0

----

There's one changelog entry which seems to touch TCP receive window calculations: tcp: "avoid integer overflows in tcp_rcv_space_adjust()". I don't know if this is the change which caused this.

Revision history for this message

Alex Khomenko (akhomenko) wrote on 2018-10-23:

#6

We ran into the exact same issue with pg_dump after upgrading to Bionic (4.15.0-1023-aws kernel) on r5.xlarge (a next-gen Nitro instance). After 16-80 MB of data is received, tcpdump shows the receive window collapsing to 1 (not quite suddenly, but quickly), and it never recovers. We have engaged with AWS support on this issue, and they referred us to this bug.

Other than downgrading the kernel, has anyone discovered any other workarounds?

Revision history for this message

Alex Khomenko (akhomenko) wrote on 2018-10-24:

#7

Trying it again with the newly posted 4.15.0-1025-aws kernel, we can no longer reproduce the issue with pg_dump.

Revision history for this message

Eli S. Denman (elisd) wrote on 2018-10-25:

#8

Can also confirm that the 4.15.0-1025-aws kernel appears to have solved the timeout issues we were seeing with `git clone` processes.

Ubuntu
linux-aws package

aws s3 cp --recursive hangs on the last file on a large file transfer to instance

Bug Description

Other bug subscribers

Remote bug watches

Ubuntulinux-aws package

aws s3 cp --recursive hangs on the last file on a large file transfer to instance

Bug Description

Other bug subscribers

Remote bug watches

Ubuntu
linux-aws package