occasional hanging 'apt-get update' from daily cronjob since Jammy 22.04

Bug #2003851 reported by Walter
This bug affects 17 people
Affects:     apt (Ubuntu)
Status:      Confirmed
Importance:  Undecided
Assigned to: Unassigned
Milestone:   (none)

Bug Description

Hi!

Yesterday I spotted several machines of ours where a periodic `apt-get update` was stalled. The `http` children were hanging in `WaitFd` (waiting for parent instructions/queue). The parent was looping in `AcquireUpdate` every 500ms.

We have a cronjob that runs every few hours which calls `apt-get update` and does some post-processing. We noticed that several of them had stalled at some point in time. Killing the parent (apt-get) got it unstuck, removing the locks.

Example:
```
# apt-get update
Reading package lists... Done
E: Could not get lock /var/lib/apt/lists/lock. It is held by process 154026 (apt-get)
N: Be aware that removing the lock file is not a solution and may break your system.
E: Unable to lock directory /var/lib/apt/lists/
```

Task listing:
```
root 153929 \_ /usr/sbin/CRON -f -P
root 153942 \_ /bin/sh -c [ -x /etc/zabbix/scripts/dpkg.updates ] && /etc/zabbix/scripts/dpkg.updates --cron
root 153943 \_ /bin/sh /etc/zabbix/scripts/dpkg.updates --cron
root 154026 \_ apt-get update
_apt 154029 \_ /usr/lib/apt/methods/http
_apt 154030 \_ /usr/lib/apt/methods/http
_apt 154031 \_ /usr/lib/apt/methods/http
_apt 154033 \_ /usr/lib/apt/methods/gpgv
```
Open (TCP) sockets. All have 1 item in the Recv-Q (probably a FIN or RST?):
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp 1 0 10.x.x.x:60868 217.x.x.x:80 CLOSE_WAIT 154030/http
tcp 1 0 10.x.x.x:40756 178.x.x.x:80 CLOSE_WAIT 154029/http
tcp 1 0 10.x.x.x:56818 185.x.x.x:80 CLOSE_WAIT 154031/http
```
All children (including gpgv) were waiting using pselect6(1, [0], NULL, NULL, NULL, NULL).

The parent (apt-get) was waiting using pselect6(10, [5 6 7 9], [], NULL, {tv_sec=0, tv_nsec=500000000}, NULL).

The http sockets in the children were at fd=3.

Parent lsof:
```
# lsof -p 154026 +E
...
apt-get 154026 root 4uW REG 8,1 0 262281 /var/lib/apt/lists/lock
apt-get 154026 root 5r FIFO 0,13 0t0 4015176 pipe 154029,http,1w
apt-get 154026 root 6r FIFO 0,13 0t0 4012448 pipe 154030,http,1w
apt-get 154026 root 7r FIFO 0,13 0t0 4015192 pipe 154031,http,1w
apt-get 154026 root 8w FIFO 0,13 0t0 4015177 pipe 154029,http,0r
apt-get 154026 root 9r FIFO 0,13 0t0 4015233 pipe 154033,gpgv,1w
apt-get 154026 root 10w FIFO 0,13 0t0 4012449 pipe 154030,http,0r
apt-get 154026 root 12w FIFO 0,13 0t0 4015193 pipe 154031,http,0r
apt-get 154026 root 14w FIFO 0,13 0t0 4015234 pipe 154033,gpgv,0r
http 154029 _apt 0r FIFO 0,13 0t0 4015177 pipe 154026,apt-get,8w
http 154029 _apt 1w FIFO 0,13 0t0 4015176 pipe 154026,apt-get,5r
http 154030 _apt 0r FIFO 0,13 0t0 4012449 pipe 154026,apt-get,10w
http 154030 _apt 1w FIFO 0,13 0t0 4012448 pipe 154026,apt-get,6r
http 154031 _apt 0r FIFO 0,13 0t0 4015193 pipe 154026,apt-get,12w
http 154031 _apt 1w FIFO 0,13 0t0 4015192 pipe 154026,apt-get,7r
gpgv 154033 _apt 0r FIFO 0,13 0t0 4015234 pipe 154026,apt-get,14w
gpgv 154033 _apt 1w FIFO 0,13 0t0 4015233 pipe 154026,apt-get,9r
```
So:
- apt-get is waiting for any data written by any of its four children (at fd 5/6/7/9)
- http and gpgv are waiting for any data written by their parent (at their respective fd 0)

Parent backtrace:
```
#0 0x00007f420116a74d in select ()
   from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f420153fb5d in pkgAcquire::Run(int) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00007f420161d535 in AcquireUpdate(pkgAcquire&, int, bool, bool) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#3 0x00007f420161d986 in ListUpdate(pkgAcquireStatus&, pkgSourceList&, int) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#4 0x00007f42016d127b in DoUpdate (CmdL=...)
    at ./apt-private/private-update.cc:73
#5 0x00007f420156d73f in CommandLine::DispatchArg(CommandLine::Dispatch const*, bool) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#6 0x00007f420169fa97 in DispatchCommandLine (CmdL=...,
    Cmds=std::vector of length 27, capacity 32 = {...})
    at ./apt-private/private-cmndline.cc:588
#7 0x0000561fc06bafbd in main (argc=2, argv=0x7ffc0e14aa38)
    at ./cmdline/apt-get.cc:447
```
Child backtrace:
```
#0 0x00007f58b1c9b74d in select ()
   from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f58b237cb43 in WaitFd(int, bool, unsigned long) ()
   from /lib/x86_64-linux-gnu/libapt-pkg.so.6.0
#2 0x00005643e37bc7b3 in BaseHttpMethod::Loop (
    this=0x7ffdeaed56c0) at ./methods/basehttp.cc:611
#3 main (argv=<optimized out>) at ./methods/http.cc:1052
```
Where this is:
```
(gdb) print *this
$1 = {<aptAuthConfMethod> = {<aptMethod> = {<pkgAcqMethod> = {<No data fields>}, <aptConfigWrapperForMethods> = {
        methodNames = std::vector of length 2, capacity 2 = {
          "http", "http::XXXXXX.nl"}}, Binary = "http",
      SeccompFlags = 6},
    authconfs = std::vector of length 1, capacity 1 = {
      std::unique_ptr<FileFd> = {get() = 0x5643e5343100}}},
  Server = std::unique_ptr<ServerState> = {
    get() = 0x5643e534fba0}, NextURI = "",
  AllowRedirect = true, Debug = false, PipelineDepth = 10,
  static FailFile = {static npos = 18446744073709551615,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
      _M_p = 0x5643e534d650 "/var/lib/apt/lists/partial/apt.osso.nl_ubuntu-security_dists_jammy-security_InRelease"},
    _M_string_length = 85, {
      _M_local_buf = "\226", '\000' <repeats 14 times>,
      _M_allocated_capacity = 150}}, static FailFd = -1,
  static FailTime = 1674071760}
```

Relevant code in parent:
```
pkgAcquire::RunResult pkgAcquire::Run(int PulseInterval)
{
...
   // Run till all things have been acquired
   struct timeval tv = SteadyDurationToTimeVal(std::chrono::microseconds(PulseInterval));
   while (ToFetch > 0)
...
      int Res;
      do
      {
         Res = select(Highest+1,&RFds,&WFds,0,&tv);
      }
      while (Res < 0 && errno == EINTR);
```
Relevant code in child:
```
int BaseHttpMethod::Loop()
{
   signal(SIGTERM,SigTerm);
   signal(SIGINT,SigTerm);

   Server = 0;

   int FailCounter = 0;
   while (1)
   {
      // We have no commands, wait for some to arrive
      if (Queue == 0)
      {
         if (WaitFd(STDIN_FILENO) == false)
            return 0;
```

Versions:
```
Ubuntu 22.04.1 LTS (Jammy Jellyfish)
apt 2.4.8
```

Preliminary conclusion/notes:

- I don't know why it stalls, but it does. And this was not a single incident. All the stalling apt processes had stopped at a different time/day.

- Apparently there is a situation possible where there are no commands queued from the parent, and the parent isn't planning on sending any either.

- Our apt-get is invoked by a cron job which calls either `update-notifier-common` or `apt-get update`: https://github.com/ossobv/zabbix-agent-osso/blob/c61aee6087c0d03e66d03013c61acd9f65d0eaab/scripts/dpkg.updates#L14-L31 -- on the machines where apt-get was hanging, there was no `update-notifier-common`.

- I don't know if this is 2.4.x related (on Focal we have apt 2.0.x), but it sure looks like it. I did check a few other Focal machines that lacked `update-notifier-common`, but they did not have a stalled apt-get (although _not_ seeing a hung process proves nothing).

Let me know if there's anything I can get you. Should this be filed elsewhere? (Debian Salsa?)

Cheers,
Walter Doekes
OSSO B.V.

Tags: jammy
Revision history for this message
Julian Andres Klode (juliank) wrote :

Patches are welcome, but unfortunately there isn't much we can do otherwise; this highly depends on network specifics like latency and MTUs and is never reliably reproducible.

Revision history for this message
Walter (wdoekes) wrote :

Thanks for the reply, Julian.

Let's assume that the problem is indeed latency/dropped packets/whatever on our side. IMO, an (occasionally) broken network should not cause apt-get to hang indefinitely. Do you think it should?

Also, it doesn't address that the behaviour seems recent. We have not observed anything like this with Focal or Bionic.

In the meantime, I've added notification code so at least we can track when this occurs. I'll fill you in if I get more details (like versions that are (not) affected).

----

Additional debug info:

I called recv(3, ...) on the http socket with a non-empty Recv-Q. It returned 0 (no error, EOF).
```
(gdb) call (ssize_t)recv(3, "abcdef", 1, 0)
$1 = 0
```
```
# netstat -apn | grep -E '154026|154029|154030|154031|154033'
tcp 1 0 10.91.52.91:60868 217.21.205.139:80 CLOSE_WAIT 154030/http
tcp 1 0 10.91.52.91:40756 178.128.6.101:80 CLOSE_WAIT 154029/http
tcp 0 0 10.91.52.91:56818 185.37.124.14:80 CLOSE_WAIT 154031/http
```

Next, I called close(0) on the stdin fd of one of the processes, and this woke up the whole list of tasks:
```
(gdb) call (int)close(0)
$1 = 0
```
This yielded the following error message:
```
E: Method http has died unexpectedly!
```
And the apt-get process got unstuck / was reaped successfully.

Revision history for this message
Walter (wdoekes) wrote :

Checking in.

I've had an `apt-get update` running in a while loop since Feb 6 on a machine: no luck getting it to stall.

Halfway through, I added an `iptables -A INPUT -m statistic --mode random --probability 0.05 -j DROP` rule, but this did not affect anything.

But in the meantime, in production, I have had two more of these since filing this bug report...

Revision history for this message
Walter (wdoekes) wrote :

Okay. I got some pcaps now. And apparently our apt-cacher-ng proxy sometimes serves:

- an InRelease
- with a 200 OK
- and a non-zero Content-Length
- but no content; instead, a TCP FIN.

Under the right circumstances, this causes apt to stall forever (a minimal mock-server sketch follows the list below).

The right circumstances seem to include:

- a valid download (or 304) from the same host
- a valid download from another host
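
To make the failure mode concrete, here is a rough sketch of a misbehaving server along those lines. This is not the actual server.py whose logs are shown below; the port, file name, and structure are assumptions:
```
#!/usr/bin/env python3
# Hypothetical sketch: answer the first InRelease request normally, then
# answer the second one with "200 OK" plus a Content-Length header but
# close the connection without sending the body (header + FIN, no content).
import socket

LISTEN = ("0.0.0.0", 3851)                   # assumed port, matching the sources.list below
INRELEASE = open("InRelease", "rb").read()   # any locally cached InRelease file

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(LISTEN)
srv.listen(5)

while True:
    conn, peer = srv.accept()
    print("client connected:", peer)
    requests_seen = 0
    while True:
        # Simplification: assume each recv() carries exactly one request
        # (apt may pipeline, in which case this needs real request parsing).
        data = conn.recv(65536)
        if not data:
            break
        requests_seen += 1
        print("client said:", data)
        header = ("HTTP/1.1 200 OK\r\n"
                  "content-type: octet/stream\r\n"
                  f"content-length: {len(INRELEASE)}\r\n\r\n").encode()
        if requests_seen == 1:
            conn.sendall(header + INRELEASE)  # first request: behave normally
        else:
            conn.sendall(header)              # second request: promise a body...
            break                             # ...but send a FIN instead
    conn.close()
    print("client disconnected:", peer)
```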

---

I created a test server, and got this to reproduce with apt 2.4.9 (from jammy-proposed):

=====================================
sources.list
=====================================
deb http://junk.devs.nu:3851/ubuntu jammy main
deb http://junk.devs.nu:3851/ubuntu jammy-updates main
deb [arch=amd64 signed-by=/etc/apt/keyrings/zabbix.gpg] http://repo.zabbix.com/zabbix/6.2/ubuntu jammy main

=====================================
server.py logs
=====================================
13:31:58 [2771009] client connected: ('217.21.199.56', 42838)
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 and serve local InRelease
13:31:58 [2771009] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:31:58 [2771009] we say: 200 but serve a FIN
13:31:58 [2771009] client said: b''
13:31:58 [2771009] client disconnected: ('217.21.199.56', 42838)

=====================================
apt-get update -o Debug::Acquire::http=1
=====================================
0% [Connecting to junk.devs.nu] [Connecting to repo.zabbix.com (2604:a880:2:d0::2062:d001)]GET /zabbix/6.2/ubuntu/dists/jammy/InRelease HTTP/1.1
Host: repo.zabbix.com
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Wed, 01 Feb 2023 15:02:57 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://repo.zabbix.com/zabbix/6.2/ubuntu/dists/jammy/InRelease
HTTP/1.1 304 Not Modified
Server: nginx/1.14.2
Date: Tue, 21 Feb 2023 12:31:54 GMT
Last-Modified: Wed, 01 Feb 2023 15:02:57 GMT
Connection: keep-alive
ETag: "63da7f21-135e"

Hit:1 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease
0% [Connecting to junk.devs.nu]GET /ubuntu/dists/jammy/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy/InRelease
HTTP/1.1 200 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: octet/stream
content-length: 270087
last-modified: Fri, 17 Feb 2023 13:29:00 GMT
x-original-source: http://de.archive.ubuntu.com/ubuntu/dists/jammy/InRelease

Get:2 http://junk.devs.nu:3851/ubuntu jammy InRelease [270 kB]
0% [2 InRelease 0 B/270 kB 0%]GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

Answer for: http://junk.devs.nu:3851/ubuntu/dists/jammy-updates/InRelease
...


Revision history for this message
Walter (wdoekes) wrote :

Got some more pcaps. I can also reproduce by sending a 503 after a 200, without aborting during/before body transfer (*).

============================
server.py logs
============================
13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 200 and serve local InRelease
13:29:55 [3666956] client said: b'GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1\r\nHost: junk.devs.nu:3851\r\nCache-Control: max-age=0\r\nAccept: text/*\r\nUser-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive\r\n\r\n'
13:29:55 [3666956] we say: 503

============================
503 request/response (valid length, no early FIN)
============================
GET /ubuntu/dists/jammy-updates/InRelease HTTP/1.1
Host: junk.devs.nu:3851
Cache-Control: max-age=0
Accept: text/*
User-Agent: Debian APT-HTTP/1.3 (2.4.9) non-interactive

HTTP/1.1 503 OK
date: Fri, 17 Feb 2023 15:11:49 GMT
content-type: text/plain
content-length: 19

something is broken

============================
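
For reference, the same behaviour can be provoked from the mock-server sketch in the earlier comment by answering the second request with a well-formed 503 instead of the truncated 200, something like (again just a sketch, names are made up):
```
# Hypothetical 503 variant for the sketch above: a complete, well-formed
# response with a correct Content-Length and full body, no early FIN.
def serve_503(conn):
    body = b"something is broken"
    header = ("HTTP/1.1 503 OK\r\n"
              "content-type: text/plain\r\n"
              f"content-length: {len(body)}\r\n\r\n").encode()
    conn.sendall(header + body)
```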

I present to you:

- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.txt
- https://junk.devs.nu/2023/lp2003851/not-reproduced-port-34634.pcap
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.txt
- https://junk.devs.nu/2023/lp2003851/reproduced-port-58014.pcap

In the not-reproduced case, the error is acknowledged and the download is retried. Then it fails quickly.

In the reproduced case, the error is not acknowledged, and the result is a hanging apt-get. Logs ending with:

Hit:3 http://repo.zabbix.com/zabbix/6.2/ubuntu jammy InRelease
0% [Working]

============================
(*) observed in the wild with these req/resp:
============================
GET /ubuntu/dists/jammy-backports/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 19:32:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 200 OK
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: octet/stream
content-length: 106807
last-modified: Wed, 22 Feb 2023 01:49:00 GMT
x-original-source: http://nl.archive.ubuntu.com/ubuntu/dists/jammy-backports/InRelease
(+ valid body)

GET /ubuntu-security/dists/jammy-security/InRelease HTTP/1.1
Host: apt.osso.nl
Cache-Control: max-age=0
Accept: text/*
If-Modified-Since: Tue, 21 Feb 2023 23:11:00 GMT
User-Agent: Debian APT-HTTP/1.3 (2.4.8) non-interactive

HTTP/1.1 503 Connection timeout
date: Wed, 22 Feb 2023 02:36:08 GMT
content-type: text/html
content-length: 486
(+ valid body)

============================
extra contemplations:
============================
We've only recently added the /ubuntu-security/ (extra path, on the same server). Maybe that is the reason why this is biting us right now.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apt (Ubuntu):
status: New → Confirmed
Ken Sharp (kennybobs)
tags: added: jammy
Revision history for this message
Walter (wdoekes) wrote :

Hi Julian,

> [There] isn't much we can do otherwise, this highly depends on network specifics like latency and mtus and is never reliably reproducible.

So, with my bogus python PoC server I mentioned in the comments, it is quite easily reproducible.

If I can assist you in getting it up and running, do let me know.

Cheers,
Walter

Revision history for this message
Alexey Gavrilov (le9i0nx) wrote (last edit ):

I suggest adding a parameter, or setting Timeout=120 in the default settings:

apt -o Acquire::http::Timeout=120 update

Alternatively, add `Acquire::http::Timeout "120";` to apt.conf or /etc/apt/apt.conf.d/99timeout (see https://askubuntu.com/questions/141513/how-to-lower-wait-time-for-repository-updates).

Revision history for this message
Jay D.R. (vtsj) wrote :

FYI, ran into this same issue with Debian bookworm's apt-get (2.6.1) hanging indefinitely talking to an apt-cacher-ng 3.7.x instance, easily reproducible by having ansible run an `apt-get update` on ~20-25 machines simultaneously. Problem disappears when switching back to apt-cacher-ng 3.6.x, so +1 for the apt-cacher-ng bug linked above.

Still, the problem persists even when invoking apt-get with `-o Acquire::http::Timeout=120`, so presumably something in apt itself is still not properly adhering to the configured timeout.

Revision history for this message
anarcat (anarcat) wrote :

We're seeing a similar issue here. At first we thought it was an issue specific to a prometheus collector (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028212 / https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/179) but now that I see this bug report, I can't help but think this is an issue in apt itself.

I should also mention this seems like a regression between bullseye and bookworm. For context, we have a nightly job that runs apt update here as well, a home-grown script (`dsa-update-apt-status`, from DSA) that spews warnings through cron when there's an issue. In the mailbox where I track those, I do have instances of this from before we started the bookworm upgrade, but they were rare. Since the beginning of our bookworm deployment, though, we are seeing more and more of this as we migrate machines over. We're now getting daily warnings from the fleet (all bookworm machines), as `dsa-update-apt-status` runs into lock contention with another job (`apt_info.py` from the above collector, which runs every 15m) more and more frequently, with locks sometimes being held for hours.

Our current workaround has been to set a time limit on the `apt_info.py` job, but we're *still* seeing errors, which is interesting in itself, as it means the issue is *not* specific to that script: it's a global apt issue. We've now had unattended-upgrades.py hanging forever as well, which we've never seen before.

So I think this is an apt issue. Perhaps Acquire::http::Timeout=120 is a valid workaround, but I can't help but think this is an issue that was specifically introduced between bullseye and bookworm (2.2 vs 2.6).

Revision history for this message
Adam Saponara (as-v) wrote :

Hello, also affected by this.

I'm able to reproduce the bug using Walter's mock server. I suspect it may be related to the retry code that kicks in when a transient error is encountered (like a 503). Retries are enabled by default since apt 2.3.2, which seems to fit the regression window people have mentioned here.

Frustratingly, I can't reproduce it while step debugging; I'm guessing that's because the bug is clock-dependent.

Setting either `Acquire::Retries::Delay=false` or `Acquire::Retries=0` seems to mitigate it. I'm curious if that works for others as well. I'll be testing this in prod soon and will report back.

In the meantime I'll keep trying to find the actual bug.

Revision history for this message
Adam Saponara (as-v) wrote :

Following up, `Acquire::Retries=0` seems to be avoiding the bug in prod, which strengthens my suspicion that it's related to retries.

As for the actual bug, when apt is hanging, I notice that the work queue has items in it, but they have not been delegated to a worker. There is a condition that is supposed to "unstick" the queue here[1], but for some reason `GetFetchAfter` is returning a zeroed time_point, which bypasses that condition[2].

The way we're setting `FetchAfter` looks OK to me[3], though I'm not familiar with std::chrono. Maybe there's some other place where we're partially copying the item, or instantiating a new item and failing to copy the `FetchAfter` value.

A bandaid patch is attached. I'm not sure if there's any downside to calling `Cycle` there. A real bugfix IMO would address the zeroed `FetchAfter` value described above, or potentially refactor so that work items are immediately delegated instead of having to rely on calling `Cycle` at the right time. I am missing a lot of context, so maybe that's not possible.

[1] https://salsa.debian.org/apt-team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-pkg/acquire.cc#L738-740

[2] https://salsa.debian.org/apt-team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-pkg/acquire.cc#L733-736

[3] https://salsa.debian.org/apt-team/apt/-/blob/db797312548f6e7d6d1d1faa9f1e48c1e521d290/apt-pkg/acquire-worker.cc#L651-655
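
To illustrate the mechanism being described, here is a purely illustrative toy model (Python, not apt's actual code; all names are made up) of how an undelegated item with a zeroed FetchAfter can sit in the queue forever while the workers stay silent:
```
# Toy model only: mimics the idea of the select loop picking its timeout
# from pending FetchAfter values. Items with fetch_after == 0.0 (the
# default, non-retried case) never shorten the timeout, so if such an item
# was never handed to a worker and no worker output arrives, nothing ever
# re-runs the queue logic and the loop just keeps pulsing every 500 ms.
import select
import time

class Item:
    def __init__(self, fetch_after=0.0):
        self.fetch_after = fetch_after   # 0.0 means "fetch immediately"

def pick_select_timeout(queued, pulse_interval=0.5):
    timeout = pulse_interval
    now = time.monotonic()
    for item in queued:
        if item.fetch_after == 0.0:
            continue                     # default items do not shorten the timeout
        timeout = min(timeout, max(0.0, item.fetch_after - now))
    return timeout

def run(worker_fds, queued):
    while queued:
        timeout = pick_select_timeout(queued)
        readable, _, _ = select.select(worker_fds, [], [], timeout)
        if readable:
            pass   # the real code would process worker messages and re-cycle the queues
        # on timeout only progress reporting happens; a stuck, undelegated
        # item with fetch_after == 0.0 stays stuck indefinitely
```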

Revision history for this message
Mikko Rantalainen (mira) wrote :

This sounds like a race condition that leads to live lock, considering that both gpgv and apt are waiting for the other end to speak at the same time.

I assume the protocol between these is custom-made for this connection. How is the protocol supposed to prevent this livelock situation from happening? If the protocol cannot avoid it, then the bug is in the protocol design, and the least invasive fix is to add a timeout to the wait and consider the execution failed if the timeout triggers.

Revision history for this message
Mikko Rantalainen (mira) wrote :

I'm seeing the same problem on Ubuntu 22.04.4 LTS:

$ ps -eHo pid,ppid,etimes,command | grep -w [a]pt
3162391 1 3033258 /bin/sh /usr/lib/apt/apt.systemd.daily update
3162395 3162391 3033258 /bin/sh /usr/lib/apt/apt.systemd.daily lock_is_held update
3162423 3162395 3033257 apt-get -qq -y update
3162432 3162423 3033257 /usr/lib/apt/methods/http
3163221 3162423 3033222 /usr/lib/apt/methods/gpgv

And since /usr/lib/apt/apt.systemd.daily is from package apt, this is definitely a bug in apt. A pretty simple workaround would be to run

  /usr/bin/timeout 15m /usr/lib/apt/apt.systemd.daily lock_is_held update

instead. Of course, that wouldn't fix the underlying problem, but it would automatically release the lock after a 15-minute delay.

Revision history for this message
Adam Saponara (as-v) wrote :

Following up again. Disabling retries reduced the frequency of this bug, but didn't eliminate it.

On the other hand, the patch I shared does seem to fix the bug. No occurrences since we shipped it ~2 weeks ago. I encourage others to try it out.

Revision history for this message
Julian Andres Klode (juliank) wrote :

We're going to need to fix this properly rather than busy-cycle the queues outside the select loop.

Revision history for this message
Julian Andres Klode (juliank) wrote :

The important point to note is that fetchAfter==0 is the default case if we did not get retries, so every non-retried item has that value and we don't want to cycle it. The loop's purpose is to find the smallest timeout to wait for, as an _optimization_ to the select() call below, such that if we have an item queued to fetch in 5s we fetch it in 5s even if we received no updates from the workers to process.

The queue is ordered such that all the 0 fetchAfter items come first (so as not to get stuck behind FetchAfter ones), but there may be an issue there, since new owners of items can appear that can *change* the FetchAfter after we have inserted the item; it probably needs to be dequeued and requeued when an owner is added (or when we change the FetchAfter of an owner, but I think we already do that there).

Revision history for this message
Julian Andres Klode (juliank) wrote :

So if you turn on `Debug::pkgAcquire::Worker` you should probably see it fail after a `@ Queue: Action combined for` message.

