containerd sporadic timeouts

Bug #1996678 reported by Ivan Kurnosov
This bug affects 9 people
Affects / Status / Importance / Assigned to
containerd (Ubuntu): Invalid / Undecided / Unassigned
containerd (Ubuntu Focal): Invalid / Undecided / Unassigned
linux (Ubuntu): Invalid / Undecided / Unassigned
linux (Ubuntu Focal): Fix Released / Critical / Thadeu Lima de Souza Cascardo

Bug Description

This morning I routinely applied security upgrades on a number of machines.

Containerd was upgraded from `1.5.9-0ubuntu1~20.04.4` to `1.5.9-0ubuntu1~20.04.5`.

What happened next:

at some random time, on machines with the new containerd, something happens to containerd tasks and/or cgroups.

This is how it's seen in syslog:

containerd[710]: time="2022-11-16T03:21:21.004153517Z" level=error msg="get state for 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a" error="context deadline exceeded: unknown"

And some ctr commands:

# ctr --namespace k8s.io task ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a 0 UNKNOWN

See that the status of the task is UNKNOWN (!!!)

# ctr --namespace k8s.io container ls|grep 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a
2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a k8s.gcr.io/pause:3.5 io.containerd.runc.v2

Cgroups:

    ├─kubepods-besteffort-pod3fdc0061_bbf6_47d1_97f5_b1f271b46e23.slice
    │ ├─cri-containerd-5f78e0cb957de97fd8465cc42c842bdd764d981ca7a903a2515bbc6bb06796a9.scope …
    │ │ └─2677 /csi-node-driver-registrar --v=0 --csi-address=/csi/csi.sock --k…
    │ ├─cri-containerd-af070f16c1f0ff22eb16661e787e85db3810727909abd23d69a6a43578c1dced.scope …
    │ │ └─3264 /usr/local/bin/cephcsi --type=liveness --endpoint=unix:///csi/cs…
    │ ├─cri-containerd-4e063ef0c8f768dbf34cf7a179bca5cc98a04fa7e00b29d20c17d3031d409f86.scope …
    │ │ └─2960 /usr/local/bin/cephcsi --nodeid=nas-decent-bobcat.dev-k8s-1.hq.w…
    │ └─cri-containerd-2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a.scope …
    │ └─2414 /pause

# ps auxf|grep 2414 -B 2
root 2279 0.1 0.0 114100 4956 ? Sl Nov15 0:42 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a -address /run/containerd/containerd.sock
65535 2414 0.0 0.0 964 4 ? Ss Nov15 0:00 \_ /pause

It does not happen immediately, but after some random time: sometimes several minutes, sometimes around an hour. Nonetheless, all machines with the new package get into this weird state.

As soon as I revert the package, everything goes back to running as expected.
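
For quick triage, a minimal shell sketch (assuming `ctr` and the `k8s.io` namespace, as in the commands above):

```
# Count tasks stuck in the UNKNOWN state (sketch; namespace as used above):
ctr --namespace k8s.io task ls | grep -c UNKNOWN

# Look for the corresponding containerd errors in syslog:
grep 'context deadline exceeded' /var/log/syslog | tail
```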

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

One extra note: it's not necessarily the `pause` process or the `cephcsi` pod, as in the example above. The pod/process selection also looks random, or at least I couldn't find any logic behind which one gets "UNKNOWN".

On this very node it's

# ctr --namespace k8s.io task ls|grep -i unknown
5f78e0cb957de97fd8465cc42c842bdd764d981ca7a903a2515bbc6bb06796a9 0 UNKNOWN
4e063ef0c8f768dbf34cf7a179bca5cc98a04fa7e00b29d20c17d3031d409f86 0 UNKNOWN
af070f16c1f0ff22eb16661e787e85db3810727909abd23d69a6a43578c1dced 0 UNKNOWN
2f5a8376b476809b1696b140ca87f91422113bb16b27a8174437cc63b48e259a 0 UNKNOWN

where, as one may see, all the tasks belong to the same pod.
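
A rough helper sketch (hypothetical, assuming the systemd cgroup layout shown in the description) to list the cgroup scope behind each UNKNOWN task:

```
# For every task reported as UNKNOWN, show its cri-containerd cgroup scope.
for id in $(ctr --namespace k8s.io task ls | awk '$3 == "UNKNOWN" {print $1}'); do
    systemd-cgls --no-pager | grep "cri-containerd-${id}.scope"
done
```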

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

And another note: the `pod3fdc0061_bbf6_47d1_97f5_b1f271b46e23` pod is up and running, doing what it should with no problems at the moment.

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

Okay, my apologies, it has now reproduced on `1.5.9-0ubuntu1~20.04.4` too :-(

One other variable that changed this morning was a kernel upgrade: `5.4.0.131.131` -> `5.4.0.132.132`.

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

Also, a new error message in syslog that I haven't seen before:

containerd[707]: time="2022-11-16T06:54:21.950622528Z" level=error msg="collecting metrics for d51815e8054f080da3cb9e8a83288bdd93db75dca2884f330a66c7d4dee4bbe2" error="cgroups: cgroup deleted: unknown"

ohhhhh :-(

Revision history for this message
Michael Sinz (michaelsinz) wrote :

Just a note: this seems to be related to the kernel. Specifically, we have seen this on Ubuntu 18.04 too, with the update from the 5.4.0-1094 kernel to the 5.4.0-1095 kernel.

Same problem: at some point, sometimes minutes and sometimes many hours after starting, the Kubernetes node metrics fail, and that is actually due to containerd having problems getting metrics.

They actually do continue to work, but go from taking a few milliseconds to taking many seconds to complete, which is beyond the timeout for collecting metrics.

We tried changing containerd versions but it did not matter; only the kernel change showed the impact, and rolling back to the 5.4.0-1094 kernel fixed everything.

Note that the 5.4.0-1095 update has a massive change list: https://launchpad.net/ubuntu/+source/linux-azure-5.4/5.4.0-1095.101~18.04.1

It may be that the cgroup change is involved (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988584), but I don't yet know if that is true. It may be something with IPC that is actually the problem. I have not gotten a chance to debug the kernel yet (I have been busy dealing with our services and rolling back the kernel update).

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in containerd (Ubuntu):
status: New → Confirmed
Revision history for this message
Ivan Kurnosov (zerkms) wrote :

Okay, so ubuntu 20.04 and 18.04 are affected. I can also confirm that the latest ubuntu 22.04 kernel (5.15.0-53) is NOT affected and runs just fine.

Revision history for this message
Khaled El Mously (kmously) wrote :

Hello @Ivan Kurnosov

> Okay, so ubuntu 20.04 and 18.04 are affected. I can also confirm that the latest ubuntu 22.04 kernel (5.15.0-53) is NOT affected and runs just fine.

When you say 20.04 is affected - could you please clarify which kernel you were using with 20.04? 5.4 or 5.15?

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

@Khaled El Mously (kmously),

for us the latest kernel that works fine on ubuntu 20.04 is `5.4.0-131`.

And `5.4.0-132` reproduces the problem.

Revision history for this message
gerald.yang (gerald-yang-tw) wrote :

Hello @Ivan

Our customers are hitting this issue too. We are also trying to reproduce it on our side, but with no luck so far.

Could you provide some info about your environment, and do you happen to know what configs could trigger this issue?

Thanks much

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

@gerald.yang (gerald-yang-tw)

it's:

os: ubuntu 20.04
kernel: 5.4.0-132-generic
containerd: 1.5.9-0ubuntu1~20.04.4
runc: 1.1.0-0ubuntu1~20.04.1
kubernetes: 1.24.6-00

> do you happen to know what configs could trigger this issue?

so far I don't know what exactly triggers it, and to me it looks random: either in a few minutes or in an hour. In the logs there is nothing that points to anything in particular.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

We suspect this is a kernel race condition on epoll that is more likely to happen with a recent change on kernel 5.4.0-132-generic (and other 5.4 kernels carrying that same change).

I have a kernel building with that epoll change reverted at ppa:cascardo/ppa (that should give you kernel version 5.4.0-132.148+epollcascardo1). Would you be able to test whether, with that kernel installed, the problem can no longer be reproduced so easily?

We are also working on the real fix for the race condition, given we were able to reproduce it with a synthetic workload even on 5.4.0-131-generic.

Cascardo.

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

@Thadeu Lima de Souza Cascardo,

that's good news; I'm keen to help resolve it, but installing a kernel from a third party unfortunately does not comply with our policies :-(

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

We would be able to install it from some Canonical test package repository, though.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Ivan,

Thanks for being willing to help and verify test kernels.

The policy requirements are certainly understandable.

Would it help with the policy requirements to verify that ~cascardo is a member of the Launchpad teams 'Canonical' (Canonical employees) and 'Canonical Kernel Team'? [1]

If not, I think he's out for the day (we chatted earlier), and I may be able to find someone who could copy it over to a PPA owned by the Canonical Kernel Team [2] (of which he is a member / to which he has rights, so it might be equivalent).

Please let us know if either way would work for you.
Thanks!

[1] https://launchpad.net/~cascardo/+participation
[2] https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Hi,

Staff Engineer at the Canonical Kernel Team here. I have now set up https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/test-builds, which is the Canonical Kernel Team's restricted PPA location.

The test kernel will be available from there shortly.

You will soon be able to use $ sudo add-apt-repository ppa:canonical-kernel-team/test-builds

To install packages from there, you will need to pass explicit version specifiers to apt with = signs, e.g. sudo apt install linux-image-unsigned-5.4.0-132-generic=5.4.0-132.148+epollcascardo1 linux-modules-5.4.0-132-generic=5.4.0-132.148+epollcascardo1 linux-modules-extra-5.4.0-132-generic=5.4.0-132.148+epollcascardo1

And so on, for all the desired packages (tools, headers, etc. if needed).
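
Putting the steps above together, a sketch of the full sequence (same versions and package set as quoted above; adjust to the packages you actually need):

```
sudo add-apt-repository ppa:canonical-kernel-team/test-builds
sudo apt update   # refresh indexes (add-apt-repository may already do this)
sudo apt install \
    linux-image-unsigned-5.4.0-132-generic=5.4.0-132.148+epollcascardo1 \
    linux-modules-5.4.0-132-generic=5.4.0-132.148+epollcascardo1 \
    linux-modules-extra-5.4.0-132-generic=5.4.0-132.148+epollcascardo1
```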

Regards,

Dimitri.

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

@Mauricio Faria de Oliveira (mfo), @Dimitri John Ledkov (xnox),

thanks! Please allow me around a day to set it up and run.

Changed in linux (Ubuntu):
status: New → Invalid
Changed in linux (Ubuntu Focal):
status: New → In Progress
assignee: nobody → Thadeu Lima de Souza Cascardo (cascardo)
importance: Undecided → Critical
summary: - 1.5.9-0ubuntu1~20.04.5 sporadic timeouts
+ containerd sporadic timeouts
Revision history for this message
Ivan Kurnosov (zerkms) wrote :

I have just booted into the new test kernel:

```
$ uname -a
Linux <redacted-hostname> 5.4.0-132-generic #148+epollcascardo1-Ubuntu SMP Tue Nov 22 12:57:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

There is also an alternative fix, which is the one we want to release; it is building at ppa:canonical-kernel-team/ppa2. The data point with the other kernel is still useful, though.

Thanks.
Cascardo.

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

@Thadeu Lima de Souza Cascardo (cascardo),

if necessary, later today I could install that too. It would be `5.4.0-135.152`, which you uploaded just `41 minutes ago`?

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Ivan,

Yes, I just checked its changelog, which includes commits for this LP bug number (below).

Thanks!

```
linux (5.4.0-135.152) focal; urgency=medium

  * focal/linux: 5.4.0-135.152 -proposed tracker (LP: #1997412)

  * containerd sporadic timeouts (LP: #1996678)
    - epoll: call final ep_events_available() check under the lock
    - epoll: check for events when removing a timed out thread from the wait queue
    - Revert "fs: check FMODE_LSEEK to control internal pipe splicing"

...
```

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

1.5 hours of runtime: looks good so far (`5.4.0-132-generic #148+epollcascardo1-Ubuntu`).

Revision history for this message
Hans Rakers (hrak) wrote :

The conclusion at https://github.com/prometheus/node_exporter/issues/2500 was that the relevant kernel change that caused this was `- epoll: autoremove wakers even more aggressively`.

There is a small bug reproduction script in that issue too.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Hans,

Thanks for the pointer to the synthetic reproducer!

It provided accurate and consistent results considering the kernel versions reported (not) to exhibit the issue.

The Azure test kernel with the 3 patches [1] to address it shows the same (good) results as the Azure kernel prior to the regression being introduced.

P.S.: the issue isn't strictly having that patch in, as it's included in later kernel versions without this issue (e.g., 5.15), but having that patch in while not also having these other patches in (as, e.g., 5.15 does).

[1] https://lists.ubuntu.com/archives/kernel-team/2022-November/135069.html

...

Test Results from 4x VMs on Azure (2x 4vCPU/16G and 2x 8vCPU/32G)

Test Steps follow below; essentially, run the for-loop with curl 10 times, and count how many times it doesn't finish / is stuck.
(i.e., epoll wait didn't return/finish).

1) original/"good" kernel: 0% error rate
-- 5.4.0-1094-azure #100-Ubuntu SMP Mon Oct 17 03:14:36 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

2) regression/"bad" kernel: 60%-80% error rate
-- 5.4.0-1095-azure #101-Ubuntu SMP Thu Oct 20 15:50:47 UTC 2022

VM1: 8/10
VM2: 7/10
VM3: 7/10
VM4: 6/10

3) candidate/"test" kernel: 0% error rate
-- 5.4.0-1098-azure #104-Ubuntu SMP Wed Nov 23 21:19:57 UTC 2022

VM1: 0/10
VM2: 0/10
VM3: 0/10
VM4: 0/10

...

Test Steps/Criteria on Focal:

Install go 1.19:

$ sudo snap install --channel=1.19/stable --classic go

Create test programs:

$ cat <<EOF >main.go
package main

import (
        "fmt"
        "github.com/prometheus/procfs/sysfs"
        "log"
        "net/http"
)

func main() {

        fs, err := sysfs.NewFS("/sys")
        if err != nil {
                panic(err)
        }
        netDevices, err := fs.NetClassDevices()

        http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
                for _, device := range netDevices {
                        _, err := fs.NetClassByIface(device)
                        if err != nil {
                                panic(err)
                        }
                }
                fmt.Printf(". ")
                writer.WriteHeader(200)
                writer.Write([]byte("ok"))
        })
        log.Fatal(http.ListenAndServe(":9100", nil))
}
EOF

$ cat <<EOF >go.mod
module app

go 1.19

require (
        github.com/prometheus/procfs v0.8.0 // indirect
        golang.org/x/sync v0.0.0-20220601150217-0de741cfad7f // indirect
)
EOF

Fetch test deps:

$ go mod download github.com/prometheus/procfs
$ go get <email address hidden>

Start test program:

go run main.go &

Exercise it:
for i in {0..10000} ; do curl localhost:9100/metrics ; done

Test Criteria:

PASS = for-loop finishes.
FAIL = for-loop doesn't finish.

# reference: https://github.com/prometheus/node_exporter/issues/2500#issuecomment-1304847221
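
A hedged wrapper for the "run the loop 10 times and count stuck runs" criterion above (the 300-second timeout is an arbitrary assumption; the inner loop is the test step above):

```
# Repeat the curl loop 10 times; a run that does not finish within the timeout
# is counted as stuck (i.e., epoll wait never returned).
stuck=0
for run in {1..10}; do
    if ! timeout 300 bash -c 'for i in {0..10000}; do curl -s localhost:9100/metrics >/dev/null; done'; then
        echo "run ${run}: did not finish (stuck)"
        stuck=$((stuck + 1))
    fi
done
echo "stuck runs: ${stuck}/10"
```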

Stack traces on FAIL / for-loop not finished:

azureuser@ktest-3:~$ sudo grep -l epoll /proc/$(pidof go)/task/*/stack | xargs sudo grep -H ^
/proc/33267/task/33287/stack:[<0>] ep_poll+0x3bb/0x410
/proc/33267/task/33287/stack:[<0>] do_epoll_wait+0xb8/0xd0
/proc/33267/task/33287/stack:[<0>] __x64_sys_epoll_pwait+0x4c/0xa0
/proc/33267/task/33287/...

Revision history for this message
Ivan Kurnosov (zerkms) wrote :

Right, one day of running the patched kernel has proven that it fixed the issue. Given there is a proper report on GitHub, I will probably revert to some older official kernel version.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in containerd (Ubuntu Focal):
status: New → Confirmed
Revision history for this message
Jasper Spaans (jap171) wrote :

Note that this is not limited to containerd; we're running an upstream version of the Prometheus `node_exporter` on several hosts, and it is also seeing spurious hangs that started after upgrading to the -132 kernel. Rebooting into -131 fixes it.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-135.152

---------------
linux (5.4.0-135.152) focal; urgency=medium

  * focal/linux: 5.4.0-135.152 -proposed tracker (LP: #1997412)

  * containerd sporadic timeouts (LP: #1996678)
    - epoll: call final ep_events_available() check under the lock
    - epoll: check for events when removing a timed out thread from the wait queue
    - Revert "fs: check FMODE_LSEEK to control internal pipe splicing"

  * CVE-2022-3621
    - nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()

  * CVE-2022-3565
    - mISDN: fix use-after-free bugs in l1oip timer handlers

  * CVE-2022-3566
    - tcp: Fix data races around icsk->icsk_af_ops.

  * CVE-2022-3567
    - ipv6: annotate some data-races around sk->sk_prot
    - ipv6: Fix data races around sk->sk_prot.

  * CVE-2022-3564
    - Bluetooth: L2CAP: Fix use-after-free caused by l2cap_reassemble_sdu

  * CVE-2022-3524
    - tcp/udp: Fix memory leak in ipv6_renew_options().

  * CVE-2022-3594
    - r8152: Rate limit overflow messages

  * CVE-2022-42703
    - mm/rmap.c: don't reuse anon_vma if we just want a copy

 -- Thadeu Lima de Souza Cascardo <email address hidden> Wed, 23 Nov 2022 16:51:44 -0300

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oracle/5.4.0-1091.100 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-oracle verification-needed-focal
Revision history for this message
Ivan Kurnosov (zerkms) wrote :

I was running `5.4.0-135` for 5 hours, and it looks good so far.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

This bug does not seem to be caused by containerd, so I am marking its tasks as Invalid. If you disagree, please say so.

Changed in containerd (Ubuntu Focal):
status: Confirmed → Invalid
Changed in containerd (Ubuntu):
status: Confirmed → Invalid
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.4.0-1093.101 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-aws
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1100.106 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-azure
Revision history for this message
gerald.yang (gerald-yang-tw) wrote :

This has been verified on the 5.4.0-135.152 generic kernel, which has also been released to focal-updates; see comment #28.

Set to verification done

tags: added: verification-done-focal
removed: verification-needed-focal