undercloud (and overcloud nodes) in master became unresponsive after a couple of weeks

Bug #1923607 reported by Michele Baldessari
This bug affects 3 people
Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

I had left a deployment from a few weeks ago untouched for a while; when I came back to it today it was totally unusable. Since I could not run any commands, I took a crash dump via virsh (of the UC only; the OC nodes seemed to be in a similar state, although I did not check them in detail):
virsh dump --memory-only --file /tmp/undercloud-dump.crash --live undercloud-0

I loaded up the vmcore in the crash utility [1]
crash kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux undercloud-dump.crash

From the dump I could conclude the following (the UC has 16GB of RAM):
A) Load was sky-high and there was almost no free memory
      KERNEL: kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux
    DUMPFILE: undercloud-dump.crash
        CPUS: 4
        DATE: Tue Apr 13 05:32:44 2021
      UPTIME: 41 days, 15:28:18
LOAD AVERAGE: 31.78, 31.95, 32.38
       TASKS: 4242
    NODENAME: undercloud-0.bgp.ftw
     RELEASE: 4.18.0-240.10.1.el8_3.x86_64
     VERSION: #1 SMP Mon Jan 18 17:05:51 UTC 2021
     MACHINE: x86_64 (2194 Mhz)
      MEMORY: 16 GB
       PANIC: ""

crash> kmem -i
                 PAGES TOTAL PERCENTAGE
    TOTAL MEM 4052899 15.5 GB ----
         FREE 35350 138.1 MB 0% of TOTAL MEM
         USED 4017549 15.3 GB 99% of TOTAL MEM
       SHARED 203722 795.8 MB 5% of TOTAL MEM
      BUFFERS 0 0 0% of TOTAL MEM
       CACHED 533131 2 GB 13% of TOTAL MEM
         SLAB 1360379 5.2 GB 33% of TOTAL MEM

   TOTAL HUGE 0 0 ----
    HUGE FREE 0 0 0% of TOTAL HUGE

   TOTAL SWAP 0 0 ----
    SWAP USED 0 0 0% of TOTAL SWAP
    SWAP FREE 0 0 0% of TOTAL SWAP

 COMMIT LIMIT 2026449 7.7 GB ----
    COMMITTED 32410872 123.6 GB 1599% of TOTAL LIMIT
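
The numbers in the kmem -i table can be cross-checked with a bit of arithmetic. This is only a sketch: it assumes the usual 4 KiB x86_64 page size (the page size is not printed above) and that crash truncates percentages rather than rounding them. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: standard 4 KiB pages on x86_64


def pages_to_gib(pages: int) -> float:
    """Convert a page count into GiB."""
    return pages * PAGE_SIZE / 2**30


def percent_of(part: int, total: int) -> int:
    """Integer percentage, truncated (assumed to match crash's display)."""
    return part * 100 // total


# Page counts copied from the kmem -i output above.
total_mem = 4052899
slab = 1360379
commit_limit = 2026449
committed = 32410872

print(f"total mem : {pages_to_gib(total_mem):.1f} GiB")
print(f"slab      : {pages_to_gib(slab):.1f} GiB, {percent_of(slab, total_mem)}% of total")
print(f"committed : {pages_to_gib(committed):.1f} GiB, "
      f"{percent_of(committed, commit_limit)}% of the commit limit")
```

The commit limit is almost exactly half of total memory, which is what one would expect with no swap and the default overcommit ratio of 50, so the committed figure is roughly 16 times the limit.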

B) Most memory was used up by an incredibly large number of podman processes
crash> ps -u -G|tail -n +2|cut -b2- | sort -n -k8 | awk '{print $8/1048576" "$9}' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' | sort -n -k2|tail -n10
iscsid 0.0118484
bash 0.0243454
sshd 0.063778
httpd 0.0780067
run-parts 0.0805359
logger 0.141033
podman 0.202381
crond 1.26209
(ontainer) 3.1892
(podman) 17.2151

crash> ps -u -G |wc -l
3775
crash> ps -u -G |grep podman |wc -l
2555
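
For readers who find the awk pipeline above hard to follow, here is roughly the same aggregation as a Python sketch. It assumes the crash ps columns are PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM, that RSS is reported in KiB (so dividing by 1048576 gives GiB), and that the first byte of each line is only a marker column. The sample rows are invented, not taken from the dump.

```python
from collections import defaultdict


def rss_by_command(ps_lines):
    """Sum the RSS column per command name, smallest total first.

    Mirrors the shell pipeline: skip the header (tail -n +2), drop the
    first byte of each line (cut -b2-), then add up field 8 per field 9.
    """
    totals = defaultdict(float)
    for line in ps_lines[1:]:          # skip the header line
        fields = line[1:].split()      # drop the leading marker byte
        if len(fields) < 9:
            continue                   # ignore short or malformed rows
        rss_kib = int(fields[7])
        command = fields[8]
        totals[command] += rss_kib / 1048576  # KiB -> GiB
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical sample rows, for illustration only.
sample = [
    "   PID    PPID  CPU       TASK        ST  %MEM     VSZ     RSS  COMM",
    "  846103      1   1  ffff9c20f8aa1ec0  IN   3.2  900000  524288  (podman)",
    "  846786      1   1  ffff9c2030da8000  IN   3.2  900000  524288  (podman)",
    "    1234      1   0  ffff9c2000000000  IN   1.6  400000  262144  crond",
]

for command, gib in rss_by_command(sample):
    print(f"{command}\t{gib}")
```

Like the awk version, this breaks on command names that contain spaces, and summing RSS counts shared pages once per process, so totals can exceed physical RAM.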

C) There is a truckload of processes named '(podman)', parentheses included, whose parent PID is 1.
crash> ps -u -G |grep "(podman)" |wc -l
2547

D) On a normal, freshly deployed and working undercloud there are basically *no* podman processes, because they are actually called conmon. I took a crash dump of a working undercloud and saw:
crash> ps -u -G |grep -e podman |wc -l
0
crash> ps -u -G |grep -e conmon |wc -l
23

which is a lot more sensible.

[1] https://crash-utility.github.io/

Michele Baldessari (michele) wrote :

Interestingly if we inspect a normal "podman" process we see:
PID: 846098 TASK: ffff9c20fd8ddc40 CPU: 1 COMMAND: "podman"
ARG: /usr/bin/podman --root /var/lib/containers/storage --runroot /var/run/containers/storage --log-level error --cgroup-manager systemd --tmpdir /var/run/libpod --runtime runc --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend file container cleanup 881e8ef19bb5e57de90c1fb2a784f821934707b1075c44452e2e355b9df3aba7
ENV: PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
     _OCI_SYNCPIPE=3
     _OCI_STARTPIPE=4
     XDG_RUNTIME_DIR=
     _CONTAINERS_USERNS_CONFIGURED=
     _CONTAINERS_ROOTLESS_UID=

But those '(podman)' processes do not show any arguments nor env variables:
PID: 846103 TASK: ffff9c20f8aa1ec0 CPU: 1 COMMAND: "(podman)"
ARG: (podman)
ENV: HOME=/
     TERM=vt220
     BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.18.0-240.10.1.el8_3.x86_64
     crashkernel=auto

description: updated
Michele Baldessari (michele) wrote :

crash> bt 846786
PID: 846786 TASK: ffff9c2030da8000 CPU: 1 COMMAND: "(podman)"
 #0 [ffffc0998fe33c60] __schedule at ffffffffbd0d33f6
 #1 [ffffc0998fe33cf8] schedule at ffffffffbd0d3888
 #2 [ffffc0998fe33d08] schedule_timeout at ffffffffbd0d74a6
 #3 [ffffc0998fe33da0] unix_wait_for_peer at ffffffffbd03d44f
 #4 [ffffc0998fe33df0] unix_stream_connect at ffffffffbd0404ac
 #5 [ffffc0998fe33e70] __sys_connect at ffffffffbcf19c5a
 #6 [ffffc0998fe33f30] __x64_sys_connect at ffffffffbcf19ca6
 #7 [ffffc0998fe33f38] do_syscall_64 at ffffffffbc80419b
 #8 [ffffc0998fe33f50] entry_SYSCALL_64_after_hwframe at ffffffffbd2000ad
    RIP: 00007f8fdc558aa7 RSP: 00007ffdd43c8940 RFLAGS: 00000293
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8fdc558aa7
    RDX: 000000000000001d RSI: 000055d5fc3f1c60 RDI: 0000000000000003
    RBP: 000055d5fc3f1c60 R8: 0000000000000000 R9: 00007ffdd43c8f14
    R10: 00007ffdd43c8bb8 R11: 0000000000000293 R12: 000000000000001d
    R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000003
    ORIG_RAX: 000000000000002a CS: 0033 SS: 002b

All '(podman)' processes seem to be stuck waiting on a unix socket for their peer.

Michele Baldessari (michele) wrote :

(Thanks to eck who team-tagged some bits). So we know that
1) We called unix_stream_connect()
2) We ended up in unix_wait_for_peer()

Looking inside unix_stream_connect(), the call to unix_wait_for_peer() is here: https://github.com/torvalds/linux/blob/v4.18/net/unix/af_unix.c#L1261-L1273

	if (unix_recvq_full(other)) {
		err = -EAGAIN;
		if (!timeo)
			goto out_unlock;

		timeo = unix_wait_for_peer(other, timeo);

		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			goto out;
		sock_put(other);
		goto restart;
	}

So this, at the very least, probably means that all those '(podman)' processes are stuck because there is no connect backlog space left in the socket:
static inline int unix_recvq_full(struct sock const *sk)
{
	return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}

In any case, it would seem that the other side of the unix socket is either stuck or not calling accept() quickly enough.
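
The backlog behaviour described above can be reproduced in userspace. The sketch below is Linux-specific and only an illustration: it binds a throwaway AF_UNIX stream socket, never calls accept(), and keeps connecting with non-blocking clients until connect() reports EAGAIN, which is the path where a blocking client would end up sleeping in unix_wait_for_peer(). The function name and socket path are made up.

```python
import os
import socket
import tempfile


def fill_backlog(backlog: int = 1, max_attempts: int = 16):
    """Connect to a listener that never accept()s until connect() would block.

    Returns (number of successful connects, whether EAGAIN was hit).
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(backlog)             # note: accept() is never called
    clients = []
    blocked = False
    try:
        for _ in range(max_attempts):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            client.setblocking(False)  # timeo == 0, so the kernel returns -EAGAIN
            try:
                client.connect(path)
            except BlockingIOError:
                blocked = True         # receive queue of the listener is full
                client.close()
                break
            clients.append(client)     # keep the connection queued
        return len(clients), blocked
    finally:
        for client in clients:
            client.close()
        server.close()
        os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    connected, blocked = fill_backlog(backlog=1)
    print(f"{connected} connects succeeded, blocked={blocked}")
```

Because unix_recvq_full() uses a strict greater-than comparison, I would expect backlog + 1 connects to succeed before the next one blocks, but treat that count as an expectation rather than a guarantee.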

Michele Baldessari (michele) wrote :

So let's take one of these '(podman)' processes:

crash> bt -f 869520
PID: 869520 TASK: ffff9c1e89065c40 CPU: 1 COMMAND: "(podman)"
 #0 [ffffc099a8e97c60] __schedule at ffffffffbd0d33f6
    ffffc099a8e97c68: ffff9c1e89066690 ffff9c22afaa9dc0
    ffffc099a8e97c78: ffff9c1e89061ec0 ffff9c1e89065c40
    ffffc099a8e97c88: ffff9c22afaa9dc0 ffffc099a8e97cf0
    ffffc099a8e97c98: ffffffffbd0d33f6 0000000688340c02
    ffffc099a8e97ca8: 0000000092102035 0000000000000000
    ffffc099a8e97cb8: ffff9c2200000004 0e666d1c6cf1cc00
    ffffc099a8e97cc8: ffff9c1e89065c40 7fffffffffffffff
    ffffc099a8e97cd8: 000000000000001e ffff9c1e43336c00
    ffffc099a8e97ce8: ffff9c1f8dec30f0 ffff9c1f8dec3100
    ffffc099a8e97cf8: ffffffffbd0d3888
 #1 [ffffc099a8e97cf8] schedule at ffffffffbd0d3888
    ffffc099a8e97d00: ffff9c1f8dec2d00 ffffffffbd0d74a6
 #2 [ffffc099a8e97d08] schedule_timeout at ffffffffbd0d74a6
    ffffc099a8e97d10: 7fffffffffffffff ffff9c222e90e110
    ffffc099a8e97d20: ffffc099a8e97d60 ffff9c222e90e110
    ffffc099a8e97d30: ffffc099a8e97d60 ffff9c22afa36920
    ffffc099a8e97d40: ffff9c222e90e110 0000000000000202
    ffffc099a8e97d50: ffffc099a8e97da8 ffffffffbc90255a
    ffffc099a8e97d60: ffff9c1f8dec3108 0000000000000000
    ffffc099a8e97d70: 0e666d1c6cf1cc00 ffff9c1f8dec2d00
    ffffc099a8e97d80: ffff9c1f8dec3100 7fffffffffffffff
    ffffc099a8e97d90: 000000000000001e ffff9c1e43336c00
    ffffc099a8e97da0: ffffffffbd03d44f
 #3 [ffffc099a8e97da0] unix_wait_for_peer at ffffffffbd03d44f
    ffffc099a8e97da8: 0000000000000001 ffff9c1e89065c40
    ffffc099a8e97db8: ffffffffbc902b80 ffffc099a8e9fdc0
    ffffc099a8e97dc8: ffffc099a8ea7dc0 0e666d1c6cf1cc00
    ffffc099a8e97dd8: ffff9c1e761c4900 ffffc099a8e97e68
    ffffc099a8e97de8: ffff9c1f8dec2d00 ffffffffbd0404ac
 #4 [ffffc099a8e97df0] unix_stream_connect at ffffffffbd0404ac
    ffffc099a8e97df8: ffff9c1e761c4cf0 ffff9c20bed20d00
    ffffc099a8e97e08: ffff9c1e46dfa400 ffff9c1f8dec2d80
    ffffc099a8e97e18: 7fffffffffffffff ffffc099a8e97e80
    ffffc099a8e97e28: ffffffffbdb9d1c0 fffffff5a8e97e80
    ffffc099a8e97e38: 0e666d1c6cf1cc00 ffff9c1e9b2e2d00
    ffffc099a8e97e48: ffffc099a8e97e80 0000000000000000
    ffffc099a8e97e58: 000055d5fc3f1c60 0000000000000000
    ffffc099a8e97e68: 000000000000001d ffffffffbcf19c5a
 #5 [ffffc099a8e97e70] __sys_connect at ffffffffbcf19c5a
    ffffc099a8e97e78: 0000000000000002 732f6e75722f0001
    ffffc099a8e97e88: 6a2f646d65747379 732f6c616e72756f
    ffffc099a8e97e98: ffff0074756f6474 ffff9c1f461ed6a0
    ffffc099a8e97ea8: ffffc099a8e97f58 00000000c000003e
    ffffc099a8e97eb8: 0000000000000000 ffffffffbc803d23
    ffffc099a8e97ec8: ffff9c20c5a47698 ffff9c20c5a47698
    ffffc099a8e97ed8: ffffffffbc983b99 0000000000000080
    ffffc099a8e97ee8: ffffc099a8e97f58 ffffc099a8e97f58
    ffffc099a8e97ef8: 0000000000000000 0e666d1c6cf1cc00
    ffffc099a8e97f08: 000000000000002a ffffc099a8e97f58
    ffffc099a8e97f18: 0000000000000000 0000000000000000
    ffffc099a8e97f28: 0000000000000000 ffffffffbcf19ca6
 #6 [ffffc099a8e97f30] __x64_sys_connect at ffffffffbcf19ca6
    ffffc099a8e97f38: ffffffffbc80419b
 #7 [ffffc099a8e97f38] do_syscall_64 at ffffffffbc80419b
    ffffc099a8e97...

Michele Baldessari (michele) wrote :

Adding a redacted version of the previous comment:

#4 [ffffc099a8e97df0] unix_stream_connect at ffffffffbd0404ac
    ffffc099a8e97df8: ffff9c1e761c4cf0 ffff9c20bed20d00
    ffffc099a8e97e08: ffff9c1e46dfa400 ffff9c1f8dec2d80
    ffffc099a8e97e18: 7fffffffffffffff ffffc099a8e97e80
    ffffc099a8e97e28: ffffffffbdb9d1c0 fffffff5a8e97e80
    ffffc099a8e97e38: 0e666d1c6cf1cc00 ffff9c1e9b2e2d00
    ffffc099a8e97e48: ffffc099a8e97e80 0000000000000000
    ffffc099a8e97e58: 000055d5fc3f1c60 0000000000000000
    ffffc099a8e97e68: 000000000000001d ffffffffbcf19c5a
 #5 [ffffc099a8e97e70] __sys_connect at ffffffffbcf19c5a
    ffffc099a8e97e78: 0000000000000002 732f6e75722f0001
    ffffc099a8e97e88: 6a2f646d65747379 732f6c616e72756f
    ffffc099a8e97e98: ffff0074756f6474 ffff9c1f461ed6a0
    ffffc099a8e97ea8: ffffc099a8e97f58 00000000c000003e
    ffffc099a8e97eb8: 0000000000000000 ffffffffbc803d23
    ffffc099a8e97ec8: ffff9c20c5a47698 ffff9c20c5a47698
    ffffc099a8e97ed8: ffffffffbc983b99 0000000000000080
    ffffc099a8e97ee8: ffffc099a8e97f58 ffffc099a8e97f58
    ffffc099a8e97ef8: 0000000000000000 0e666d1c6cf1cc00
    ffffc099a8e97f08: 000000000000002a ffffc099a8e97f58
    ffffc099a8e97f18: 0000000000000000 0000000000000000
    ffffc099a8e97f28: 0000000000000000 ffffffffbcf19ca6
 #6 [ffffc099a8e97f30] __x64_sys_connect at ffffffffbcf19ca6
    ffffc099a8e97f38: ffffffffbc80419b
 #7 [ffffc099a8e97f38] do_syscall_64 at ffffffffbc80419b
    ffffc099a8e97f40: 0000000000000000 0000000000000000
    ffffc099a8e97f50: ffffffffbd2000ad
 #8 [ffffc099a8e97f50] entry_SYSCALL_64_after_hwframe at ffffffffbd2000ad
    RIP: 00007f8fdc558aa7 RSP: 00007ffdd43c8a20 RFLAGS: 00000293
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8fdc558aa7
    RDX: 000000000000001d RSI: 000055d5fc3f1c60 RDI: 0000000000000003
    RBP: 000055d5fc3f1c60 R8: 0000000000000000 R9: 00007ffdd43c8ff4
    R10: 00007ffdd43c8c98 R11: 0000000000000293 R12: 000000000000001d
    R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000003
    ORIG_RAX: 000000000000002a CS: 0033 SS: 002b

At frame 4 (unix_stream_connect) we poked at some of the 0xffff... addresses on the stack until we found the right one:
crash> ptype struct sockaddr_un
type = struct sockaddr_un {
    __kernel_sa_family_t sun_family;
    char sun_path[108];
}

crash> p *(struct sockaddr_un *) 0xffffc099a8e97e80
$8 = {
  sun_family = 1,
  sun_path = "/run/systemd/journal/stdout\000\377\377\240\326\036F\037\234\377\377X\177\351\250\231\300\377\377>\000\000\300\000\000\000\000\000\000\000\000\000\000\000\000#=\200\274\377\377\377\377\230v\244\305 \234\377\377\230v\244\305 \234\377\377\231;\230\274\377\377\377\377\200\000\000\000\000\000\000\000X\177\351\250\231\300"
}

So all these (podman) processes are trying to talk to /run/systemd/journal/stdout.
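
The same path can be read straight off the raw stack words in the __sys_connect frame, without crash's struct printing. A small sketch, assuming the four 64-bit words starting at ffffc099a8e97e80 are little-endian (x86_64) and taking the address length from RDX (0x1d) in the register dump; the function name is made up.

```python
import struct

# The four stack words at ffffc099a8e97e80..ffffc099a8e97e98 from the backtrace.
STACK_WORDS = [
    0x732f6e75722f0001,
    0x6a2f646d65747379,
    0x732f6c616e72756f,
    0xffff0074756f6474,
]


def decode_sockaddr_un(words, addrlen):
    """Decode little-endian 64-bit words as a struct sockaddr_un."""
    raw = b"".join(struct.pack("<Q", word) for word in words)[:addrlen]
    (family,) = struct.unpack_from("<H", raw, 0)     # sun_family is a 16-bit field
    path = raw[2:].split(b"\0", 1)[0].decode("ascii")  # sun_path, up to the NUL
    return family, path


family, path = decode_sockaddr_un(STACK_WORDS, 0x1d)
print(family, path)  # prints: 1 /run/systemd/journal/stdout
```

Family 1 is AF_UNIX, and 0x1d (29) is exactly 2 bytes of sun_family plus the 27 characters of the path, which matches the third argument of the connect() call.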

Which, according to https://unix.stackexchange.com/questions/205883/understand-logging-in-linux/294206#294206 is:
"It listens on the AF_LOCAL stream socket at /run/systemd/journal/stdout for log data coming from systemd-managed services."

So we started looking at the filesystem:
crash> mod -s xfs
     MODULE NAME SIZE OBJECT FILE
ffffffffc0652600 xfs 1...


John Eckersberg (jeckersb) wrote :

re: the questionable inode numbers above...

I did the same thing on a normal, functional undercloud. It also showed insane inode counts and no free inodes. However inspection of the running system showed that everything was fine.

So we've just done something wrong and made poor assumptions about how we're reading the data out of the xfs superblock struct.

Paras Babbar (pbabbar) wrote :

I have faced a similar issue in a tripleo deployment. I also kept the environment for a few weeks, and then I couldn't even ssh to the nodes and the OC just became unresponsive.

Changed in tripleo:
importance: High → Critical
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/794633

wes hayutin (weshayutin)
tags: added: alert
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/794485
Committed: https://opendev.org/openstack/tripleo-ansible/commit/f31bab878bfd3332c20a10bf9ca26d443028d214
Submitter: "Zuul (22348)"
Branch: master

commit f31bab878bfd3332c20a10bf9ca26d443028d214
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200

    Add podman's events_logger option by default set to journald

    By default podman 3.0.x sets the [engine]/events_logger to "file".
    This causes every exec in podman to create a line of text in
    /run/libpod/events/events.log like the following:

      {"ID":"412b6770c0b418e6d49a4801e71a198ddb81bbbefdaf1c9aad4d7948f77910ee","Image":"quay.io/centos/centos:latest","Name":"leak-test-7","Status":"exec","Time":"2021-06-03T08:36:05.237964012Z","Type":"container","Attributes":{"org.label-schema.build-date":"20201204","org.label-schema.license":"GPLv2","org.label-schema.name":"CentOS Base Image","org.label-schema.schema-version":"1.0","org.label-schema.vendor":"CentOS"}}

    Since by default /run is mounted on tmpfs, this has the side-effect of
    increasing kernel slab objects over time indefinitely eventually causing
    an OOM of the box.

    We initially wanted to switch to the 'none' backend, but the podman
    folks recommended using the journald backend because events logs are
    used by podman in case of a rare race when running "podman run --rm".
    Given that we call run with --rm in a multithreaded fashion this
    seems to be the safest approach. The drawback of using journald is
    that events won't be logged for rootless containers unless the user
    is part of the 'wheel' group. We believe we're not using those
    containers in tripleo anyways, so this should be safe.

    Tested by applying a backport of this patch to Train + podman 3.0.x and
    got the following:
    [root@controller-0 containers]# ls -la /run/libpod/events/
    total 0
    drwx------. 2 root root 40 Jun 3 11:55 .
    drwxr-x--x. 5 root root 140 Jun 3 11:55 ..

    [root@controller-0 containers]# more /etc/containers/containers.conf
    [containers]
    pids_limit = 4096
    [engine]
    events_logger = "journald"

    Also tested the override via the corresponding THT change in
    Ieffe2852111c3ec8347343a042dd78bbf691d79a.

    Closes-Bug: #1923607

    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/794948

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794592
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/4ea2a6eb7c84682d49b50f1e087c64b7dce13103
Submitter: "Zuul (22348)"
Branch: master

commit 4ea2a6eb7c84682d49b50f1e087c64b7dce13103
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

    Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/794485
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795034

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/794948
Committed: https://opendev.org/openstack/tripleo-ansible/commit/79be78bba35199c5b26632e51d8bda411a8239c5
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 79be78bba35199c5b26632e51d8bda411a8239c5
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200

    Add podman's events_logger option by default set to journald

    Closes-Bug: #1923607

    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3
    (cherry picked from commit f31bab878bfd3332c20a10bf9ca26d443028d214)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/795041

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795042

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795034
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/aaf57f21df8386b6143513960f04cb1b956b021b
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit aaf57f21df8386b6143513960f04cb1b956b021b
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

    Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/794948
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a
    (cherry picked from commit 4ea2a6eb7c84682d49b50f1e087c64b7dce13103)

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/795041
Committed: https://opendev.org/openstack/tripleo-ansible/commit/637db1c401c6c6a0d2e3cef26ab8a97cc3b31bf2
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 637db1c401c6c6a0d2e3cef26ab8a97cc3b31bf2
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200

    Add podman's events_logger option by default set to journald

    Closes-Bug: #1923607

    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3
    (cherry picked from commit f31bab878bfd3332c20a10bf9ca26d443028d214)
    (cherry picked from commit 79be78bba35199c5b26632e51d8bda411a8239c5)

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/795150

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795042
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/496accc442ebbbb2221d4c8cca2d37609a6c8ede
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 496accc442ebbbb2221d4c8cca2d37609a6c8ede
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

    Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/795041
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a
    (cherry picked from commit 4ea2a6eb7c84682d49b50f1e087c64b7dce13103)

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795281

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/794633
Committed: https://opendev.org/openstack/tripleo-ansible/commit/486e8d3833adbc78b5e646851f6edaa7a95c872a
Submitter: "Zuul (22348)"
Branch: stable/train

commit 486e8d3833adbc78b5e646851f6edaa7a95c872a
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200

    Add podman's events_logger option by default set to journald

    Closes-Bug: #1923607

    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3
    (cherry picked from commit f31bab878bfd3332c20a10bf9ca26d443028d214)

tags: added: in-stable-train
tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ansible/+/795150
Committed: https://opendev.org/openstack/tripleo-ansible/commit/33637b4ddf6b0561e740b9bf93f391b52f468605
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 33637b4ddf6b0561e740b9bf93f391b52f468605
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 11:07:30 2021 +0200

    Add podman's events_logger option by default set to journald

    Closes-Bug: #1923607

    Change-Id: I780103e17f1bb42a0546c30bd6c001c642ad88b3
    (cherry picked from commit f31bab878bfd3332c20a10bf9ca26d443028d214)
    (cherry picked from commit 79be78bba35199c5b26632e51d8bda411a8239c5)
    (cherry picked from commit 637db1c401c6c6a0d2e3cef26ab8a97cc3b31bf2)

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by "wes hayutin <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794634
Reason: zuul gate is stuck here

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/795281
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/8e31cbf17adc278449948807ac9592be1f251dad
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 8e31cbf17adc278449948807ac9592be1f251dad
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

    Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/795150
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a
    (cherry picked from commit 4ea2a6eb7c84682d49b50f1e087c64b7dce13103)
    (cherry picked from commit 496accc442ebbbb2221d4c8cca2d37609a6c8ede)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794634
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/6900793477cc70e2df20a54788d1c63b67ac5051
Submitter: "Zuul (22348)"
Branch: stable/train

commit 6900793477cc70e2df20a54788d1c63b67ac5051
Author: Michele Baldessari <email address hidden>
Date: Thu Jun 3 15:47:06 2021 +0200

    Allow customizing podman's [engine]/events_logger

    In I780103e17f1bb42a0546c30bd6c001c642ad88b3 we introduced the
    journald default for the events_logger key. With this change we
    allow to change this new default, in case we do need to change it
    for some reason.

    Related-Bug: #1923607

    Depends-On: https://review.opendev.org/c/openstack/tripleo-ansible/+/794633
    Change-Id: Ieffe2852111c3ec8347343a042dd78bbf691d79a
    (cherry picked from commit 31db48f61d1863a73c04e89d0547848db0661957)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 3.1.2

This issue was fixed in the openstack/tripleo-ansible 3.1.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 1.5.4

This issue was fixed in the openstack/tripleo-ansible 1.5.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 4.0.0

This issue was fixed in the openstack/tripleo-ansible 4.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-ansible 2.4.0

This issue was fixed in the openstack/tripleo-ansible 2.4.0 release.
