Activity log for bug #1923607

Date Who What changed Old value New value Message
2021-04-13 12:16:55 Michele Baldessari bug added bug
2021-04-13 13:25:53 Michele Baldessari description updated (the old and new descriptions are identical except that the bare "dump" command gained its missing "virsh" prefix); the new value follows:

I had left a deployment done a few weeks ago alone for a while, came back to it today and it was totally unusable. Since I could not run any commands I took a crash dump via virsh (I did so for the UC; the OC nodes seemed to be in a similar state, although I did not check them in detail):

virsh dump --memory-only --file /tmp/undercloud-dump.crash --live undercloud-0

I loaded up the vmcore in the crash utility [1]:

crash kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux undercloud-dump.crash

And could conclude the following (the UC has 16GB of RAM):

A) Load was sky high and there was no free memory:

      KERNEL: kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux
    DUMPFILE: undercloud-dump.crash
        CPUS: 4
        DATE: Tue Apr 13 05:32:44 2021
      UPTIME: 41 days, 15:28:18
LOAD AVERAGE: 31.78, 31.95, 32.38
       TASKS: 4242
    NODENAME: undercloud-0.bgp.ftw
     RELEASE: 4.18.0-240.10.1.el8_3.x86_64
     VERSION: #1 SMP Mon Jan 18 17:05:51 UTC 2021
     MACHINE: x86_64 (2194 Mhz)
      MEMORY: 16 GB
       PANIC: ""

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  4052899      15.5 GB         ----
         FREE    35350     138.1 MB    0% of TOTAL MEM
         USED  4017549      15.3 GB   99% of TOTAL MEM
       SHARED   203722     795.8 MB    5% of TOTAL MEM
      BUFFERS        0            0    0% of TOTAL MEM
       CACHED   533131         2 GB   13% of TOTAL MEM
         SLAB  1360379       5.2 GB   33% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP        0            0         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP

 COMMIT LIMIT  2026449       7.7 GB         ----
    COMMITTED 32410872     123.6 GB  1599% of TOTAL LIMIT

B) Most memory was used up by an incredibly large number of podman processes:

crash> ps -u -G | tail -n +2 | cut -b2- | sort -n -k8 | awk '{print $8/1048576" "$9}' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' | sort -n -k2 | tail -n10
iscsid      0.0118484
bash        0.0243454
sshd        0.063778
httpd       0.0780067
run-parts   0.0805359
logger      0.141033
podman      0.202381
crond       1.26209
(ontainer)  3.1892
(podman)    17.2151

crash> ps -u -G | wc -l
3775
crash> ps -u -G | grep podman | wc -l
2555

C) There is a truckload of processes called '(podman)', with parentheses, whose parent pid is 1:

crash> ps -u -G | grep "(podman)" | wc -l
2547

D) On a normal, freshly deployed and working undercloud there are basically *no* podman processes, because they are actually called conmon. I took a crash dump of a working undercloud and saw:

crash> ps -u -G | grep -e podman | wc -l
0
crash> ps -u -G | grep -e conmon | wc -l
23

which is a lot more sensible.

[1] https://crash-utility.github.io/
2021-04-13 13:46:38 John Eckersberg bug added subscriber John Eckersberg
2021-04-28 14:42:52 Bogdan Dobrelya tripleo: importance High Critical
2021-05-06 14:35:09 Marios Andreou tripleo: milestone wallaby-rc1 xena-1
2021-06-03 12:03:28 OpenStack Infra tripleo: status Triaged In Progress
2021-06-03 16:26:44 wes hayutin tags alert
2021-06-05 20:53:32 OpenStack Infra tripleo: status In Progress Fix Released
2021-06-07 12:29:34 OpenStack Infra tags alert alert in-stable-wallaby
2021-06-07 20:35:36 OpenStack Infra tags alert in-stable-wallaby alert in-stable-victoria in-stable-wallaby
2021-06-08 21:32:59 OpenStack Infra tags alert in-stable-victoria in-stable-wallaby alert in-stable-train in-stable-victoria in-stable-wallaby
2021-06-08 21:33:17 OpenStack Infra tags alert in-stable-train in-stable-victoria in-stable-wallaby alert in-stable-train in-stable-ussuri in-stable-victoria in-stable-wallaby
2021-09-06 09:55:28 Bogdan Dobrelya tripleo: assignee Michele Baldessari (michele)