Activity log for bug #1923607

Date Who What changed Old value New value Message
2021-04-13 12:16:55 Michele Baldessari bug added bug
2021-04-13 13:25:53 Michele Baldessari description updated (the old and new descriptions are identical except that the bare "dump" command gained its missing "virsh" prefix); the new value follows:

I had left a deployment done a few weeks ago alone for a while, came back to it today and it was totally unusable. Since I could not run any commands I took a crash dump via virsh (I did so for the UC; the OC nodes seemed to be in a similar state, although I did not check them in detail):

virsh dump --memory-only --file /tmp/undercloud-dump.crash --live undercloud-0

I loaded up the vmcore in the crash utility [1]:

crash kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux undercloud-dump.crash

And could conclude the following (the UC has 16GB of RAM):

A) Load was sky high and there was no free memory:

      KERNEL: kernel/usr/lib/debug/lib/modules/4.18.0-240.10.1.el8_3.x86_64/vmlinux
    DUMPFILE: undercloud-dump.crash
        CPUS: 4
        DATE: Tue Apr 13 05:32:44 2021
      UPTIME: 41 days, 15:28:18
LOAD AVERAGE: 31.78, 31.95, 32.38
       TASKS: 4242
    NODENAME: undercloud-0.bgp.ftw
     RELEASE: 4.18.0-240.10.1.el8_3.x86_64
     VERSION: #1 SMP Mon Jan 18 17:05:51 UTC 2021
     MACHINE: x86_64 (2194 Mhz)
      MEMORY: 16 GB
       PANIC: ""

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  4052899      15.5 GB         ----
         FREE    35350     138.1 MB    0% of TOTAL MEM
         USED  4017549      15.3 GB   99% of TOTAL MEM
       SHARED   203722     795.8 MB    5% of TOTAL MEM
      BUFFERS        0            0    0% of TOTAL MEM
       CACHED   533131         2 GB   13% of TOTAL MEM
         SLAB  1360379       5.2 GB   33% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP        0            0         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP

 COMMIT LIMIT  2026449       7.7 GB         ----
    COMMITTED 32410872     123.6 GB  1599% of TOTAL LIMIT

B) Most memory was used up by an incredibly large number of podman processes:

crash> ps -u -G | tail -n +2 | cut -b2- | sort -n -k8 | awk '{print $8/1048576" "$9}' | awk '{ arr[$2]+=$1 } END { for (key in arr) printf("%s\t%s\n", key, arr[key]) }' | sort -n -k2 | tail -n10
iscsid      0.0118484
bash        0.0243454
sshd        0.063778
httpd       0.0780067
run-parts   0.0805359
logger      0.141033
podman      0.202381
crond       1.26209
(ontainer)  3.1892
(podman)    17.2151

crash> ps -u -G | wc -l
3775
crash> ps -u -G | grep podman | wc -l
2555

C) There is a truckload of processes called '(podman)', with parentheses, whose parent pid is 1:

crash> ps -u -G | grep "(podman)" | wc -l
2547

D) On a normal, freshly deployed and working undercloud there are basically *no* podman processes, because they are actually called conmon. I took a crash dump of a working undercloud and saw:

crash> ps -u -G | grep -e podman | wc -l
0
crash> ps -u -G | grep -e conmon | wc -l
23

which is a lot more sensible.

[1] https://crash-utility.github.io/
2021-04-13 13:46:38 John Eckersberg bug added subscriber John Eckersberg
2021-04-28 14:42:52 Bogdan Dobrelya tripleo: importance High Critical
2021-05-06 14:35:09 Marios Andreou tripleo: milestone wallaby-rc1 xena-1
2021-06-03 12:03:28 OpenStack Infra tripleo: status Triaged In Progress
2021-06-03 16:26:44 wes hayutin tags alert
2021-06-05 20:53:32 OpenStack Infra tripleo: status In Progress Fix Released
2021-06-07 12:29:34 OpenStack Infra tags alert alert in-stable-wallaby
2021-06-07 20:35:36 OpenStack Infra tags alert in-stable-wallaby alert in-stable-victoria in-stable-wallaby
2021-06-08 21:32:59 OpenStack Infra tags alert in-stable-victoria in-stable-wallaby alert in-stable-train in-stable-victoria in-stable-wallaby
2021-06-08 21:33:17 OpenStack Infra tags alert in-stable-train in-stable-victoria in-stable-wallaby alert in-stable-train in-stable-ussuri in-stable-victoria in-stable-wallaby
2021-09-06 09:55:28 Bogdan Dobrelya tripleo: assignee Michele Baldessari (michele)