merge_configs consumes a large amount of memory with a high number of nodes

Bug #1885750 reported by Zdenek Dvorak
This bug affects 2 people
Affects: kolla-ansible
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

The action plugin merge_configs uses a large amount of memory when the number of compute nodes is high.
We are deploying OpenStack with 200 compute nodes.

The deployment fails with the error:
"ERROR! Unexpected Exception, this is probably a bug: [Errno 12] Cannot allocate memory"

The deployment failed during the action
"Copying over ceilometer.conf"
which uses the "merge_configs" plugin.
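For context, a minimal sketch of what such a task looks like in a kolla-ansible role; the source and destination paths here are illustrative, not copied from the actual role:

- name: Copying over ceilometer.conf
  merge_configs:
    # each source is an INI-style template; the plugin renders them and
    # merges the results in order, later files overriding earlier ones
    sources:
      - "{{ role_path }}/templates/ceilometer.conf.j2"
      - "{{ node_custom_config }}/global.conf"
      - "{{ node_custom_config }}/ceilometer.conf"
    dest: "{{ node_config_directory }}/ceilometer-central/ceilometer.conf"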

We increased the memory on the test machine to 16G and the deployment passed.
We monitored memory utilization and observed multiple peaks.
The peaks occur during these actions:
"Copying over cron logrotate config files"
"Copying over chrony.conf"
"Copying over nova.conf"
"Copying over neutron.conf"
"Copying over ceilometer.conf"
The highest peaks occur in the actions that use the merge_configs plugin.

We are using kolla-ansible to deploy OpenStack Rocky.

We have this configuration:
VM running kolla-ansible: 8 CPUs + 8G RAM
deployment size: 2 controllers + 200 computes

Ansible version and configuration:
ansible 2.5.15+company_patches.1 and standard 2.9.9 were used, with similar results

A short deployment time is needed in the real environment, so we run Ansible with the configuration parameter "forks = 16". This leads to 18 Ansible processes.

Total memory consumption can be decreased by reducing the number of processes (in ansible.cfg), but this prolongs the deployment time.

Per-process memory consumption grows with the number of computes. We measured this with the "top" command, with the following result:
allocated memory is in KiB

number of computes     VIRT      RES    SHR
                10   441712    92256   1856
                40   528036   192176   1856
               100   688868   328044   1856
               200   873684   523844   1856
               400  1227676   954724   2208

Generating the config file for one node (out of 400) uses 10 times more memory than generating the same file with a lower number of nodes.

Example "top" command output from the test is below.

It would be best to decrease the per-process memory consumption.

The problem can be reproduced with the described configuration.

Reproduction with a small test lab is also possible (I did it with 2 controllers + 2 computes).
Needed steps:
Deploy the small configuration.
Add dummy nodes to the inventory, as sketched below.
Start the playbook with the changed configuration.
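A rough sketch of the "dummy nodes" step, using Ansible's YAML inventory format; group and host names are illustrative only (the stock kolla-ansible multinode inventory is INI-based, but the idea is the same):

all:
  children:
    control:
      hosts:
        controller01:
        controller02:
    compute:
      hosts:
        compute01:
        compute02:
        # dummy hosts added only to inflate the inventory size
        dummy[001:200]: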

I will add a detailed description later.

P.S. Ansible details:

#ansible-playbook --version
ansible-playbook 2.5.15+company_patches.1
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible-playbook
  python version = 2.7.5 (default, May 3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

# ansible-playbook --version
ansible-playbook 2.9.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible-playbook
  python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

P.S.2: output of the top command filtered to ansible processes:

top command result for 10 computes
Thu Jun 11 06:49:29 KST 2020
top - 06:49:30 up 5 days, 16:14, 3 users, load average: 2.39, 0.80, 0.72
Tasks: 187 total, 13 running, 174 sleeping, 0 stopped, 0 zombie
%Cpu(s): 72.0 us, 6.1 sy, 0.0 ni, 22.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7878656 total, 4365256 free, 1117652 used, 2395748 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 6380452 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26244 root 20 0 367980 92220 1828 R 56.2 1.2 0:01.74 ansible-playb
26281 root 20 0 441712 92256 1856 R 56.2 1.2 0:00.58 ansible-playb
26291 root 20 0 441712 92248 1852 R 56.2 1.2 0:00.37 ansible-playb
26248 root 20 0 367980 92220 1828 R 50.0 1.2 0:01.71 ansible-playb
26251 root 20 0 441712 92256 1856 R 50.0 1.2 0:01.69 ansible-playb
26260 root 20 0 441712 92256 1856 R 50.0 1.2 0:01.55 ansible-playb
26268 root 20 0 441712 92256 1856 R 50.0 1.2 0:00.84 ansible-playb
26271 root 20 0 441712 92256 1856 R 50.0 1.2 0:00.75 ansible-playb
26273 root 20 0 441712 92256 1856 R 50.0 1.2 0:00.66 ansible-playb
26287 root 20 0 441712 92256 1856 R 50.0 1.2 0:00.48 ansible-playb
26256 root 20 0 441712 92256 1856 R 43.8 1.2 0:01.60 ansible-playb
26264 root 20 0 441712 92256 1856 R 43.8 1.2 0:01.49 ansible-playb
24235 root 20 0 367980 96216 5836 S 12.5 1.2 1:29.16 ansible-playb
26300 root 20 0 157740 2096 1468 R 12.5 0.0 0:00.02 top

top command result for 400 computes, which failed with a memory error

Thu Jun 11 08:08:54 KST 2020
KiB Mem : 7878656 total, 4243060 free, 2975644 used, 659952 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 4533696 avail Mem
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14105 root 20 0 1227676 954728 2212 R 56.2 12.1 0:06.58 ansible-playb
14129 root 20 0 1227676 954684 2180 R 50.0 12.1 0:00.30 ansible-playb
14098 root 20 0 1227676 954724 2208 R 43.8 12.1 0:07.44 ansible-playb
14130 root 20 0 1227676 954680 2176 R 43.8 12.1 0:00.17 ansible-playb
10306 root 20 0 1225552 958308 5828 R 37.5 12.2 55:57.69 ansible-playb
14056 root 20 0 1227676 954728 2212 R 31.2 12.1 0:08.24 ansible-playb
14060 root 20 0 1227676 954724 2208 R 31.2 12.1 0:07.38 ansible-playb
14062 root 20 0 1227676 954724 2208 R 31.2 12.1 0:06.75 ansible-playb
14066 root 20 0 1227676 954724 2208 R 31.2 12.1 0:06.82 ansible-playb
14084 root 20 0 1227676 954724 2208 R 31.2 12.1 0:05.65 ansible-playb
14088 root 20 0 1227676 954724 2208 R 31.2 12.1 0:04.39 ansible-playb
14092 root 20 0 1227676 954728 2212 R 31.2 12.1 0:03.82 ansible-playb
14094 root 20 0 1227676 954724 2208 R 31.2 12.1 0:03.89 ansible-playb
14106 root 20 0 1227676 954728 2212 R 31.2 12.1 0:03.61 ansible-playb
14116 root 20 0 1227676 954728 2212 R 31.2 12.1 0:03.12 ansible-playb
14120 root 20 0 1227676 954712 2200 R 31.2 12.1 0:00.96 ansible-playb
14124 root 20 0 1227676 954708 2200 R 31.2 12.1 0:00.59 ansible-playb
10315 root 20 0 2668048 126068 2168 S 0.0 1.6 1:11.12 ansible-playb

Revision history for this message
Mark Goddard (mgoddard) wrote :

Thanks for the detailed bug report, this is very useful information. I am currently looking at performance improvements for this blueprint: https://blueprints.launchpad.net/kolla-ansible/+spec/performance-improvements. We have had reports of high memory usage in the past. Some of this is down to the nature of ansible - each process has variables for each host. However, there may be some improvements we can make.

In the meantime I would say that 16G is not huge, and more memory will always allow you to run with more forks.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Of course, if you are available to help with profiling and performance tuning efforts, that would be appreciated.

Changed in kolla-ansible:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Zdenek Dvorak (zdenek-dvorak) wrote :

Hello,
I did some experiments to improve the current state.
The key factor is the memory consumption of a single process. This can be difficult and time-consuming to solve.

The secondary factor is the number of processes/forks. This is much easier to change. It is possible to reduce the number of forks for problematic playbooks.
Ansible 2.9 provides a new keyword, "throttle". The official description is here: https://docs.ansible.com/ansible/latest/user_guide/playbooks_strategies.html

"The second keyword to affect execution is throttle, which can also be used at the block and task level. This keyword limits the number of workers up to the maximum set via the forks setting or serial."

Reducing the number of processes is just a workaround, not a solution to the problem.
