In OPNFV we have run into several different issues on the undercloud while deploying. First we were running out of memory and the openvswitch agent was crashing; we fixed that by increasing the RAM for the undercloud VM to 12GiB. Following that change we started to hit random errors where heat stack queries to SQL would fail, MySQL would crash, etc.:
ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.0.2.1' ([Errno 111] ECONNREFUSED)") [SQL: u'SELECT 1']
^https://build.opnfv.org/ci/job/apex-deploy-virtual-os-nosdn-nofeature-ha-master/249/console
ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT stack.created_at AS stack_created_at, stack.deleted_at AS stack_deleted_at, stack.action AS stack_action, stack.status AS stack_status, stack.status_reason AS stack_status_reason, stack.id AS stack_id, stack.name AS stack_name, stack.raw_template_id AS stack_raw_template_id, stack.prev_raw_template_id AS stack_prev_raw_template_id, stack.username AS stack_username, stack.tenant AS stack_tenant, stack.user_creds_id AS stack_user_creds_id, stack.owner_id AS stack_owner_id, stack.parent_resource_name AS stack_parent_resource_name, stack.timeout AS stack_timeout, stack.disable_rollback AS stack_disable_rollback, stack.stack_user_project_id AS stack_stack_user_project_id, stack.backup AS stack_backup, stack.nested_depth AS stack_nested_depth, stack.convergence AS stack_convergence, stack.current_traversal AS stack_current_traversal, stack.current_deps AS stack_current_deps, stack.updated_at AS stack_updated_at, raw_template_1.created_at AS raw_template_1_created_at, raw_template_1.updated_at AS raw_template_1_updated_at, raw_template_1.id AS raw_template_1_id, raw_template_1.template AS raw_template_1_template, raw_template_1.files AS raw_template_1_files, raw_template_1.environment AS raw_template_1_environment \nFROM stack LEFT OUTER JOIN raw_template AS raw_template_1 ON raw_template_1.id = stack.raw_template_id \nWHERE stack.id = %s'] [parameters: ('5cd88f2b-4d8f-4c3b-9656-c400d7684d47',)]
[u'
^https://build.opnfv.org/ci/job/apex-deploy-virtual-os-odl_l2-nofeature-ha-master/259/console
I think there is still some kind of CPU or other resource contention. Looking at num_engine_workers in heat, this value has no limit by default for the undercloud. Setting it to 2, along with the heat API workers, solved the issue via this commit in OPNFV:
https://gerrit.opnfv.org/gerrit/#/c/14523/
TripleO should set a limit on num_engine_workers when configuring the undercloud to stop so many heat-engine process forks. Looking at our OPNFV CI, the time to deploy with the value set to 2 is about the same as with no limit: about 35-40 minutes for an HA deployment.
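For reference, the cap amounts to something like the following in heat's configuration on the undercloud (a minimal sketch with illustrative values; num_engine_workers and the heat_api workers option are standard heat.conf settings, but the exact way the OPNFV/TripleO tooling injects them may differ):

# /etc/heat/heat.conf on the undercloud (illustrative)
[DEFAULT]
# cap heat-engine worker forks instead of scaling with the host's CPU count
num_engine_workers = 2

[heat_api]
# likewise cap the heat API worker processes
workers = 2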
Note the default is not infinite: it will use either the number of cores on the box or 4, whichever is greater, and using the number of cores is the default behavior for most OpenStack services.
The memory usage issues are under investigation, see:
https://bugs.launchpad.net/heat/+bug/1570983
https://bugs.launchpad.net/heat/+bug/1570974
The problem with forcing e.g. 2 as a default is that when you deploy very large TripleO stacks consisting of many compute nodes, a single worker (or even two) can't keep up and RPC timeouts cause the deployment to fail (ref bug #1526045).
How many cores are you using in this environment? We've seen issues similar to what you report in the past when trying to run deployments on a single-core VM; at least two (preferably four) cores are recommended, as the undercloud services are fairly CPU and memory intensive. 12G should be fine for RAM provided you're deploying relatively small overclouds; add some swap and monitor its usage to be sure.