Reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1853635
In a large-scale test, the overcloud heat stack update failed while trying to scale out compute nodes from 200 to 250.
Up to 200 nodes, we did not face any issues with the overcloud heat stack.
We hit the issue after adding 50 compute nodes to a stack that already contained 200 compute nodes. To reproduce the issue, we used two types of composable hardware for the added nodes (50x1029p & 50x1029u).
Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.0 RC (Train)
Red Hat Enterprise Linux release 8.2 (Ootpa)
python3-tripleoclient-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
How reproducible: 100% reproducible in the scale lab.
Steps to Reproduce:
1. Deployed the overcloud and successfully scaled it out to 200 compute nodes.
$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 94a1e1aa-c10e-4597-8050-4c95b8118388 | overcloud | 5afea8d232064664b24278742e2cca22 | UPDATE_FAILED | 2020-07-01T15:35:11Z | 2020-07-03T10:06:18Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
2. Added 50 compute nodes (1029p/1029u). The heat stack update failed with the memory error below (see the yaql quota sketch after the steps).
$ openstack stack event list --nested-depth 5 overcloud|grep -i FAILED
2020-07-03 01:27:50Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
2020-07-03 07:07:37Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
2020-07-03 09:17:42Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
2020-07-03 11:02:43Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
3. The heat-engine log reported the following exceptions (see the heat-engine configuration sketch after the steps).
$ grep ^"2020-07-03 11" /var/log/containers/heat/heat-engine.log
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Unexpected exception in resource check.: yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource Traceback (most recent call last):
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 313, in check
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource adopt_stack_data)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 152, in _do_check_resource
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource stack, self.msg_queue)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 395, in check_resource_update
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource check_message)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/resource.py", line 1462, in update_convergence
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource runner(timeout=timeout, progress_callback=progress_callback)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/scheduler.py", line 163, in __call__
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource progress_callback=progress_callback):
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource
2020-07-03 11:02:43.351 42 INFO heat.engine.stack [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Stack UPDATE FAILED (overcloud): Expression consumed too much memory
2020-07-03 11:02:43.364 59 DEBUG heat.engine.sync_point [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] [8372:a425d54f-40f5-47a8-80f8-e773c19c0003:False] Waiting 8372: Got ConvergenceNode(rsrc_id=8372, is_update=True); still need ConvergenceNode(rsrc_id=8369, is_update=False) sync /usr/lib/python3.6/site-packages/heat/engine/sync_point.py:148
4. No failures for the TripleO container services; the only failed unit is unrelated.
$ systemctl list-units|grep -i fail
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
5. Heat queue memory consumption in rabbitmq; the queues show no message backlog.
$ sudo podman exec -it -u root rabbitmq rabbitmqctl list_queues name messages memory consumers|grep heat
heat-engine-listener.2333acf3-9fe3-4a66-bb63-af1745f9fe01 0 34876 1
heat-engine-listener_fanout_22414f883aae43bb8106e8559c4d74e3 0 34876 1
heat-engine-listener_fanout_ea46f05f1b724da19d9022a7104e14dc 0 34876 1
heat-engine-listener_fanout_5584624c14b7462e99bf3dba25ba7320 0 34876 1
heat-engine-listener_fanout_8b8627681b7f439bb85a7ceaf14e018c 0 34876 1
heat-engine-listener.8fa46a3e-8219-4f54-a2df-85895b38c12e 0 34876 1
heat-engine-listener_fanout_79bb5ccf6bf143ca87e7edaec3fcac24 0 34876 1
heat-engine-listener.c02eb874-bde6-4fd9-b42f-63a05413232d 0 34876 1
heat-engine-listener_fanout_e62ecc4e710f4f61bccd9e745a55f3a9 0 34876 1
heat-engine-listener_fanout_61e5480cfa4441cda92753af42797aec 0 34876 1
heat-engine-listener.79663783-2b14-43e3-ae5f-9e86e4e55cc5 0 34876 1
heat-engine-listener_fanout_514b9c7045ae4c158e3d90b099ca2670 0 34876 1
heat-engine-listener_fanout_5d92c54b968243fca50589b4ce122fe9 0 34876 1
heat-engine-listener.decec203-0f5b-4f06-a849-ed3f0c3c00e2 0 34876 1
heat-engine-listener.c5faf3f5-6388-43f7-8735-ac25feaf9bbe 0 34876 1
heat-engine-listener.49888dcc-1706-40ff-95b4-60860d79c79c 0 34876 1
heat-engine-listener_fanout_027da2962c724555bb3c7fedcc1563d1 0 34876 1
heat-engine-listener.400bf389-70d2-46a3-9e86-514468b0ecf5 0 34876 1
heat-engine-listener_fanout_da09506a1dd94b588646e400a3d6a4f7 0 34876 1
heat-engine-listener_fanout_070dce69c83a44f6b10039bdf04cb207 0 34876 1
heat-engine-listener.e26fb6b1-3adb-44d1-8c13-08da00649ad4 0 34876 1
heat-engine-listener_fanout_49c1f88a28314d1291b8bcbf42217116 0 34876 1
heat-engine-listener.da97ed73-c704-44e3-83ab-41a7edda5328 0 34876 1
heat-engine-listener_fanout_bbe95f9fc7d24c6bb62cf8710e57750c 0 34876 1
heat-engine-listener_fanout_1da2d891d71a403eaa77426567dfdecc 0 34876 1
heat-engine-listener.16a2c102-ac3b-4a1e-9bb0-1c1046fcbc53 0 34876 1
heat-engine-listener_fanout_b4f413fba1e248e388002e9bc01858f3 0 34876 1
heat-engine-listener.359421d4-ac08-4d3c-9d5a-36cebe1f83a8 0 34876 1
heat-engine-listener_fanout_7c26a4bc6a324520a7cbb211e9425129 0 34876 1
heat-engine-listener.83cccf38-c532-4e1a-9190-18dd907e5ac1 0 34876 1
heat-engine-listener 0 58588 24
heat-engine-listener.3d476f14-67c9-4173-87bb-1d684ab460d1 0 34876 1
heat-engine-listener.94dfca03-8481-4f94-b1ae-4abacc5df0d1 0 34876 1
heat-engine-listener.a0dbff6d-8772-44a4-a4b5-8742e01170ee 0 34876 1
heat-engine-listener_fanout_d826fa669c28460c90efd67e683c774f 0 34876 1
heat-engine-listener_fanout_ac24f6d620b44365856c156f6f54a2d7 0 34876 1
heat-engine-listener.ff9e0297-2caf-4955-968b-656bb4862bba 0 34876 1
heat-engine-listener.c2be9a98-9c99-47c3-9442-1bb6a8c2410e 0 34876 1
heat-engine-listener.69826b66-7ddb-4940-90e5-b011ce1e3f66 0 34876 1
heat-engine-listener_fanout_38a7e06a998b498ca1a4a02bf581f3da 0 34876 1
heat-engine-listener_fanout_fa9e41e2c8714bea86e9d249d45d5629 0 34876 1
heat-engine-listener.5a3e1ff9-23a4-4035-90a3-15df9905b1ed 0 34876 1
heat-engine-listener.49492ba6-2bbf-4281-8473-9ab8c7e1b8fb 0 34876 1
heat-engine-listener.cb92c6d4-a42b-4486-9505-fc7d3e10af96 0 34876 1
heat-engine-listener.943126b4-4431-4215-bafd-708775a79cab 0 34876 1
heat-engine-listener.af713d29-358c-4af4-bf5d-6667482f8009 0 34876 1
heat-engine-listener_fanout_b7a6e6c903504e858711b684925943d6 0 34876 1
heat-engine-listener_fanout_19294f8e1aea4255bb60287ec2083e85 0 34876 1
heat-engine-listener_fanout_8f6b117847d34fd1b8e272955811f084 0 34876 1
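The "Expression consumed too much memory" message in step 2 comes from the yaql library, which enforces a per-expression memory quota while heat-engine evaluates template expressions. Below is a minimal sketch of that failure mode, assuming only that the yaql Python package is installed; the quota value and the expression are illustrative and are not Heat's actual configuration.

from yaql.language import exceptions, factory

# Create an engine with a deliberately small per-expression memory quota
# (in bytes). 'yaql.memoryQuota' is the engine option name used in the
# upstream Heat source, as far as we can tell.
engine = factory.YaqlFactory().create(options={'yaql.memoryQuota': 10000})

try:
    # Materializing a large string is a cheap way to exceed the quota.
    engine("'x' * 1000000").evaluate()
except exceptions.MemoryQuotaExceededException as exc:
    print(exc)  # prints: Expression consumed too much memory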
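For reference, heat-engine builds one shared yaql parser whose limits come from the [yaql] section of heat.conf. A simplified sketch of that wiring, paraphrased from the upstream heat/engine/hot/functions.py as we read it (the option names come from the Heat source, not from this deployment):

from oslo_config import cfg
from yaql.language import factory

def build_heat_yaql_parser(conf=cfg.CONF):
    # [yaql]/limit_iterators caps how many items one expression may
    # iterate; [yaql]/memory_quota caps the bytes it may materialize and
    # is the quota behind "Expression consumed too much memory".
    options = {
        'yaql.limitIterators': conf.yaql.limit_iterators,
        'yaql.memoryQuota': conf.yaql.memory_quota,
        'yaql.convertTuplesToLists': True,
        'yaql.iterableDicts': True,
    }
    return factory.YaqlFactory().create(options=options)

If that reading is correct (the upstream defaults appear to be limit_iterators = 200 and memory_quota = 10000), raising both values in the undercloud heat.conf and restarting heat-engine should let the 250-node update pass; the patch referenced in the note below appears to take that approach.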
Actual results: The scale test failed at the 250-node count.
Expected results: We never experienced this heat-engine memory issue in the OSP 16.0 scale test, where we scaled to 250 nodes without any performance tuning.
So we would expect at least the same heat-engine performance this time.
Note: related to https://bugs.launchpad.net/tripleo/+bug/1869375 and the patch https://review.opendev.org/#/c/716497