Host entries for large scale deployments report YAQL memory error

Bug #1886203 reported by Emilien Macchi
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Fix Released | High | Bogdan Dobrelya |

Bug Description

Reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1853635

In a large-scale test, the overcloud Heat stack failed while we tried to scale out from 200 to 250 nodes.
Up to 200 nodes, we did not face any issues with the overcloud Heat stack.
We hit the issue after adding 50 compute nodes to a stack that already had 200 compute nodes. To reproduce the issue, we used two types of composable hardware with a node count of 50 (50x1029p & 50x1029u).

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.0 RC (Train)
Red Hat Enterprise Linux release 8.2 (Ootpa)
python3-tripleoclient-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200615103427.6f877f6.el8ost.noarch

How reproducible: 100% reproducible in Scale lab.

Steps to Reproduce:
1. Deployed the overcloud and successfully scaled out to 200 compute nodes.

$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 94a1e1aa-c10e-4597-8050-4c95b8118388 | overcloud  | 5afea8d232064664b24278742e2cca22 | UPDATE_FAILED | 2020-07-01T15:35:11Z | 2020-07-03T10:06:18Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

2. Added 50 compute nodes (1029p/1029u), but the Heat stack update failed with the memory error below.
    $ openstack stack event list --nested-depth 5 overcloud|grep -i FAILED
    2020-07-03 01:27:50Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
    2020-07-03 07:07:37Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
    2020-07-03 09:17:42Z [overcloud]: UPDATE_FAILED Expression consumed too much memory
    2020-07-03 11:02:43Z [overcloud]: UPDATE_FAILED Expression consumed too much memory

3. The heat-engine log reported the exceptions below.

$ grep ^"2020-07-03 11" /var/log/containers/heat/heat-engine.log
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Unexpected exception in resource check.: yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource Traceback (most recent call last):
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 313, in check
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource adopt_stack_data)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 152, in _do_check_resource
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource stack, self.msg_queue)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/check_resource.py", line 395, in check_resource_update
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource check_message)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/resource.py", line 1462, in update_convergence
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource runner(timeout=timeout, progress_callback=progress_callback)
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource File "/usr/lib/python3.6/site-packages/heat/engine/scheduler.py", line 163, in __call__
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource progress_callback=progress_callback):
..
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource yaql.language.exceptions.MemoryQuotaExceededException: Expression consumed too much memory
2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource
2020-07-03 11:02:43.351 42 INFO heat.engine.stack [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] Stack UPDATE FAILED (overcloud): Expression consumed too much memory
2020-07-03 11:02:43.364 59 DEBUG heat.engine.sync_point [req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] [8372:a425d54f-40f5-47a8-80f8-e773c19c0003:False] Waiting 8372: Got ConvergenceNode(rsrc_id=8372, is_update=True); still need ConvergenceNode(rsrc_id=8369, is_update=False) sync /usr/lib/python3.6/site-packages/heat/engine/sync_point.py:148

4. No failures were reported for the TripleO containers.

$ systemctl list-units|grep -i fail
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online

5. Memory consumption of the Heat queues in RabbitMQ.

$ sudo podman exec -it -u root rabbitmq rabbitmqctl list_queues name messages memory consumers|grep heat
heat-engine-listener.2333acf3-9fe3-4a66-bb63-af1745f9fe01 0 34876 1
heat-engine-listener_fanout_22414f883aae43bb8106e8559c4d74e3 0 34876 1
heat-engine-listener_fanout_ea46f05f1b724da19d9022a7104e14dc 0 34876 1
heat-engine-listener_fanout_5584624c14b7462e99bf3dba25ba7320 0 34876 1
heat-engine-listener_fanout_8b8627681b7f439bb85a7ceaf14e018c 0 34876 1
heat-engine-listener.8fa46a3e-8219-4f54-a2df-85895b38c12e 0 34876 1
heat-engine-listener_fanout_79bb5ccf6bf143ca87e7edaec3fcac24 0 34876 1
heat-engine-listener.c02eb874-bde6-4fd9-b42f-63a05413232d 0 34876 1
heat-engine-listener_fanout_e62ecc4e710f4f61bccd9e745a55f3a9 0 34876 1
heat-engine-listener_fanout_61e5480cfa4441cda92753af42797aec 0 34876 1
heat-engine-listener.79663783-2b14-43e3-ae5f-9e86e4e55cc5 0 34876 1
heat-engine-listener_fanout_514b9c7045ae4c158e3d90b099ca2670 0 34876 1
heat-engine-listener_fanout_5d92c54b968243fca50589b4ce122fe9 0 34876 1
heat-engine-listener.decec203-0f5b-4f06-a849-ed3f0c3c00e2 0 34876 1
heat-engine-listener.c5faf3f5-6388-43f7-8735-ac25feaf9bbe 0 34876 1
heat-engine-listener.49888dcc-1706-40ff-95b4-60860d79c79c 0 34876 1
heat-engine-listener_fanout_027da2962c724555bb3c7fedcc1563d1 0 34876 1
heat-engine-listener.400bf389-70d2-46a3-9e86-514468b0ecf5 0 34876 1
heat-engine-listener_fanout_da09506a1dd94b588646e400a3d6a4f7 0 34876 1
heat-engine-listener_fanout_070dce69c83a44f6b10039bdf04cb207 0 34876 1
heat-engine-listener.e26fb6b1-3adb-44d1-8c13-08da00649ad4 0 34876 1
heat-engine-listener_fanout_49c1f88a28314d1291b8bcbf42217116 0 34876 1
heat-engine-listener.da97ed73-c704-44e3-83ab-41a7edda5328 0 34876 1
heat-engine-listener_fanout_bbe95f9fc7d24c6bb62cf8710e57750c 0 34876 1
heat-engine-listener_fanout_1da2d891d71a403eaa77426567dfdecc 0 34876 1
heat-engine-listener.16a2c102-ac3b-4a1e-9bb0-1c1046fcbc53 0 34876 1
heat-engine-listener_fanout_b4f413fba1e248e388002e9bc01858f3 0 34876 1
heat-engine-listener.359421d4-ac08-4d3c-9d5a-36cebe1f83a8 0 34876 1
heat-engine-listener_fanout_7c26a4bc6a324520a7cbb211e9425129 0 34876 1
heat-engine-listener.83cccf38-c532-4e1a-9190-18dd907e5ac1 0 34876 1
heat-engine-listener 0 58588 24
heat-engine-listener.3d476f14-67c9-4173-87bb-1d684ab460d1 0 34876 1
heat-engine-listener.94dfca03-8481-4f94-b1ae-4abacc5df0d1 0 34876 1
heat-engine-listener.a0dbff6d-8772-44a4-a4b5-8742e01170ee 0 34876 1
heat-engine-listener_fanout_d826fa669c28460c90efd67e683c774f 0 34876 1
heat-engine-listener_fanout_ac24f6d620b44365856c156f6f54a2d7 0 34876 1
heat-engine-listener.ff9e0297-2caf-4955-968b-656bb4862bba 0 34876 1
heat-engine-listener.c2be9a98-9c99-47c3-9442-1bb6a8c2410e 0 34876 1
heat-engine-listener.69826b66-7ddb-4940-90e5-b011ce1e3f66 0 34876 1
heat-engine-listener_fanout_38a7e06a998b498ca1a4a02bf581f3da 0 34876 1
heat-engine-listener_fanout_fa9e41e2c8714bea86e9d249d45d5629 0 34876 1
heat-engine-listener.5a3e1ff9-23a4-4035-90a3-15df9905b1ed 0 34876 1
heat-engine-listener.49492ba6-2bbf-4281-8473-9ab8c7e1b8fb 0 34876 1
heat-engine-listener.cb92c6d4-a42b-4486-9505-fc7d3e10af96 0 34876 1
heat-engine-listener.943126b4-4431-4215-bafd-708775a79cab 0 34876 1
heat-engine-listener.af713d29-358c-4af4-bf5d-6667482f8009 0 34876 1
heat-engine-listener_fanout_b7a6e6c903504e858711b684925943d6 0 34876 1
heat-engine-listener_fanout_19294f8e1aea4255bb60287ec2083e85 0 34876 1
heat-engine-listener_fanout_8f6b117847d34fd1b8e272955811f084 0 34876 1
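
The queue listing in step 5 can be condensed into totals to check whether RabbitMQ is under pressure. The sketch below is a minimal helper, assuming the four columns requested on the command line (name, messages, memory in bytes, consumers). The function name and the three-row sample are illustrative only; the rows are copied from the listing above.

```python
def summarize_queues(listing, prefix="heat"):
    """Return (queue_count, total_messages, total_memory_bytes, total_consumers)."""
    count = messages = memory = consumers = 0
    for line in listing.splitlines():
        fields = line.split()
        # Expect exactly four columns: name messages memory consumers.
        if len(fields) != 4 or not fields[0].startswith(prefix):
            continue
        try:
            msgs, mem, cons = (int(value) for value in fields[1:])
        except ValueError:
            continue  # header line or malformed row
        count += 1
        messages += msgs
        memory += mem
        consumers += cons
    return count, messages, memory, consumers


# Three rows copied from the listing in step 5.
SAMPLE = """\
heat-engine-listener.2333acf3-9fe3-4a66-bb63-af1745f9fe01 0 34876 1
heat-engine-listener_fanout_22414f883aae43bb8106e8559c4d74e3 0 34876 1
heat-engine-listener 0 58588 24
"""

result = summarize_queues(SAMPLE)
print("queues=%d messages=%d memory_bytes=%d consumers=%d" % result)
```

In the listing above every queue holds zero messages and about 35 KB, which is consistent with the failure coming from the YAQL expression quota rather than from a queue backlog.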

Actual results: The scale test failed at a node count of 250.

Expected results: We never experienced heat-engine memory issues in the OSP 16.0 scale test, where we scaled to 250 nodes without any performance tuning.
So we would expect better heat-engine performance this time.
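
For triage, the heat-engine log lines from step 3 can be reduced to a count per exception. This is a rough sketch that assumes the line layout seen in the excerpt (date, time, PID, level, logger, message). The regular expressions and the function name are hypothetical helpers, not part of Heat.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <pid> <LEVEL> <logger> <message>".
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \d+ "
    r"(?P<level>[A-Z]+) (?P<logger>\S+) ?(?P<msg>.*)$"
)
# A dotted class name ending in Exception or Error, followed by ": detail".
EXC_RE = re.compile(r"(?P<exc>(?:\w+\.)+\w*(?:Exception|Error)): (?P<detail>.*)$")


def summarize_exceptions(lines):
    """Count (exception class, detail) pairs found on ERROR lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match or match.group("level") != "ERROR":
            continue
        exc = EXC_RE.search(match.group("msg"))
        if exc:
            counts[(exc.group("exc"), exc.group("detail"))] += 1
    return counts


# Lines copied from the heat-engine.log excerpt in step 3.
SAMPLE = [
    "2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource "
    "[req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] "
    "Unexpected exception in resource check.: "
    "yaql.language.exceptions.MemoryQuotaExceededException: "
    "Expression consumed too much memory",
    "2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource "
    "Traceback (most recent call last):",
    "2020-07-03 11:02:43.349 42 ERROR heat.engine.check_resource "
    "yaql.language.exceptions.MemoryQuotaExceededException: "
    "Expression consumed too much memory",
    "2020-07-03 11:02:43.351 42 INFO heat.engine.stack "
    "[req-0099248f-6fd4-4be5-bc04-8dce3668a8ae - admin - default default] "
    "Stack UPDATE FAILED (overcloud): Expression consumed too much memory",
]

summary = summarize_exceptions(SAMPLE)
for (exc_name, detail), seen in summary.items():
    print(f"{seen}x {exc_name}: {detail}")
```

Only ERROR lines are counted, so the INFO line that repeats the same reason does not inflate the total.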

Emilien Macchi (emilienm) wrote :
Changed in tripleo:
milestone: none → victoria-1
importance: Undecided → High
status: New → Triaged
assignee: nobody → Emilien Macchi (emilienm)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739249

Changed in tripleo:
status: Triaged → In Progress
tags: added: train-backport-potential ussuri-backport-potential
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/739539

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Bogdan Dobrelya (bogdando)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/739249
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c67c53a8d8094382bc75e7d56480ecb3f8897610
Submitter: Zuul
Branch: master

commit c67c53a8d8094382bc75e7d56480ecb3f8897610
Author: Emilien Macchi <email address hidden>
Date: Fri Jul 3 10:57:57 2020 -0400

    undercloud/heat: set YAQL memory quota to 200000

    Since the "optimization" [1] of host entries in Heat and its YAQLization,
    we need to increase the memory quota for YAQL queries or the resource
    will fail to process at large scale (250 nodes).

    [1] 3b8e6f78e19e776c087dc5c3ff225703b5c487bc

    Change-Id: I04cb72210fbd25a720158988698a300140f4e7db
    Closes-Bug: #1886203
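
To confirm that the raised quota actually reached the undercloud, the rendered Heat configuration can be inspected. The sketch below assumes the quota surfaces in heat.conf as `memory_quota` under a `[yaql]` section. That mapping, the fallback default of 10000, and the sample file content are assumptions not stated in this report, and the function name is hypothetical.

```python
import configparser

# Assumption: value returned when the option is not set explicitly.
ASSUMED_DEFAULT_QUOTA = 10000


def yaql_memory_quota(conf_text, default=ASSUMED_DEFAULT_QUOTA):
    """Return the [yaql] memory_quota value from heat.conf-style text."""
    # strict=False tolerates repeated options; interpolation=None keeps
    # literal '%' characters in other options from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint("yaql", "memory_quota", fallback=default)


# Hypothetical file content, for illustration only.
SAMPLE_CONF = """\
[DEFAULT]
debug = false

[yaql]
memory_quota = 200000
"""

print(yaql_memory_quota(SAMPLE_CONF))
```

On a real undercloud, the text would come from the heat.conf used by the heat-engine container instead of the inline sample.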

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/739631

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/739632

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/739661

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Bogdan Dobrelya (bogdando) (<email address hidden>) on branch: master
Review: https://review.opendev.org/739539
Reason: thanks, https://review.opendev.org/#/c/739661 implements the same idea and even more

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/739631
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d423af38a1bec45c3316c89d0d7a55e201792a50
Submitter: Zuul
Branch: stable/ussuri

commit d423af38a1bec45c3316c89d0d7a55e201792a50
Author: Emilien Macchi <email address hidden>
Date: Fri Jul 3 10:57:57 2020 -0400

    undercloud/heat: set YAQL memory quota to 200000

    Since the "optimization" [1] of host entries in Heat and its YAQLization,
    we need to increase the memory quota for YAQL queries or the resource
    will fail to process at large scale (250 nodes).

    [1] 3b8e6f78e19e776c087dc5c3ff225703b5c487bc

    Change-Id: I04cb72210fbd25a720158988698a300140f4e7db
    Closes-Bug: #1886203
    (cherry picked from commit c67c53a8d8094382bc75e7d56480ecb3f8897610)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/739632
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=39c977afb6a3292c57914658453795c9eab51271
Submitter: Zuul
Branch: stable/train

commit 39c977afb6a3292c57914658453795c9eab51271
Author: Emilien Macchi <email address hidden>
Date: Fri Jul 3 10:57:57 2020 -0400

    undercloud/heat: set YAQL memory quota to 200000

    Since the "optimization" [1] of host entries in Heat and its YAQLization,
    we need to increase the memory quota for YAQL queries or the resource
    will fail to process at large scale (250 nodes).

    [1] 3b8e6f78e19e776c087dc5c3ff225703b5c487bc

    Change-Id: I04cb72210fbd25a720158988698a300140f4e7db
    Closes-Bug: #1886203
    (cherry picked from commit c67c53a8d8094382bc75e7d56480ecb3f8897610)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/739661
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d573f4e8787c99982f6172aae74042f05a750d56
Submitter: Zuul
Branch: master

commit d573f4e8787c99982f6172aae74042f05a750d56
Author: Rabi Mishra <email address hidden>
Date: Tue Jul 7 10:56:25 2020 +0530

    Simplify host entries generation

    This removes a resource and the unnecessary yaql function.
    Also replaces json data types with lists to reduce memory
    footprint.

    Change-Id: I04a6114ca3d2703ca2891d6807d49b78ffee0f97
    Related-Bug: #1886203

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/741411

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/741586

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/741586
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c33f910054fe02d8b9479e74b967428852436581
Submitter: Zuul
Branch: stable/train

commit c33f910054fe02d8b9479e74b967428852436581
Author: Rabi Mishra <email address hidden>
Date: Tue Jul 7 10:56:25 2020 +0530

    Simplify host entries generation

    This removes a resource and the unnecessary yaql function.
    Also replaces json data types with lists to reduce memory
    footprint.

    Change-Id: I04a6114ca3d2703ca2891d6807d49b78ffee0f97
    Related-Bug: #1886203
    (cherry picked from commit d573f4e8787c99982f6172aae74042f05a750d56)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/741411
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=eefa55e340a9f015ecdaa809f460defba428f28d
Submitter: Zuul
Branch: stable/ussuri

commit eefa55e340a9f015ecdaa809f460defba428f28d
Author: Rabi Mishra <email address hidden>
Date: Tue Jul 7 10:56:25 2020 +0530

    Simplify host entries generation

    This removes a resource and the unnecessary yaql function.
    Also replaces json data types with lists to reduce memory
    footprint.

    Change-Id: I04a6114ca3d2703ca2891d6807d49b78ffee0f97
    Related-Bug: #1886203
    (cherry picked from commit d573f4e8787c99982f6172aae74042f05a750d56)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
