Heat stack delete hangs on some stacks with networking resources

Bug #1441726 reported by Georgy Okrokvertskhov
Affects               Status    Importance  Assigned to   Milestone
Mirantis OpenStack    Invalid   Critical    MOS Murano
  6.0.x               Invalid   Critical    MOS Murano

Bug Description

Heat frequently hangs on stack delete operations. The Heat templates are created by Murano.

A stack example is attached.

Tags: heat murano
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

Heat template with hanging delete

Changed in mos:
importance: Undecided → Critical
milestone: none → 6.1
Georgy Okrokvertskhov (gokrokvertskhov) wrote :

Event list from the Heat UI

tags: added: heat
tags: added: murano
Changed in mos:
assignee: nobody → MOS Heat (mos-heat)
status: New → Confirmed
Pavlo Shchelokovskyy (pshchelo) wrote :

First, debugging with machine-generated templates is a PITA, so I prettified it a bit, removing all the muranoagent-related user-data and using parameters to make some properties easier to set - see the attached template. I intentionally left that weird resource name in place (it looks like there is a problem in Murano where it has not parsed that line).

Could you please verify that, using your settings for image, flavor, public net and its router, the bug still manifests when launching the attached template via Heat?

More about possible causes: judging from the event log, Nova could not fulfill the request to create the servers ("No valid host was found"). By default Heat tries 5 times to re-create failed resources (which is why we see them deleted first), but here I counted 12 attempts for one particular server. Could you check the value of `action_retry_limit` in /etc/heat/heat.conf on your environment?
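
For reference, a minimal sketch of that check with python-heatclient, counting CREATE_FAILED events per resource from the stack's event list; the endpoint, token and stack id below are placeholders:

    # Sketch: count how many times Heat tried (and failed) to create each
    # resource, judging by CREATE_FAILED events. All identifiers below are
    # placeholders for your environment.
    from collections import Counter

    from heatclient.client import Client

    heat = Client('1', endpoint='http://<heat-api>:8004/v1/<tenant-id>',
                  token='<keystone-token>')

    failures = Counter()
    for event in heat.events.list('<stack-id>'):
        if event.resource_status == 'CREATE_FAILED':
            failures[event.resource_name] += 1

    # With the default action_retry_limit = 5 in /etc/heat/heat.conf, no
    # resource should fail to create more times than that.
    for name, count in failures.items():
        print(name, count)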

I don't have a MOS 6.1 env at hand, but I quickly checked the attached template on DevStack master (overloading the flavor with m1.xlarge): creation simply failed after a number of re-create attempts and the stack was successfully deleted afterwards. It would be nice, if possible, to get access to your env where the bug was caught.

Changed in mos:
assignee: MOS Heat (mos-heat) → Pavlo Shchelokovskyy (pshchelo)
ruhe (ruhe)
summary: - Heat sack delete hangs on some stacks with networking resources
+ Heat stack delete hangs on some stacks with networking resources
ruhe (ruhe) wrote :

Pavlo, could you please update the bug description once you figure out all the details? It'll help us and our users to properly identify this bug in the future.

Pavlo Shchelokovskyy (pshchelo) wrote :

I used the following env to test:

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2-6.1"
  api: "1.0"
  build_number: "296"
  build_id: "2015-04-09_09-43-04"
  nailgun_sha: "b752e6b0bb240d0048a04919d7ccbe51513a4562"
  python-fuelclient_sha: "5c94b59bafc8dc1cbecb088020f4ef14ce62044a"
  astute_sha: "5041b2fb508e6860c3cb96474ca31ec97e549e8b"
  fuellib_sha: "b43d7665d4448abbc0bb5b4a30cdbb1592f1a2b1"
  ostf_sha: "c3b06dba5c96d225882e9f1a465f74eaa6374fbf"
  fuelmain_sha: "2ca546b86e651d5638dbb1be9bae44b86c84a893"

one compute, one controller, Neutron+GRE

I launched the prettified template with the TestVM image and flavor m1.xlarge, which was surely over the limit for the compute. Stack creation failed as expected, with the same expected error messages in the event list (ResourceInError: Went to status ERROR due to "Message: No valid host was found. , Code: 500"), and there were exactly 5 retries to create each of the 3 servers specified in the template. After that I successfully deleted the stack.
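
For the record, the check above boils down to roughly the following python-heatclient sketch; the credentials and template path are placeholders, and the parameter names 'image' and 'flavor' are assumptions about the prettified template:

    # Sketch of the reproduction: create the stack with an oversized flavor,
    # wait for creation to fail, then delete the stack.
    import time

    from heatclient.client import Client

    heat = Client('1', endpoint='http://<heat-api>:8004/v1/<tenant-id>',
                  token='<keystone-token>')

    with open('prettified-template.yaml') as f:
        template = f.read()

    created = heat.stacks.create(stack_name='bug1441726-repro',
                                 template=template,
                                 parameters={'image': 'TestVM',
                                             'flavor': 'm1.xlarge'},
                                 disable_rollback=True)
    stack_id = created['stack']['id']

    # Each server is re-created up to action_retry_limit times first,
    # so reaching CREATE_FAILED takes a while.
    while heat.stacks.get(stack_id).stack_status == 'CREATE_IN_PROGRESS':
        time.sleep(10)

    # On a healthy environment this delete completes instead of hanging.
    heat.stacks.delete(stack_id)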

Please provide the info I asked for in the previous comment, and ideally provide access to the env where you are encountering this bug.

Changed in mos:
status: Confirmed → Incomplete
Changed in mos:
status: Incomplete → New
Pavlo Shchelokovskyy (pshchelo) wrote :

The environment was provided, MOS 6.0. We've also seen such a failure on MOS 6.1 in a similar scenario: a stack with too-big VMs created by Murano.

We still cannot reproduce this bug using Heat alone. We need to see the complete logs of the calls Murano makes to Heat that lead to such failures.

Can someone from the Murano team provide such logs? They should expose the API calls and all the parameters used (templates at least) so we can analyze them and try to reproduce the issue manually using Heat only.

Changed in mos:
status: New → Incomplete
Pavlo Shchelokovskyy (pshchelo) wrote :

I have obtained Murano logs from the environment and grepped for "Pushing.*<env-id-from-the-hanging-template>"; the results are attached (if there is a need, I can also attach the full log). I believe those must be the arguments used for the heat.stacks.create (the first one) and heat.stacks.update (the others) calls - please correct me if I'm wrong. Next we need to convert this info into real templates, prettify/simplify them, and run those calls against Heat manually.
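
Roughly, the post-processing of that grep looks like the sketch below; it assumes each matching line ends with the JSON body Murano pushes to Heat, so the pattern may need adjusting to the real log format:

    # Sketch: extract the pushed template payloads from the grepped Murano
    # log lines. The trailing-JSON layout is an assumption about the logs.
    import json
    import re

    pushing = re.compile(r'Pushing.*?(\{.*\})\s*$')

    payloads = []
    with open('murano-api.log') as log:
        for line in log:
            match = pushing.search(line)
            if match:
                payloads.append(json.loads(match.group(1)))

    # payloads[0] should correspond to the heat.stacks.create call,
    # payloads[1:] to the subsequent heat.stacks.update calls.
    print(len(payloads))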

Changed in mos:
status: Incomplete → In Progress
ruhe (ruhe)
Changed in mos:
assignee: Pavlo Shchelokovskyy (pshchelo) → Kairat Kushaev (kkushaev)
Kairat Kushaev (kkushaev) wrote :

So guys,
after a deep analysis I realized the following:
1) Murano splits the request for stack-create into several incremental requests (as mentioned by Pavlo).
2) The first request is a stack-create. The subsequent requests are stack-updates.
3) When one of the stack-updates fails, Murano keeps applying the stack-updates for the further requests (independently of the results of the previous update requests), as sketched below.
Heat has an update-recovery feature that allows a user to update a failed stack, but it seems that this feature is not working correctly in the case above.
I have almost pinned down the root cause. The only thing that confuses me is that we have nothing in the logs: I managed to reproduce the case with some messages in the logs, but in the Murano case the logs are clean. I will come back with the final conclusion soon.
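
A minimal sketch of the call sequence from points 1)-3), replayed manually with python-heatclient (the credentials and the step templates are placeholders):

    # Sketch of the Murano behavior described above: one stack-create
    # followed by stack-updates that are pushed regardless of whether the
    # previous update succeeded.
    import time

    from heatclient.client import Client

    heat = Client('1', endpoint='http://<heat-api>:8004/v1/<tenant-id>',
                  token='<keystone-token>')

    def wait_for(stack_id):
        while heat.stacks.get(stack_id).stack_status.endswith('IN_PROGRESS'):
            time.sleep(10)
        return heat.stacks.get(stack_id).stack_status

    created = heat.stacks.create(stack_name='murano-like-env',
                                 template=open('step1.yaml').read())
    stack_id = created['stack']['id']
    print(wait_for(stack_id))  # CREATE_COMPLETE (or CREATE_FAILED)

    # The next increments are pushed even if the previous update failed;
    # this is the sequence after which the stack delete hangs.
    for step in ('step2.yaml', 'step3.yaml'):
        heat.stacks.update(stack_id, template=open(step).read())
        print(wait_for(stack_id))  # possibly UPDATE_FAILED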

Kairat Kushaev (kkushaev) wrote :

Colleagues,
as discussed with Sergey K and Serg M, I created an issue for Heat here:
https://bugs.launchpad.net/mos/+bug/1446678
so we are going to fix the Heat-related part there.
In the current patch we propose to fix the Murano behavior (do not launch stack-update again if the first update failed).
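
For illustration only, the proposed guard could look like the sketch below; this is not the actual Murano patch, and the helper is hypothetical:

    # Hypothetical sketch of the proposed Murano-side guard: stop pushing
    # further stack-updates once one of them has failed.
    def push_updates(heat, stack_id, pending_templates):
        for template in pending_templates:
            if heat.stacks.get(stack_id).stack_status == 'UPDATE_FAILED':
                # Do not launch stack-update again if a previous one failed.
                break
            heat.stacks.update(stack_id, template=template)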

Changed in mos:
assignee: Kairat Kushaev (kkushaev) → MOS Murano (mos-murano)
Serg Melikyan (smelikyan) wrote :

Kairat,

>In the current patch we propose to fix Murano behavior (do not launch stack-update again if the first update failed)
We actually try to avoid doing that, as it would break any application that manually changes the Heat stack.

Serg Melikyan (smelikyan) wrote :
Changed in mos:
status: In Progress → Invalid