Heat operation failed after controller failover

Bug #1465840 reported by Tatyanka on 2015-06-16
This bug affects 3 people
Affects: Fuel for OpenStack
Assigned to: Bogdan Dobrelya, Timur Nurlygayanov

Bug Description

Deleting a stack failed with the following error:
http://paste.openstack.org/show/296169/; at the same time other services work fine (e.g. nova handles "nova delete <instance_uuid>" without problems).

haproxy status for heat:
    # pxname            svname   stot ereq econ eresp status chkfail chkdown downtime iid
    heat-api            FRONTEND 0 0 OPEN 16
    heat-api            node-76  0 0 0 UP 0 0 0 16
    heat-api            node-77  0 0 0 UP 0 0 0 16
    heat-api            node-79  0 0 0 UP 0 0 0 16
    heat-api            BACKEND  0 0 0 UP 0 0 16
    heat-api-cfn        FRONTEND 0 0 OPEN 17
    heat-api-cfn        node-76  0 0 0 UP 0 0 0 17
    heat-api-cfn        node-77  0 0 0 UP 0 0 0 17
    heat-api-cfn        node-79  0 0 0 UP 0 0 0 17
    heat-api-cfn        BACKEND  0 0 0 UP 0 0 17
    heat-api-cloudwatch FRONTEND 0 0 OPEN 18
    heat-api-cloudwatch node-76  0 0 0 UP 0 0 0 18
    heat-api-cloudwatch node-77  0 0 0 UP 0 0 0 18
    heat-api-cloudwatch node-79  0 0 0 UP 0 0 0 18
    heat-api-cloudwatch BACKEND  0 0 0 UP 0 0 18

crm status


Steps To Reproduce:
OS: CentOS
HA with Neutron GRE:
1 controller + 2 controllers with mongo
1 mongo
1 cinder
2 computes
Ceilometer enabled

1. Deploy the cluster and wait until it is ready
2. Navigate to the Fuel health tab and run all OSTF tests: ha, smoke, sanity, and platform tests pass (the configuration tests may fail if you do not change the default SSH credentials on the master node and the user credentials for the OpenStack cluster)
3. Shut down the primary controller
4. Wait 5 minutes, then run the OSTF HA suite (it passes; if not, you may need to wait a little bit longer and run it again)
5. Run the smoke and sanity OSTF tests; they pass
6. Run the platform tests. Actual result: all Heat tests fail (update/create/delete stack)
with a 504 error
7. Turn the controller back on and repeat steps 4-6; the result is the same, all Heat tests fail

SSH to each controller and try to delete the stack; it fails with the error listed above.

{"build_id": "2015-06-08_06-13-27", "build_number": "521", "release_versions": {"2014.2.2-6.1": {"VERSION": {"build_id": "2015-06-08_06-13-27", "build_number": "521", "api": "1.0", "fuel-library_sha": "f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d", "nailgun_sha": "4340d55c19029394cd5610b0e0f56d6cb8cb661b", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "7766818f079881e2dbeedb34e1f67e517ed7d479", "fuel-ostf_sha": "7c938648a246e0311d05e2372ff43ef1eb2e2761", "release": "6.1", "fuelmain_sha": "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d", "nailgun_sha": "4340d55c19029394cd5610b0e0f56d6cb8cb661b", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "7766818f079881e2dbeedb34e1f67e517ed7d479", "fuel-ostf_sha": "7c938648a246e0311d05e2372ff43ef1eb2e2761", "release": "6.1", "fuelmain_sha": "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"}

Second env:
3 controllers with Ceph + 1 compute + 2 mongo + 2 Ceph
with nova Flat
1. Deploy the cluster
2. When the cluster is ready, run the OSTF ha, smoke, sanity, and platform tests; they pass
3. Shut down a non-primary controller
4. Wait about 5-7 minutes and run the OSTF ha suite; it passes
5. Run sanity/smoke; they pass
6. Run the platform tests; the Heat tests fail with a 504 error
7. SSH to the online controllers and try to create and delete a stack; it fails with a 504 error

Info: The snapshot will be added later, since it is too big and uploads to Google Drive too slowly.

Info: The issue does not reproduce every time; so far 2 out of 5 attempts.

description: updated
description: updated
Changed in fuel:
assignee: MOS Heat (mos-heat) → Sergey Kraynev (skraynev)
Sergey Kraynev (skraynev) wrote :

Attached snapshot does not contain heat related logs. Will sync with Tanya to get more info.

Changed in fuel:
status: New → Incomplete
Tatyanka (tatyana-leontovich) wrote :

Yep, you are right; for some reason the remote directories for the nodes are empty :( Uploading a snapshot from the second env.

Changed in fuel:
status: Incomplete → Confirmed
Sergey Kraynev (skraynev) wrote :

According to the last snapshot, all Heat errors are related to RabbitMQ issues (heat-api could not connect to RabbitMQ and find its queue).

The OSTF logs also contain errors from other (non-Heat) tests [1].

Given the issues mentioned above, it is not really possible to find the root cause of the errors, and it looks like other services (not only Heat) are broken too.

After a discussion with tatyana-leontovich, we decided to deploy a separate environment for reproducing this issue. It should make it possible to get clean logs that show the root problem.

Note: Deployment will be ready tomorrow.

[1] http://paste.openstack.org/show/298982/

Changed in fuel:
assignee: Sergey Kraynev (skraynev) → Timur Nurlygayanov (tnurlygayanov)
importance: High → Critical
importance: Critical → High

We investigated the issue, and the root cause looks like the following: ha_queues is disabled in heat.conf, so after the disaster scenario Heat loses one queue and cannot work until it is restarted. We need to try to reproduce the issue with this fix applied, and if it helps, change this parameter in the Heat configuration file.

To fix the issue, change "rabbit_ha_queues=False" to "rabbit_ha_queues=True" in the Heat config file.
Then restart the Heat services on all controllers:

root@node-1:~# service heat-api restart
root@node-1:~# service heat-engine restart
root@node-1:~# service heat-api-cfn restart
root@node-1:~# service heat-api-cloudwatch restart
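
The steps above can be sketched as a small shell snippet. This is a hypothetical illustration, not the committed fix: it assumes the option is written as "rabbit_ha_queues=False" in /etc/heat/heat.conf (this 2014.2-based release keeps it in the [DEFAULT] section), and the demo file path is made up.

```shell
# Hypothetical sketch: the one-line edit the workaround amounts to.
# Given a heat.conf with HA queues disabled, rewrite it so mirrored
# (HA) queues are requested.
fix_heat_conf() {
  # Match both "False" and "false", with optional spaces around "=".
  sed 's/^rabbit_ha_queues *= *[Ff]alse/rabbit_ha_queues=True/' "$1"
}

# Demo on a throwaway copy of the relevant lines:
printf '[DEFAULT]\nrabbit_ha_queues=False\n' > /tmp/heat.conf.sample
fix_heat_conf /tmp/heat.conf.sample
```

On each controller the same sed would be run with -i against /etc/heat/heat.conf, followed by the four service restarts listed above so the new setting takes effect.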

Tried to reproduce the issue with:

1. Find the "primary controller":
root@node-2:~# ifconfig | grep br-ex-hapr
br-ex-hapr Link encap:Ethernet HWaddr 06:7a:c8:57:99:3b
2. Shut down the primary controller:
shutdown now -P
3. Wait 10 minutes
4. Run the Heat OSTF tests

Observed Result:
The issue did not reproduce with "rabbit_ha_queues=True" in /etc/heat/heat.conf, so this is the way to fix the issue; I will commit the fix soon.

Changed in fuel:
status: Confirmed → In Progress
Bogdan Dobrelya (bogdando) wrote :

High-importance bugs cannot be addressed while HCF is in progress.

Bogdan Dobrelya (bogdando) wrote :

This bug badly impacts HA failover of Heat services; raising to Critical.

Bogdan Dobrelya (bogdando) wrote :

It seems that this inconsistency affects only the 1-controller-node case, which is not HA, and there it does not matter which value of rabbit_ha_queues is configured for the services.

I checked a 1-node deploy:
/etc/hiera/globals.yaml: rabbit_ha_queues: true

Yes, for Heat this seems inconsistent, but it can be ignored for the 1-controller-node case, AFAIK.

Bogdan Dobrelya (bogdando) wrote :

I'm testing the case with 2 controllers and will provide results for the Heat HA queues configuration.

Bogdan Dobrelya (bogdando) wrote :

Yes, the bug looks valid.
For rabbit_hosts=,
heat.conf still contains rabbit_ha_queues=False,
which is a critical issue.

Bogdan Dobrelya (bogdando) wrote :

This bug is not in the heat module. It is about how we pass AMQP hosts to the heat class.

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/193513

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/193513
Reason: master first

Change abandoned by Timur Nurlygayanov (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/193488
Reason: We found the root of this issue, we fixed it with another commit: https://review.openstack.org/#/c/193511/

Change abandoned by Timur Nurlygayanov (<email address hidden>) on branch: master
Review: https://review.openstack.org/193483
Reason: We found the root of this issue, we fixed it with another commit: https://review.openstack.org/#/c/193511/

tags: added: release-notes

Release Notes:
The default Heat configuration for HA environments is not correct; you need to set "rabbit_ha_queues=True" in /etc/heat/heat.conf to avoid Heat failures in destructive scenarios (e.g. when some controllers are powered off). After the Heat configuration files have been changed on all controllers, manually restart all Heat services on every OpenStack controller:

:~# service heat-api restart
:~# service heat-engine restart
:~# service heat-api-cfn restart
:~# service heat-api-cloudwatch restart
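
For reference, the resulting heat.conf fragment would look like this. The section name is an assumption: on this 2014.2-based release the RabbitMQ options sit in [DEFAULT], while later releases moved them to an [oslo_messaging_rabbit] section.

```ini
[DEFAULT]
# Ask oslo.messaging to declare mirrored (HA) RabbitMQ queues so that
# Heat's RPC queues survive the loss of a RabbitMQ cluster node.
rabbit_ha_queues=True
```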

tags: added: done
Changed in fuel:
status: In Progress → Fix Committed

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/193513

Here is the correct fix for MOS 6.1:

Maksym Strukov (unbelll) on 2015-07-08
tags: added: on-verification
Maksym Strukov (unbelll) wrote :

Can't reproduce on 7.0-26 with either scenario.

{"build_id": "2015-07-06_18-08-24", "build_number": "26", "release_versions": {"2014.2.2-7.0": {"VERSION": {"build_id": "2015-07-06_18-08-24", "build_number": "26", "api": "1.0", "fuel-library_sha": "251c54e8de2f41aacd260751e7a891e9fbffc45d", "nailgun_sha": "d040c5cebc9cdd24ef20cb7ecf0a337039baddec", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "251c54e8de2f41aacd260751e7a891e9fbffc45d", "nailgun_sha": "d040c5cebc9cdd24ef20cb7ecf0a337039baddec", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "315d8bf991fbe7e2ab91abfc1f59b2f24fd92f45", "astute_sha": "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8", "fuel-ostf_sha": "a752c857deafd2629baf646b1b3188f02ff38084", "release": "7.0", "fuelmain_sha": "4f2dff3bdc327858fa45bcc2853cfbceae68a40c"}

tags: removed: on-verification
tags: added: on-verification
Tatyanka (tatyana-leontovich) wrote :

{"build_id": "286", "build_number": "286", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "286", "build_number": "286", "api": "1.0", "fuel-library_sha": "ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589", "nailgun_sha": "5c33995a2e6d9b1b8cdddfa2630689da5084506f", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "8283dc2932c24caab852ae9de15f94605cc350c6", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "9ab01caf960013dc882825dc9b0e11ccf0b81cb0"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "ff63a0bbc93a3a0fb78215c2fd0c77add8dfe589", "nailgun_sha": "5c33995a2e6d9b1b8cdddfa2630689da5084506f", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "8283dc2932c24caab852ae9de15f94605cc350c6", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "9ab01caf960013dc882825dc9b0e11ccf0b81cb0"}

tags: removed: on-verification

Reviewed: https://review.openstack.org/223455
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=ed673f381afea23ac9becf0adfb770ea7129cb4a
Submitter: Jenkins
Branch: master

commit ed673f381afea23ac9becf0adfb770ea7129cb4a
Author: evkonstantinov <email address hidden>
Date: Tue Sep 15 10:25:16 2015 +0300

    Add Heat HA false issue to relnotes

    Change-Id: Ifbacce627862b25142e2bae945ee8c434900360b

tags: added: non-release

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/199447

