Fuel for OpenStack

Rabbit app failed to start and join cluster at the second controller node but cannot be noticed by OCF logic

Bug #1455761 reported by Nastya Urlapova on 2015-05-16

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Bogdan Dobrelya	Fuel for OpenStack 6.1
5.1.x	Won't Fix	High	Denis Meltsaykin	Fuel for OpenStack 5.1.1-updates
6.0.x	Won't Fix	High	Denis Meltsaykin	Fuel for OpenStack 6.0-updates

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "421"
  build_id: "2015-05-15_20-55-26"
  nailgun_sha: "eca3532abfcc15dc6c55f682dd3f037235c4e858"
  python-fuelclient_sha: "38765563e1a7f14f45201fd47cf507393ff5d673"
  astute_sha: "7e3e81f2e3d4557d5d1fd61a424df95c4d265601"
  fuel-library_sha: "1645fe45f226cdd6d2829bea9912d0baa3be5033"
  fuel-ostf_sha: "9ce1800749081780b8b2a4a7eab6586583ffaf33"
  fuelmain_sha: "d249d74f9beb5935c31b8ee674eb1ed696672f6e"

Deploy cluster in HA mode with bonding (active backup)
        Scenario:
            1. Create cluster
            2. Add 3 nodes with controller role
            3. Add 2 nodes with compute role
            4. Set up bonding for all interfaces
            5. Deploy the cluster
            6. Run network verification
            7. Run OSTF

Deployment failed with the error:
(/Stage[main]/Rabbitmq::Install::Rabbitmqadmin/Staging::File[rabbitmqadmin]/Exec[/var/lib/rabbitmq/rabbitmqadmin]/returns) change from notrun to 0 failed: curl -k --noproxy localhost --retry 30 --retry-delay 6 -f -L -o /var/lib/rabbitmq/rabbitmqadmin http://nova:XT9PMcfX@localhost:15672/cli/rabbitmqadmin returned 7 instead of one of [0]

because the rabbit@node-2 app had never started and never tried to join the elected master (rabbit@node-1), see http://paste.openstack.org/show/lgkWy7A1EcFhCLoH6vdw/

Normally, when both beam.smp and rabbit app have started, there should be two log records:
1) "checking if rabbit app is running"
2) "rabbit app is running. checking if we are the part of healthy cluster"
But the logs show that the second record is missing, hence the rabbit app was not
started. get_monitor() was not able to detect this and reported OK.

The test case that reproduces this issue after some number of iterations
is described here: https://bugs.launchpad.net/fuel/+bug/1458830

Nastya Urlapova (aurlapova) wrote on 2015-05-16:

fail_error_deploy_bonding_ha_active_backup-2015_05_16__02_41_50.tar.xz (62.7 MiB, application/octet-stream)

summary:

- Deployment with bonds failed on second controller
+ Deployment with active backup bonds failed on second controller

Nastya Urlapova (aurlapova) wrote on 2015-05-16: Re: Deployment with active backup bonds failed on second controller

http://jenkins-product.srt.mirantis.net:8080/view/6.1_swarm/job/6.1.system_test.centos.bonding_ha/125/testReport/(root)/deploy_bonding_ha_active_backup/deploy_bonding_ha_active_backup/

Stanislaw Bogatkin (sbogatkin) on 2015-05-16

Changed in fuel:
status:	New → Confirmed

Oleksiy Molchanov (omolchanov) on 2015-05-16

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)
status:	Confirmed → In Progress

Bogdan Dobrelya (bogdando) on 2015-05-18

tags:

added: l23network

Bogdan Dobrelya (bogdando) wrote on 2015-05-18: Re: Deployment with active backup bonds failed on second controller as rabbit failed to start and node-2

But the main issue appears to be that rabbit@node-2 refused to start at 2015-05-16T01:57:17, and the OCF logic was not able to detect and handle the failure appropriately.

summary:

- Deployment with active backup bonds failed on second controller
+ Deployment with active backup bonds failed on second controller as
+ rabbit failed to start and node-2
tags:	added: ha rabbitmq removed: l23network
Changed in fuel:
assignee:	Oleksiy Molchanov (omolchanov) → Bogdan Dobrelya (bogdando)
status:	In Progress → Confirmed

Bogdan Dobrelya (bogdando) on 2015-05-18

description:

updated

Bogdan Dobrelya (bogdando) on 2015-05-18

description:

updated

OpenStack Infra (hudson-openstack) wrote on 2015-05-18: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184070

Changed in fuel:
status:	Confirmed → In Progress

Bogdan Dobrelya (bogdando) on 2015-05-18

summary:

- Deployment with active backup bonds failed on second controller as
- rabbit failed to start and node-2
+ Rabbit failed to start and join cluster at the second controller node

Bogdan Dobrelya (bogdando) on 2015-05-18

summary:

- Rabbit failed to start and join cluster at the second controller node
+ Rabbit app failed to start and join cluster at the second controller
+ node but cannot be noticed by OCF logic

Bogdan Dobrelya (bogdando) on 2015-05-18

description:	updated
description:	updated

OpenStack Infra (hudson-openstack) wrote on 2015-05-18: Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184070
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=454a1485453919756b8a4e3083fb71b0928158ec
Submitter: Jenkins
Branch: master

commit 454a1485453919756b8a4e3083fb71b0928158ec
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 18 16:00:37 2015 +0200

Fix get_monitor function and local vars for rabbit OCF

    W/o this fix:
    * get_monitor() does not handle the case then rabbit
    app is running and reports $OCF_RUNNING. This makes
    the resource to stuck in semi-started state for ever.
    * get_monitor() does not always record in logs when
    sets master score 0 to ensure the slave does not get
    promoted. This makes logging patterns not consistent.
    * get_monitor() has two exit point for generic error case
    instead of one. This makes logging patterns not consistent.
    * some variables are missing the local
    declaration and might be reused in another places
    leading to unpredictable results. For example,
    there is the nodelist var should be local as it is
    used both action_notify(), get_monitor().

    The solution is:
    * fix missing local declarations for variables and
    remove unused ones as well.
    * add missing log record for "ensuring this slave does
    not get promoted."
    * handle the get_monitor() case when rabbit app is not
    running as appropriate and report $OCF_NOT_RUNNING.
    * make the single get_monitor() return point for the
    $OCF_ERR_GENERIC case.

Closes-bug: #1455761

Change-Id: Iba5e9d984083acea1392cad0abd6453f5d6fbf8b
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) wrote on 2015-05-19:

#10

How to reproduce:
root@node-8:~# rabbitmqctl stop_app
Stopping node 'rabbit@node-8' ...
...done.
root@node-8:~# ocf_handler_rabbitmq-server monitor
...
Exit status: -e Success (0)

How to test:
With the fix applied, the command above should report Not running instead of Success
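
A minimal sketch of the decision this test expects from the monitor action, assuming the two checks (is beam.smp alive, is the rabbit app running) are already available as yes/no values. `monitor_rc` is a hypothetical helper and not a function of the real OCF script; 0 and 7 are the standard OCF_SUCCESS and OCF_NOT_RUNNING return codes.

```shell
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# monitor_rc BEAM_UP APP_UP  (hypothetical helper; each argument is "yes" or "no")
# Prints the return code the monitor action would report.
monitor_rc() (
    beam_up=$1
    app_up=$2
    if [ "$beam_up" != "yes" ]; then
        echo "$OCF_NOT_RUNNING"      # no Erlang VM at all
    elif [ "$app_up" != "yes" ]; then
        echo "$OCF_NOT_RUNNING"      # beam.smp is up but the rabbit app is stopped
    else
        echo "$OCF_SUCCESS"
    fi
)

monitor_rc yes no    # the stop_app case from the reproduction steps
```

The second branch is the one the reproduction steps exercise: after `rabbitmqctl stop_app` the Erlang VM is still alive, so a check that only looks at the process would keep reporting success.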

Bogdan Dobrelya (bogdando) wrote on 2015-05-19:

#11

Master logic was not fixed (cannot pass the test), reopening

Changed in fuel:
status:	Fix Committed → In Progress

Bogdan Dobrelya (bogdando) wrote on 2015-05-19:

#13

The issue is that, by design, the start action starts the beam process and stops the rabbit app unless the master has been promoted or the slave has joined the cluster.
This means that in order to fix this issue, the start action must be redesigned as follows:
1) Do not remove the iptables block rule on exit from the start() action.
2) Leave the rabbit app started on exit from the start() action.
3) Remove the iptables block rule either on the post-promote notify, when the master is elected and ready to join other nodes, or on the post-start notify, when the slave is ready to join the cluster.

Otherwise, this bug cannot be fixed. We have to revert https://review.openstack.org/184070 as it introduced a regression in the monitor action.

Changed in fuel:
importance:	High → Critical

OpenStack Infra (hudson-openstack) wrote on 2015-05-19: Related fix proposed to fuel-library (master)

#14

Related fix proposed to branch: master
Review: https://review.openstack.org/184239

Bogdan Dobrelya (bogdando) wrote on 2015-05-19:

#15

Raised to Critical because a revert of the regression is required

description:

updated

Bogdan Dobrelya (bogdando) on 2015-05-19

description:

updated

OpenStack Infra (hudson-openstack) wrote on 2015-05-19: Related fix merged to fuel-library (master)

#16

Reviewed: https://review.openstack.org/184239
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=5cc384073df35300501b161f05cf45f816d7703d
Submitter: Jenkins
Branch: master

commit 5cc384073df35300501b161f05cf45f816d7703d
Author: Bogdan Dobrelya <email address hidden>
Date: Tue May 19 14:51:20 2015 +0200

Revert "Merge "Fix get_monitor function and local vars for rabbit OCF""

This reverts commit 0562e2ae27c1d6b2e027e63c5af4e1153f44224b, reversing
changes made to 5fd379f0cd8106808ab9fc098ca1094c7d91041a.

    The issue is that by design, the start action will start beam process
    and stop the rabbit app, unless master promoted or slave joined cluster.
    We have to revert aforementioned commit as it introduced a regression
    to the action monitor, which is to report "Not running", when rabbit app
    is stopped.

Related-bug: #1455761
Change-Id: I686c620092909351c910598aeef0dcc73d2da080

Bogdan Dobrelya (bogdando) on 2015-05-19

Changed in fuel:
importance:	Critical → High
status:	In Progress → Won't Fix

Bogdan Dobrelya (bogdando) wrote on 2015-05-20:

#17

As discussed with Vladimir Kuklin, the solution should be either to:
* redesign the OCF start action so that it does not stop the rabbit app on exit (cannot be done for 6.1, as it is too risky)
* or modify the post-start notify logic so that the cluster join is processed both by nodes being started AND by already running ones

Bogdan Dobrelya (bogdando) on 2015-05-20

Changed in fuel:
status:	Won't Fix → Confirmed

OpenStack Infra (hudson-openstack) on 2015-05-20

Changed in fuel:
assignee:	Bogdan Dobrelya (bogdando) → Bartlomiej Piotrowski (bpiotrowski)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-05-21: Fix proposed to fuel-library (master)

#19

Fix proposed to branch: master
Review: https://review.openstack.org/184671

Changed in fuel:
assignee:	Bartlomiej Piotrowski (bpiotrowski) → Vladimir Kuklin (vkuklin)

OpenStack Infra (hudson-openstack) wrote on 2015-05-21: Related fix proposed to fuel-library (master)

#20

Related fix proposed to branch: master
Review: https://review.openstack.org/184674

Bogdan Dobrelya (bogdando) wrote on 2015-05-21:

#21

Note: to test the patch https://review.openstack.org/184674 on a deployed environment, run
rabbitmqctl eval 'application:get_env(rabbit, mnesia_table_loading_timeout).'
It should return the value 10000.
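
A small sketch of how this check could be scripted. It assumes that `application:get_env/2` reports a set value as the Erlang term `{ok,N}` and that `rabbitmqctl eval` prints that term on a line of its own; `timeout_from_eval` is a hypothetical helper, and the sample output shape is modelled on the other sessions in this report rather than captured from a live node.

```shell
# timeout_from_eval OUTPUT  (hypothetical helper)
# Extracts N from a line of the form {ok,N}; prints nothing if no such line exists.
timeout_from_eval() (
    printf '%s\n' "$1" | sed -n 's/^{ok,\([0-9][0-9]*\)}$/\1/p'
)

# Assumed output shape, not a captured log.
sample='{ok,10000}
...done.'

[ "$(timeout_from_eval "$sample")" = "10000" ] && echo "timeout is 10 seconds"
```

On a deployed node the sample string would be replaced by the real command output, for example `timeout_from_eval "$(rabbitmqctl eval 'application:get_env(rabbit, mnesia_table_loading_timeout).')"`.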

OpenStack Infra (hudson-openstack) on 2015-05-21

Changed in fuel:
assignee:	Vladimir Kuklin (vkuklin) → Bartlomiej Piotrowski (bpiotrowski)

OpenStack Infra (hudson-openstack) wrote on 2015-05-21: Related fix merged to fuel-library (master)

#22

Reviewed: https://review.openstack.org/184674
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a2a146972dc63208280920ff1cf9321a6300171b
Submitter: Jenkins
Branch: master

commit a2a146972dc63208280920ff1cf9321a6300171b
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 21 05:04:57 2015 +0300

Set mnesia_table_loading_timeout to 10 seconds

    This commit sets mnesia_table_loading_timeout to
    10 seconds thus making rabbitmq cluster failover
    process faster. This option was initially suggested
    by Michael Klishin (RabbitMQ developer)

    Change-Id: I8ff6388cdd785404ea3659584b20b9e977a1c253
    Related-bug: #1455761
    Related-bug: #1432603

OpenStack Infra (hudson-openstack) on 2015-05-21

Changed in fuel:
assignee:	Bartlomiej Piotrowski (bpiotrowski) → Vladimir Kuklin (vkuklin)

OpenStack Infra (hudson-openstack) wrote on 2015-05-22: Fix merged to fuel-library (master)

#23

Reviewed: https://review.openstack.org/184671
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=db6ec616bf3b4dea7ee8e7583f12fad16e501d72
Submitter: Jenkins
Branch: master

commit db6ec616bf3b4dea7ee8e7583f12fad16e501d72
Author: Vladimir Kuklin <email address hidden>
Date: Thu May 21 03:50:24 2015 +0300

Check hostlist against starting and active resources

    This commit makes post-start notify action to check
    hostlist of nodes that should be joined to the cluster
    to contain not only nodes that will be started but
    also ones that are already started. This fixes
    the case when Pacemaker sends notifies only for
    the latest event and thus the node which is not
    included into the start list will not join the
    cluster. Also it checks whether the node is
    already clustered and skips the join if it
    is not needed.

Change-Id: Ibe8ecdcfe42c14228350b1eb3c9d08b1a64e117d
Closes-bug: #1455761
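
The join decision described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the review: `should_join` is a hypothetical helper, the node lists are passed in as plain space-separated strings in the way Pacemaker supplies them through OCF_RESKEY_CRM_meta_notify_start_uname and OCF_RESKEY_CRM_meta_notify_active_uname, and the "already clustered" check is reduced to a single argument.

```shell
# should_join THIS_NODE START_LIST ACTIVE_LIST CLUSTERED_WITH  (hypothetical helper)
# START_LIST / ACTIVE_LIST: space-separated node names from the notify event.
# CLUSTERED_WITH: the node this one is already clustered with, or empty.
# Exit status 0 means "join the cluster now".
should_join() (
    this=$1 start=$2 active=$3 clustered_with=$4
    [ -z "$clustered_with" ] || exit 1     # already clustered: skip the join
    for node in $start $active; do
        [ "$node" = "$this" ] && exit 0    # listed as starting or already active
    done
    exit 1
)

# node-2 is already active but absent from the start list of the latest event.
if should_join node-2 "node-3" "node-1 node-2" ""; then
    echo "node-2 should join"
fi
```

Checking only the start list would skip node-2 in this example, which is the missed-notification case the commit message describes.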

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) on 2015-05-22

no longer affects:

fuel/7.0.x

Bogdan Dobrelya (bogdando) on 2015-05-22

Changed in fuel:
status:	Fix Committed → In Progress
assignee:	Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)

Bogdan Dobrelya (bogdando) wrote on 2015-05-22:

#24

Folks, after we merged https://review.openstack.org/184671,
a situation is possible where the start action joins the node to the cluster (it becomes visible in db_nodes), but the monitor action does not notice that the node is absent from running_db_nodes!

root@node-1:~# rabbitmqctl eval 'mnesia:system_info(running_db_nodes).'
['rabbit@node-3','rabbit@node-2']
...done.
root@node-1:~# rabbitmqctl eval 'mnesia:system_info(db_nodes).'
['rabbit@node-2','rabbit@node-3','rabbit@node-1']
...done.

root@node-1:~# ocf_handler_rabbitmq-server monitor
lrmd: DEBUG: p_rabbitmq-server: monitor: action start.
lrmd: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
lrmd: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
lrmd: INFO: p_rabbitmq-server: get_monitor(): master attribute is 1
lrmd: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
lrmd: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
lrmd: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (0) with node-3.test.domain.local (787)
lrmd: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
lrmd: DEBUG: p_rabbitmq-server: monitor: role:
lrmd: DEBUG: p_rabbitmq-server: monitor: result: 0
lrmd: DEBUG: p_rabbitmq-server: monitor: action end.
Exit status: -e Success (0)
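
The mismatch between the two mnesia lists above can be spotted mechanically. The sketch below is only an illustration: `erl_list_to_words` and `not_running` are hypothetical helpers that work on the list strings as printed by `rabbitmqctl eval` in this session, and they are not part of the OCF script.

```shell
# erl_list_to_words "['rabbit@node-3','rabbit@node-2']"
# Turns an Erlang list of quoted atoms into space-separated words.
erl_list_to_words() (
    printf '%s\n' "$1" | tr -d "[]'" | tr ',' ' '
)

# not_running DB_NODES RUNNING_DB_NODES  (hypothetical helper)
# Prints every member of db_nodes that is absent from running_db_nodes.
not_running() (
    running=" $(erl_list_to_words "$2") "
    for node in $(erl_list_to_words "$1"); do
        case "$running" in
            *" $node "*) ;;            # node is running, nothing to report
            *) echo "$node" ;;
        esac
    done
)

not_running "['rabbit@node-2','rabbit@node-3','rabbit@node-1']" \
            "['rabbit@node-3','rabbit@node-2']"
# prints rabbit@node-1
```

With the values from the session above, the only node reported is rabbit@node-1, the node on which the monitor action still returned success.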

OpenStack Infra (hudson-openstack) wrote on 2015-05-22: Fix proposed to fuel-library (master)

#25

Fix proposed to branch: master
Review: https://review.openstack.org/184987

Bogdan Dobrelya (bogdando) wrote on 2015-05-22:

#26

It looks like the problem is simply that the OCF_RESKEY_CRM_meta_notify_master_uname value can sometimes be returned empty instead of containing the current master. If so, the node will miss all join events and will remain unjoined.
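
One possible guard against this, sketched under assumptions: treat an empty master name as "resource not running" so that Pacemaker restarts the resource instead of silently skipping the join. `notify_precheck` is a hypothetical helper; its argument stands in for $OCF_RESKEY_CRM_meta_notify_master_uname, and 7 is the standard OCF_NOT_RUNNING code.

```shell
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# notify_precheck MASTER_UNAME  (hypothetical helper)
# Returns OCF_NOT_RUNNING when the notify event carries no master name.
notify_precheck() (
    if [ -z "$1" ]; then
        echo "notify: master uname is empty, reporting not running" >&2
        exit "$OCF_NOT_RUNNING"
    fi
    exit "$OCF_SUCCESS"
)

rc=0
notify_precheck "" 2>/dev/null || rc=$?
echo "empty master name -> return code $rc"
```

Whether a restart is the right reaction is a design choice; the point of the sketch is only that an empty value is detected explicitly rather than compared against node names as if it were valid.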

OpenStack Infra (hudson-openstack) wrote on 2015-05-22: Change abandoned on fuel-library (master)

#27

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/184987

OpenStack Infra (hudson-openstack) wrote on 2015-05-22: Fix proposed to fuel-library (master)

#28

Fix proposed to branch: master
Review: https://review.openstack.org/185044

OpenStack Infra (hudson-openstack) on 2015-05-23

Changed in fuel:
assignee:	Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)

Bogdan Dobrelya (bogdando) on 2015-05-25

Changed in fuel:
assignee:	Vladimir Kuklin (vkuklin) → Bogdan Dobrelya (bogdando)
importance:	High → Critical

Bogdan Dobrelya (bogdando) on 2015-05-25

description:

updated

Bogdan Dobrelya (bogdando) wrote on 2015-05-25:

#32

The impact is not critical as the issue does not introduce full downtime for the AMQP cluster

Changed in fuel:
importance:	Critical → High

OpenStack Infra (hudson-openstack) wrote on 2015-05-25: Fix merged to fuel-library (master)

#33

Reviewed: https://review.openstack.org/179032
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=c93bfdc5a70459432ca1989e3f5dccf44885c75d
Submitter: Jenkins
Branch: master

commit c93bfdc5a70459432ca1989e3f5dccf44885c75d
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Apr 30 13:56:18 2015 +0200

Disable rabbitmq management plugin

    There is a known security issue exist in the
    management plugin for RabbitMQ <3.4.3, so it
    has to be disabled by default.

Closes-bug: #1450443
Closes-bug: #1455761

Change-Id: Ic01c26200f6019a8112b1c5fb04a282e64b3b3e6
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Vladimir Kuklin (vkuklin) on 2015-05-25

Changed in fuel:
status:	Fix Committed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-05-25: Fix merged to fuel-library (master)

#34

Reviewed: https://review.openstack.org/185044
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=085fe8c5a2255d4274cdcee5c2a74c15c443c0db
Submitter: Jenkins
Branch: master

commit 085fe8c5a2255d4274cdcee5c2a74c15c443c0db
Author: Bogdan Dobrelya <email address hidden>
Date: Fri May 22 16:40:16 2015 +0200

Fix rabbit OCF demote/stop/promote actions

    * When the rabbit node went down, its status remains 'running'
      in mnesia db for a while, so few retries (50 sec of total) are
      required in order to kick and forget this node from the cluster.
      This also requires +50 sec for actions stop & demote timeout.
    * The rabbit master score in the CIB is retained after the current
      master moved manually. This is wrong and the score must be reset
      ASAP for post-demote and post-stop as well.
    * The demoted node must be kicked from cluster by other nodes
      on post-demote processing.
    * Post-demote should stop the rabbit app at the node being demoted as
      this node should be kicked from the cluster by other nodes.
      Instead, it stops the app at the *other* nodes and brings full
      cluster downtime.
    * The check to join should be only done at the post-start and not at
      the post-promote, otherwise the node being promoted may think it
      is clustered with some node while the join check reports it as
      already clustered with another one.
      (the regression was caused by https://review.openstack.org/184671)
    * Change `hostname` call to `crm_node -n` via $THIS_PCMK_NODE
      everywhere to ensure we are using correct pacemaker node name
    * Handle empty values for OCF_RESKEY_CRM_meta_notify_* by reporting
      the resource as not running. This will rerun resource and restore
      its state, eventually.

Closes-bug: #1436812
Closes-bug: #1455761

Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3
Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) on 2015-05-26

Changed in fuel:
status:	Fix Committed → In Progress

Bogdan Dobrelya (bogdando) wrote on 2015-05-26:

#37

The original bug is resolved now. The remaining issue with the missing start of the rabbit app should be addressed by https://review.openstack.org/185530 as a separate bug.

Changed in fuel:
status:	In Progress → Fix Committed

Bogdan Dobrelya (bogdando) wrote on 2015-05-26:

#38

The new bug is https://bugs.launchpad.net/fuel/+bug/1458828

Bogdan Dobrelya (bogdando) on 2015-05-26

description:

updated

Denis Meltsaykin (dmeltsaykin) wrote on 2015-10-26:

#39

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide in the official product documentation.
