Fuel for OpenStack

Changing of pacemaker parameters breaks RabbitMQ test fail on parsing rabbitmqctl cluster_status output

Bug #1539586 reported by Tatyanka on 2016-01-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Released	High	Nastya Urlapova	Fuel for OpenStack 9.0
	8.0.x	Fix Released	High	Nastya Urlapova	Fuel for OpenStack 8.0

Bug Description

Test case: changing pacemaker parameters doesn't break RabbitMQ.

Scenario:
1. Deploy an environment with at least 3 controllers.
2. Change the max_rabbitmqctl_timeouts parameter on one of the
controllers; after that, the RabbitMQ slaves will be restarted by
Pacemaker.
3. Wait for 1 minute.
4. Check that the RabbitMQ cluster is assembled, retrying until success for up to 10 minutes.
5. Run OSTF.
6. Repeat steps 2-5 two more times.
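
Step 4 of the scenario amounts to a polling loop. A minimal sketch follows, assuming a hypothetical wait_until helper (this is not the fuel-qa implementation); the default timeout and interval mirror the timeout=600, interval=20 values from the failure message, and the predicate here is a stand-in for a real cluster check.

```python
import time


def wait_until(predicate, timeout=600, interval=20,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns a truthy value.

    Returns True on success and False once `timeout` seconds have passed.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in predicate (hypothetical): reports success on the third poll.
polls = []


def cluster_assembled():
    polls.append(1)
    return len(polls) >= 3


# interval=0 keeps the demonstration instant.
print(wait_until(cluster_assembled, timeout=5, interval=0))
```

The predicate is checked once before the first sleep, so a cluster that is already assembled returns immediately.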

On attempt 5, after the next parameter change (crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 8), the test failed with:

Cluster was not assembled in time timeout=600, interval=20, only master was running
2016-01-29 01:50:34,794 - DEBUG helpers.py:335 -- Executing command: 'rabbitmqctl cluster_status'
2016-01-29 01:50:36,000 - DEBUG __init__.py:59 -- Done: run_on_remote_get_results with result: {'stdout_len': 5, 'stderr_len': 0, 'stdout': ["Cluster status of node 'rabbit@messaging-node-1' ...\n", "[{nodes,[{disc,['rabbit@messaging-node-1']}]},\n", " {running_nodes,['rabbit@messaging-node-1']},\n", ' {cluster_name,<<"<email address hidden>">>},\n', ' {partitions,[]}]\n'], 'exit_code': 0, 'stderr_str': '', 'stderr': [], 'stdout_str': 'Cluster status of node \'rabbit@messaging-node-1\' ...\n[{nodes,[{disc,[\'rabbit@messaging-node-1\']}]},\n {running_nodes,[\'rabbit@messaging-node-1\']},\n {cluster_name,<<"<email address hidden>">>},\n {partitions,[]}]\n'}
2016-01-29 01:50:36,001 - INFO sftp.py:129 -- [chan 0] sftp session closed.
2016-01-29 01:50:36,064 - DEBUG test_failover_base.py:1262 -- ### Status for slave-03
{running_nodes,['rabbit@messaging-node-1']},

After reverting the failed environment, the cluster looks OK, but on the slave nodes there are some errors like:
Error on AMQP connection <0.5460.0> (10.109.32.5:43207 -> 10.109.32.6:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
"broker forced connection closure with reason 'shutdown'",none}

=ERROR REPORT==== 29-Jan-2016::01:48:34 ===
Error on AMQP connection <0.3069.0> (10.109.32.6:48067 -> 10.109.32.6:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
"broker forced connection closure with reason 'shutdown'",none}

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "493"
  build_id: "493"
  fuel-nailgun_sha: "b900f9d9de4d2b6ccf27f4addf3f0e38502a0bac"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "e869072139670bb8bbfde00ef04dec3d189f5927"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "f7a008e6801ba0072b08302a740174aec506078a"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "351d568fa3b3e4dd062054b91d766aa54d379867"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6b993b3004e8d97d840b672d6c1d44c320975cd9"
[root@nailgun ~]#

Tatyanka (tatyana-leontovich) wrote on 2016-01-29:

fail_error_change_pacemaker_parameter_not_break_rabbitmq-fuel-snapshot-2016-01-29_01-58-46.tar.xz (57.1 MiB, application/octet-stream)

Dmitry Mescheryakov (dmitrymex) on 2016-02-01

Changed in fuel:
status:	New → Confirmed

Roman Podoliaka (rpodolyaka) on 2016-02-01

summary:

- Change pacemaker parameters break RabbitMQ on attemp 5
+ Changing of pacemaker parameters breaks RabbitMQ on attempt 5
tags:	added: area-oslo

Alexey Lebedeff (alebedev-a) wrote on 2016-02-01: Re: Changing of pacemaker parameters breaks RabbitMQ on attempt 5

As a side note: the pacemaker logs contain a lot of messages from https://bugs.launchpad.net/fuel/+bug/1528686, but I don't think it's related to the current bug.

Alexey Lebedeff (alebedev-a) wrote on 2016-02-01:

Another interesting thing: there are several 'Segmentation fault (core dumped)' log records, probably from 'rabbitmqctl status'. Can we get hold of the live env, or at least of those core files?

Alexey Lebedeff (alebedev-a) wrote on 2016-02-01:

From ./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/lrmd.log:

2016-01-29T01:50:17.168543+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-2.test.domain.local
2016-01-29T01:50:17.841360+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-1.test.domain.local

Looks like pacemaker has elected 2 masters simultaneously.
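
The observation above can be checked mechanically by collecting every host that get_monitor() reports as master. A small sketch, assuming the lrmd.log line format quoted above; reported_masters is a hypothetical helper, not part of Fuel or Pacemaker. It only lists which hosts were reported as master; on its own it cannot tell a quick master switch from two masters existing at the same time.

```python
import re

# Matches the get_monitor() message format seen in the quoted lrmd.log lines.
MASTER_RE = re.compile(
    r"get_monitor\(\): rabbit app is running\. master is (\S+)")


def reported_masters(log_lines):
    """Return distinct master hostnames in order of first appearance."""
    masters = []
    for line in log_lines:
        match = MASTER_RE.search(line)
        if match and match.group(1) not in masters:
            masters.append(match.group(1))
    return masters


lrmd_lines = [
    "2016-01-29T01:50:17.168543+00:00 info: INFO: p_rabbitmq-server: "
    "get_monitor(): rabbit app is running. master is node-2.test.domain.local",
    "2016-01-29T01:50:17.841360+00:00 info: INFO: p_rabbitmq-server: "
    "get_monitor(): rabbit app is running. master is node-1.test.domain.local",
]

print(reported_masters(lrmd_lines))
# ['node-2.test.domain.local', 'node-1.test.domain.local']
```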

Roman Podoliaka (rpodolyaka) on 2016-02-02

tags:

added: move-to-mu

Alexey Lebedeff (alebedev-a) wrote on 2016-02-02:

Actually, after looking into the logs I can't see any evidence that "Cluster was not assembled in time timeout=600" is true.

Pacemaker began stopping slaves at '2016-01-29T01:48:33.101797+00:00' and all nodes were up after '2016-01-29T01:50:52.504074+00:00' - it's definitely less than 600 seconds.
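
The arithmetic behind that statement, as a short sketch using only the two timestamps quoted above (the variable names are illustrative):

```python
from datetime import datetime

# Timestamps copied from the comment above (ISO 8601 with UTC offset).
stop_started = datetime.fromisoformat("2016-01-29T01:48:33.101797+00:00")
all_up = datetime.fromisoformat("2016-01-29T01:50:52.504074+00:00")

elapsed = (all_up - stop_started).total_seconds()
print(elapsed)        # about 139.4 seconds
print(elapsed < 600)  # True: well inside the 600-second timeout
```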

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

Tatyana, I agree with Alexey: there is a clear indication in sys_test.log
https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.ha_neutron_destructive/123/

that the cluster was up again at 01:51:01:

2016-01-29 01:51:01,569 - DEBUG __init__.py:59 -- Done: run_on_remote_get_results with result: {'stdout_len': 7, 'stderr_len': 0, 'stdout': ["Cluster status of node 'rabbit@messaging-node-1' ...\n", "[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-2',\n", " 'rabbit@messaging-node-5']}]},\n", " {running_nodes,['rabbit@messaging-node-2','rabbit@messaging-node-5',\n", " 'rabbit@messaging-node-1']},\n", ' {cluster_name,<<"<email address hidden>">>},\n', ' {partitions,[]}]\n'], 'exit_code': 0, 'stderr_str': '', 'stderr': [], 'stdout_str': 'Cluster status of node \'rabbit@messaging-node-1\' ...\n[{nodes,[{disc,[\'rabbit@messaging-node-1\',\'rabbit@messaging-node-2\',\n \'rabbit@messaging-node-5\']}]},\n {running_nodes,[\'rabbit@messaging-node-2\',\'rabbit@messaging-node-5\',\n \'rabbit@messaging-node-1\']},\n {cluster_name,<<"<email address hidden>">>},\n {partitions,[]}]\n'}

Also, the first cluster_status check was performed just under 3 minutes earlier, at 01:48:34, i.e. RabbitMQ reassembled in under 3 minutes.

Could you please clarify what exactly made the test fail?

Changed in fuel:
assignee:	MOS Oslo (mos-oslo) → Tatyanka (tatyana-leontovich)
status:	Confirmed → Invalid
status:	Invalid → Incomplete

Tatyanka (tatyana-leontovich) wrote on 2016-02-02:

Guys, note that there are 5 attempts (the test is executed in a loop), and the failure happens only on the fifth one (the first 4 were successful). The latest run failed again: https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.ha_neutron_destructive/127/testReport/junit/%28root%29/change_pacemaker_parameter_not_break_rabbitmq/change_pacemaker_parameter_not_break_rabbitmq/
So please help us find the reason; maybe we should wait a little longer here?

Tatyanka (tatyana-leontovich) wrote on 2016-02-02:

OK guys, I looked deeper and can confirm that it is a fuel-qa issue caused by incorrect parsing of the command output.

Changed in fuel:
status:	Incomplete → Confirmed
assignee:	Tatyanka (tatyana-leontovich) → Fuel QA Team (fuel-qa)
tags:	added: area-qa removed: area-oslo move-to-mu

Alexey Lebedeff (alebedev-a) wrote on 2016-02-02:

Is it possible to access this environment? Because I can see that 'Segmentation fault (core dumped)' also reproduces in the latest run. And I need those core dumps for further investigation.

Tatyanka (tatyana-leontovich) on 2016-02-02

summary:

- Changing of pacemaker parameters breaks RabbitMQ on attempt 5
+ Changing of pacemaker parameters breaks RabbitMQ test fail on parsing
+ rabbitmqctl cluster_status output

OpenStack Infra (hudson-openstack) on 2016-02-03

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Fix proposed to fuel-qa (stable/8.0)

#10

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/275807

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Fix merged to fuel-qa (master)

#11

Reviewed: https://review.openstack.org/274130
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=7bf3271ea16f599d6228862878f1befc61d3d2f2
Submitter: Jenkins
Branch: master

commit 7bf3271ea16f599d6228862878f1befc61d3d2f2
Author: Artem Panchenko <email address hidden>
Date: Fri Jan 29 17:53:43 2016 +0200

Use regex for matching node in rabbit-fence logs

    Currently it's possible to add special prefixes for
    hostnames which are used by RabbitMQ (messaging network).
    Modified tests, so such prefixes are ignored while
    parsing rabbit-fence logs.

    Also change 'get_rabbit_running_nodes' method to
    omit node name prefixes while returning list of running
    nodes.

    Change-Id: I63bac7c4eafa61fc756d033dadd7d2ba662eaf4c
    Closes-bug: #1538597
    Closes-bug: #1539586
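
The idea described in the commit message (ignoring hostname prefixes such as "messaging-" when matching node names) can be illustrated with a regular expression. This is a sketch only, not the code from the commit: NODE_RE and strip_prefix are hypothetical names, and the pattern assumes hostnames of the form node-N as seen in the logs in this report.

```python
import re

# Assumption: the real hostname is "node-<number>", optionally preceded by a
# prefix such as "messaging-" after "rabbit@".
NODE_RE = re.compile(r"rabbit@(?:[\w-]+-)?(node-\d+)")


def strip_prefix(rabbit_node):
    """Return the bare node name, or the input unchanged if it does not match."""
    match = NODE_RE.search(rabbit_node)
    return match.group(1) if match else rabbit_node


print(strip_prefix("rabbit@messaging-node-1"))  # node-1
print(strip_prefix("rabbit@node-5"))            # node-5
```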

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Fix merged to fuel-qa (stable/8.0)

#12

Reviewed: https://review.openstack.org/275807
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=646bcc839b0de884c230ff55c522c52f87af828e
Submitter: Jenkins
Branch: stable/8.0

commit 646bcc839b0de884c230ff55c522c52f87af828e
Author: Artem Panchenko <email address hidden>
Date: Fri Jan 29 17:53:43 2016 +0200

Use regex for matching node in rabbit-fence logs

    Currently it's possible to add special prefixes for
    hostnames which are used by RabbitMQ (messaging network).
    Modified tests, so such prefixes are ignored while
    parsing rabbit-fence logs.

    Also change 'get_rabbit_running_nodes' method to
    omit node name prefixes while returning list of running
    nodes.

    Change-Id: I63bac7c4eafa61fc756d033dadd7d2ba662eaf4c
    Closes-bug: #1538597
    Closes-bug: #1539586

Tatyanka (tatyana-leontovich) wrote on 2016-02-04:

#13

There are some places in this test where we do not use the get_rabbit_nodes method, so we still have an issue with parsing the output, like here:
def count_run_rabbit(node, all_up=False):
    # Nested helper: self, n_ctrls and error come from the enclosing test method.
    with self.fuel_web.get_ssh_for_node(node.name) as remote:
        cmd = 'rabbitmqctl cluster_status'
        with RunLimit(seconds=60, error_message=error.format(cmd)):
            out = run_on_remote(remote, cmd=cmd, raise_on_assert=False)
    # Only the first output line containing 'running_nodes' is inspected.
    run_nodes = [el for el in out if 'running_nodes' in el]
    run_nodes = run_nodes[0] if run_nodes else ''
    logger.debug('### Status for {} \n {}'.format(str(node.name),
                                                  run_nodes))
    expected_up = len(n_ctrls) if all_up else 1
    return run_nodes.count('rabbit@') == expected_up
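
One plausible reading of the parsing problem: with longer node names, rabbitmqctl wraps the running_nodes list over two lines, so counting 'rabbit@' on the first matching line misses nodes. The sketch below replays the stdout list quoted earlier in this report through both approaches. It is an illustration under that assumption, not the merged fix; count_first_line and running_nodes are hypothetical names.

```python
import re

# stdout lines as quoted in the sys_test.log excerpt in this report.
stdout = [
    "Cluster status of node 'rabbit@messaging-node-1' ...\n",
    "[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-2',\n",
    " 'rabbit@messaging-node-5']}]},\n",
    " {running_nodes,['rabbit@messaging-node-2','rabbit@messaging-node-5',\n",
    " 'rabbit@messaging-node-1']},\n",
    ' {cluster_name,<<"<email address hidden>">>},\n',
    " {partitions,[]}]\n",
]


def count_first_line(lines):
    """Count nodes the way the snippet above does: first matching line only."""
    hits = [el for el in lines if 'running_nodes' in el]
    first = hits[0] if hits else ''
    return first.count('rabbit@')


RUNNING_RE = re.compile(r"\{running_nodes,\[(.*?)\]\}", re.DOTALL)


def running_nodes(lines):
    """Parse the whole output so a wrapped list is still read completely."""
    match = RUNNING_RE.search(''.join(lines))
    if not match:
        return []
    return re.findall(r"'([^']+)'", match.group(1))


print(count_first_line(stdout))    # 2: the wrapped third node is missed
print(len(running_nodes(stdout)))  # 3
```

Joining the lines before matching makes the result independent of where the output happens to wrap.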

Changed in fuel:
status:	Fix Committed → Fix Released
status:	Fix Released → Confirmed
assignee:	Artem Panchenko (apanchenko-8) → Fuel QA Team (fuel-qa)

Tatyanka (tatyana-leontovich) on 2016-02-04

tags:

added: non-release

Nastya Urlapova (aurlapova) on 2016-02-15

Changed in fuel:
assignee:	Fuel QA Team (fuel-qa) → Nastya Urlapova (aurlapova)

OpenStack Infra (hudson-openstack) on 2016-02-16

Changed in fuel:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-02-16: Fix proposed to fuel-qa (stable/8.0)

#14

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/280599

OpenStack Infra (hudson-openstack) wrote on 2016-02-16: Fix merged to fuel-qa (master)

#15

Reviewed: https://review.openstack.org/280192
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=622b7754c0e5d74e06628d6a753d197167eb89e2
Submitter: Jenkins
Branch: master

commit 622b7754c0e5d74e06628d6a753d197167eb89e2
Author: NastyaUrlapova <email address hidden>
Date: Mon Feb 15 15:43:34 2016 +0300

Fix for count_run_rabbit function

Change-Id: I3178172fb3a4e820ef12d5f6e04e571eda3c9afa
Partial-Bug: #1539586

OpenStack Infra (hudson-openstack) wrote on 2016-02-16: Fix merged to fuel-qa (stable/8.0)

#16

Reviewed: https://review.openstack.org/280599
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=33a7af51c8ac3f1dc64f1614a4dd989780111e81
Submitter: Jenkins
Branch: stable/8.0

commit 33a7af51c8ac3f1dc64f1614a4dd989780111e81
Author: NastyaUrlapova <email address hidden>
Date: Mon Feb 15 15:43:34 2016 +0300

Fix for count_run_rabbit function

Change-Id: I3178172fb3a4e820ef12d5f6e04e571eda3c9afa
Partial-Bug: #1539586

Nastya Urlapova (aurlapova) on 2016-02-16

Changed in fuel:
status:	In Progress → Fix Committed

Artem Panchenko (apanchenko-8) wrote on 2016-02-22:

#17

tests passed on the latest swarm for 8.0 (test plan 8.0 iso #586)

Changed in fuel:
status:	Fix Committed → Fix Released

Nastya Urlapova (aurlapova) on 2016-03-11

tags:

removed: non-release
