Changing of pacemaker parameters breaks RabbitMQ test fail on parsing rabbitmqctl cluster_status output

Bug #1539586 reported by Tatyanka
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Nastya Urlapova
8.0.x               Fix Released  High        Nastya Urlapova

Bug Description

Test case: changing Pacemaker parameters doesn't break RabbitMQ.

Scenario:
1. Deploy an environment with at least 3 controllers.
2. Change the max_rabbitmqctl_timeouts parameter on one of the
controllers; after that, the RabbitMQ slaves will be restarted by
Pacemaker.
3. Wait for 1 minute.
4. Check that the RabbitMQ cluster is assembled, retrying until success for up to 10 min.
5. Run OSTF.
6. Repeat steps 2-5 two more times.
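
Step 4 is a poll-until-success check. A minimal sketch of such a loop is below, assuming the timeout=600 and interval=20 values that appear in the failure message of this report. The helper name wait_until and its clock/sleep parameters are hypothetical and are not taken from fuel-qa.

```python
import time


def wait_until(predicate, timeout=600, interval=20,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Hypothetical helper for illustration only. `clock` and `sleep` are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            # Timed out without ever seeing a successful check.
            return False
        sleep(interval)
```

In the test, the predicate would be whatever check decides that all expected RabbitMQ nodes are running; the loop itself does not care how that is computed.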

On attempt 5, after the next parameter change with
crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 8

the cluster was not assembled in time (timeout=600, interval=20); only the master was running:
2016-01-29 01:50:34,794 - DEBUG helpers.py:335 -- Executing command: 'rabbitmqctl cluster_status'
2016-01-29 01:50:36,000 - DEBUG __init__.py:59 -- Done: run_on_remote_get_results with result: {'stdout_len': 5, 'stderr_len': 0, 'stdout': ["Cluster status of node 'rabbit@messaging-node-1' ...\n", "[{nodes,[{disc,['rabbit@messaging-node-1']}]},\n", " {running_nodes,['rabbit@messaging-node-1']},\n", ' {cluster_name,<<"<email address hidden>">>},\n', ' {partitions,[]}]\n'], 'exit_code': 0, 'stderr_str': '', 'stderr': [], 'stdout_str': 'Cluster status of node \'rabbit@messaging-node-1\' ...\n[{nodes,[{disc,[\'rabbit@messaging-node-1\']}]},\n {running_nodes,[\'rabbit@messaging-node-1\']},\n {cluster_name,<<"<email address hidden>">>},\n {partitions,[]}]\n'}
2016-01-29 01:50:36,001 - INFO sftp.py:129 -- [chan 0] sftp session closed.
2016-01-29 01:50:36,064 - DEBUG test_failover_base.py:1262 -- ### Status for slave-03
  {running_nodes,['rabbit@messaging-node-1']},

After reverting the failed env, the cluster looks OK, but on the slave nodes there are some errors like:
Error on AMQP connection <0.5460.0> (10.109.32.5:43207 -> 10.109.32.6:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

=ERROR REPORT==== 29-Jan-2016::01:48:34 ===
Error on AMQP connection <0.3069.0> (10.109.32.6:48067 -> 10.109.32.6:5673, vhost: '/', user: 'nova', state: running), channel 0:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "493"
  build_id: "493"
  fuel-nailgun_sha: "b900f9d9de4d2b6ccf27f4addf3f0e38502a0bac"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "e869072139670bb8bbfde00ef04dec3d189f5927"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "f7a008e6801ba0072b08302a740174aec506078a"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "351d568fa3b3e4dd062054b91d766aa54d379867"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6b993b3004e8d97d840b672d6c1d44c320975cd9"
[root@nailgun ~]#

Tags: area-qa
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
status: New → Confirmed
summary: - Change pacemaker parameters break RabbitMQ on attemp 5
+ Changing of pacemaker parameters breaks RabbitMQ on attempt 5
tags: added: area-oslo
Alexey Lebedeff (alebedev-a) wrote : Re: Changing of pacemaker parameters breaks RabbitMQ on attempt 5

As a side note - pacemaker logs contain a lot of messages from https://bugs.launchpad.net/fuel/+bug/1528686, but I don't think it's related to the current bug.

Alexey Lebedeff (alebedev-a) wrote :

Another interesting thing - there are several 'Segmentation fault (core dumped)' log records, probably from 'rabbitmqctl status'. Can we get hold of the live env, or at least of those core files?

Alexey Lebedeff (alebedev-a) wrote :

From ./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/lrmd.log:

2016-01-29T01:50:17.168543+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-2.test.domain.local
2016-01-29T01:50:17.841360+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-1.test.domain.local

Looks like pacemaker has elected 2 masters simultaneously.

tags: added: move-to-mu
Alexey Lebedeff (alebedev-a) wrote :

Actually, after looking into logs I can't see evidence that "Cluster was not assembled in time timeout=600" is true.

Pacemaker began stopping slaves at '2016-01-29T01:48:33.101797+00:00' and all nodes were up after '2016-01-29T01:50:52.504074+00:00' - it's definitely less than 600 seconds.
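
For reference, the gap between those two timestamps can be checked with a few lines of Python (a quick sketch; the variable names are only illustrative, and the timestamps are the ones quoted above):

```python
from datetime import datetime

# Timestamps quoted from the pacemaker logs in this comment.
stop_started = datetime.fromisoformat("2016-01-29T01:48:33.101797+00:00")
all_nodes_up = datetime.fromisoformat("2016-01-29T01:50:52.504074+00:00")

# Seconds between the start of the slave restart and all nodes being up.
elapsed = (all_nodes_up - stop_started).total_seconds()
print("elapsed seconds:", elapsed)
print("within the 600 s timeout:", elapsed < 600)
```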

Dmitry Mescheryakov (dmitrymex) wrote :

Tatyana, I agree with Alexey; there is a clear indication in sys_test.log
https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.ha_neutron_destructive/123/

that the cluster was up again at 01:51:01:

2016-01-29 01:51:01,569 - DEBUG __init__.py:59 -- Done: run_on_remote_get_results with result: {'stdout_len': 7, 'stderr_len': 0, 'stdout': ["Cluster status of node 'rabbit@messaging-node-1' ...\n", "[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-2',\n", " 'rabbit@messaging-node-5']}]},\n", " {running_nodes,['rabbit@messaging-node-2','rabbit@messaging-node-5',\n", " 'rabbit@messaging-node-1']},\n", ' {cluster_name,<<"<email address hidden>">>},\n', ' {partitions,[]}]\n'], 'exit_code': 0, 'stderr_str': '', 'stderr': [], 'stdout_str': 'Cluster status of node \'rabbit@messaging-node-1\' ...\n[{nodes,[{disc,[\'rabbit@messaging-node-1\',\'rabbit@messaging-node-2\',\n \'rabbit@messaging-node-5\']}]},\n {running_nodes,[\'rabbit@messaging-node-2\',\'rabbit@messaging-node-5\',\n \'rabbit@messaging-node-1\']},\n {cluster_name,<<"<email address hidden>">>},\n {partitions,[]}]\n'}

Also, the first cluster_status check was performed just 3 minutes earlier, at 01:48:34. I.e., RabbitMQ reassembled in 3 minutes.

Could you please clarify what exactly made the test fail?

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Tatyanka (tatyana-leontovich)
status: Confirmed → Invalid
status: Invalid → Incomplete
Tatyanka (tatyana-leontovich) wrote :

Guys, note that there are 5 attempts (the test is executed in a loop), and the failure happens only on the fifth one (4 were successful). The latest run failed again: https://product-ci.infra.mirantis.net/view/8.0_swarm/job/8.0.system_test.ubuntu.ha_neutron_destructive/127/testReport/junit/%28root%29/change_pacemaker_parameter_not_break_rabbitmq/change_pacemaker_parameter_not_break_rabbitmq/
So please help us find the reason; maybe we should wait a little bit longer here?

Tatyanka (tatyana-leontovich) wrote :

OK guys, I looked deeper and can confirm that it is a fuel-qa issue caused by wrong parsing of the cmd output.

Changed in fuel:
status: Incomplete → Confirmed
assignee: Tatyanka (tatyana-leontovich) → Fuel QA Team (fuel-qa)
tags: added: area-qa
removed: area-oslo move-to-mu
Alexey Lebedeff (alebedev-a) wrote :

Is it possible to access this environment? Because I can see that 'Segmentation fault (core dumped)' also reproduces in the latest run. And I need those core dumps for further investigation.

summary: - Changing of pacemaker parameters breaks RabbitMQ on attempt 5
+ Changing of pacemaker parameters breaks RabbitMQ test fail on parsing
+ rabbitmqctl cluster_status output
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/275807

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/274130
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=7bf3271ea16f599d6228862878f1befc61d3d2f2
Submitter: Jenkins
Branch: master

commit 7bf3271ea16f599d6228862878f1befc61d3d2f2
Author: Artem Panchenko <email address hidden>
Date: Fri Jan 29 17:53:43 2016 +0200

    Use regex for matching node in rabbit-fence logs

    Currently it's possible to add special prefixes for
    hostnames which are used by RabbitMQ (messaging network).
    Modified tests, so such prefixes are ignored while
    parsing rabbit-fence logs.

    Also change 'get_rabbit_running_nodes' method to
    omit node name prefixes while returning list of running
    nodes.

    Change-Id: I63bac7c4eafa61fc756d033dadd7d2ba662eaf4c
    Closes-bug: #1538597
    Closes-bug: #1539586

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/275807
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=646bcc839b0de884c230ff55c522c52f87af828e
Submitter: Jenkins
Branch: stable/8.0

commit 646bcc839b0de884c230ff55c522c52f87af828e
Author: Artem Panchenko <email address hidden>
Date: Fri Jan 29 17:53:43 2016 +0200

    Use regex for matching node in rabbit-fence logs

    Currently it's possible to add special prefixes for
    hostnames which are used by RabbitMQ (messaging network).
    Modified tests, so such prefixes are ignored while
    parsing rabbit-fence logs.

    Also change 'get_rabbit_running_nodes' method to
    omit node name prefixes while returning list of running
    nodes.

    Change-Id: I63bac7c4eafa61fc756d033dadd7d2ba662eaf4c
    Closes-bug: #1538597
    Closes-bug: #1539586

Tatyanka (tatyana-leontovich) wrote :

There are some places in this test where we do not use the method get_rabbit_nodes, so we still have an issue with parsing the output, like here:
    def count_run_rabbit(node, all_up=False):
        with self.fuel_web.get_ssh_for_node(node.name) as remote:
            cmd = 'rabbitmqctl cluster_status'
            with RunLimit(seconds=60, error_message=error.format(cmd)):
                out = run_on_remote(remote, cmd=cmd, raise_on_assert=False)
        run_nodes = [el for el in out if 'running_nodes' in el]
        run_nodes = run_nodes[0] if run_nodes else ''
        logger.debug('### Status for {} \n {}'.format(str(node.name),
                                                      run_nodes))
        expected_up = len(n_ctrls) if all_up else 1
        return run_nodes.count('rabbit@') == expected_up
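
To illustrate why the line-based check undercounts: in the 01:51:01 output quoted earlier in this report, the running_nodes list wraps onto a second line, so counting 'rabbit@' on the first matching line sees only two of the three nodes. Below is a rough sketch of a parser that joins the lines first. It is not the change merged for this bug (that diff is not shown here); the function names running_rabbit_nodes and count_on_first_line are hypothetical.

```python
import re

# stdout lines copied from the 01:51:01 log entry quoted in this report.
SAMPLE = [
    "Cluster status of node 'rabbit@messaging-node-1' ...\n",
    "[{nodes,[{disc,['rabbit@messaging-node-1','rabbit@messaging-node-2',\n",
    " 'rabbit@messaging-node-5']}]},\n",
    " {running_nodes,['rabbit@messaging-node-2','rabbit@messaging-node-5',\n",
    " 'rabbit@messaging-node-1']},\n",
    ' {cluster_name,<<"<email address hidden>">>},\n',
    " {partitions,[]}]\n",
]


def count_on_first_line(stdout_lines):
    """Mimic the line-based counting from count_run_rabbit above."""
    matching = [el for el in stdout_lines if 'running_nodes' in el]
    first = matching[0] if matching else ''
    return first.count('rabbit@')


def running_rabbit_nodes(stdout_lines):
    """Return node names from the running_nodes term, even if it wraps."""
    text = ''.join(stdout_lines)
    # Capture everything between "{running_nodes,[" and the closing "]}".
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", text, re.DOTALL)
    if match is None:
        return []
    return re.findall(r"'(rabbit@[^']+)'", match.group(1))


print(count_on_first_line(SAMPLE))
print(running_rabbit_nodes(SAMPLE))
```

With a check like this, the comparison against expected_up would use len(running_rabbit_nodes(out)) instead of a per-line count.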

Changed in fuel:
status: Fix Committed → Fix Released
status: Fix Released → Confirmed
assignee: Artem Panchenko (apanchenko-8) → Fuel QA Team (fuel-qa)
tags: added: non-release
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Nastya Urlapova (aurlapova)
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/280599

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/280192
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=622b7754c0e5d74e06628d6a753d197167eb89e2
Submitter: Jenkins
Branch: master

commit 622b7754c0e5d74e06628d6a753d197167eb89e2
Author: NastyaUrlapova <email address hidden>
Date: Mon Feb 15 15:43:34 2016 +0300

    Fix for count_run_rabbit function

    Change-Id: I3178172fb3a4e820ef12d5f6e04e571eda3c9afa
    Partial-Bug: #1539586

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/280599
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=33a7af51c8ac3f1dc64f1614a4dd989780111e81
Submitter: Jenkins
Branch: stable/8.0

commit 33a7af51c8ac3f1dc64f1614a4dd989780111e81
Author: NastyaUrlapova <email address hidden>
Date: Mon Feb 15 15:43:34 2016 +0300

    Fix for count_run_rabbit function

    Change-Id: I3178172fb3a4e820ef12d5f6e04e571eda3c9afa
    Partial-Bug: #1539586

Changed in fuel:
status: In Progress → Fix Committed
Artem Panchenko (apanchenko-8) wrote :

tests passed on the latest swarm for 8.0 (test plan 8.0 iso #586)

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: non-release