[ostf] RabbitMQ tests failed (Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec)

Bug #1620649 reported by Dmitry Belyaninov
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Committed
High
Dmitry Mescheryakov
Mitaka
Fix Released
High
Dmitry Mescheryakov

Bug Description

Detailed bug description:

There is failed SWARM test:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.bonding_ha/50/testReport/(root)/deploy_bonding_neutron_tun/deploy_bonding_neutron_tun/

Failed 2 OSTF tests; should fail 0 tests. Names of failed tests:
  - RabbitMQ availability (failure) Number of RabbitMQ nodes is not equal to number of cluster nodes.
  - RabbitMQ replication (failure) Failed to establish AMQP connection to 5673/tcp port on 192.168.0.4 from controller node! Please refer to OpenStack logs for more details.

  File "/usr/lib/python2.7/site-packages/fuel_health/ha_base.py", line 245, in check_rabbit_connections
    remote.exec_command(cmd)
  File "/usr/lib/python2.7/site-packages/fuel_health/common/ssh.py", line 166, in exec_command
    strerror=''.join(err_data).join(out_data))
SSHExecCommandFailed: Command 'python -c 'import kombu; c = kombu.Connection("amqp://nova:gRY5Jy7VIqDzOxhv8I6A4xvu@192.168.0.4:5673//"); c.connect()'', exit status: 1, Error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 237, in connect
    return self.connection
  File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 741, in connection
    self._connection = self._establish_connection()
  File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 696, in _establish_connection
    conn = self.transport.establish_connection()
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 116, in establish_connection
    conn = self.Connection(**opts)
  File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 165, in __init__
    self.transport = self.Transport(host, connect_timeout, ssl)
  File "/usr/lib/python2.7/dist-packages/amqp/connection.py", line 186, in Transport
    return create_transport(host, connect_timeout, ssl)
  File "/usr/lib/python2.7/dist-packages/amqp/transport.py", line 299, in create_transport
    return TCPTransport(host, connect_timeout)
  File "/usr/lib/python2.7/dist-packages/amqp/transport.py", line 95, in __init__
    raise socket.error(last_err)
socket.error: [Errno 111] Connection refused

paramiko.transport: DEBUG: EOF in transport thread
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/ha_base.py", line 249, in check_rabbit_connections
    "port on {0} from controller node!".format(ip, port))
  File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
    raise self.failureException(msg)
AssertionError: Failed to establish AMQP connection to 5673/tcp port on 192.168.0.4 from controller node!

Note:
https://bugs.launchpad.net/fuel/+bug/1604731

Steps to reproduce:
run test
Expected results:
pass
Actual result:
fail
Reproducibility:
yes
Workaround:
 <put your information here>
Impact:
 <put your information here>
Description of the environment:
 Operation system: <put your information here>
 Versions of components: <put your information here>
 Reference architecture: <put your information here>
 Network model: <put your information here>
 Related projects installed: <put your information here>
Additional information:
 <put your information here>

Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
milestone: none → 9.1
tags: added: swarm-fail
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
tags: added: area-library
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Maksim Malchuk (mmalchuk)
Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like a new corner case the OCF RA cannot handle:

2016-09-06T06:42:42.844680+00:00 node-2 lrmd.log info: INFO: p_rabbitmq-server[26187]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl join_cluster rabbit@messaging-node-5
(note: 70 EX_SOFTWARE Any other error discovered when running command against live node)

this repeats on each start attempt in the endless loop

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This logs snippet http://pastebin.com/1gGAyf65 shows that there is a flaw in the mnesia erase logic. It seems has been introduced by the recent changes to the (post-)demote action.

In the snippet you can see that node-2 was not able to join to the node-5 because of the schema merge issue. The OCF RA normally handles this by mnesia erase. But there is none mnesia erase events for the master running (node-5). So perhaps that's why this situation was never resolved - because of the "bad master's mnesia"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, that it seems that things went wrong after this:
2016-09-06T06:40:48.997924+00:00 node-5 lrmd.log info: INFO: p_rabbitmq-server[21395]: su_rabbit_cmd(): the invoked command exit
ed 70: /usr/sbin/rabbitmqctl start_app >>/var/log/rabbitmq/startup_log 2>&1
2016-09-06T06:41:00.425997+00:00 node-5 lrmd.log info: INFO: p_rabbitmq-server[21395]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl reset

I believe that node-5's mnesia should've been erased forcibly after the soft reset exited with 70. But there is no message "Mnesia files appear corrupted and have been removed" which indicates the force erase event.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is another logs snippet http://pastebin.com/0tGcv14j from the snapshot in https://bugs.launchpad.net/fuel/+bug/1620649/comments/1

It seems that a successful command force_reset doesn't heals the given "merge schema" corner case.
Currently, OCF RA ensures mnesia cleanup as:
- try reset,
- if failed, try force_reset,
- if failed, erase mnesia files.

The step 2 perhaps should be skipped straight to the step 3

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Bogdan Dobrelya (bogdando)
status: Confirmed → In Progress
milestone: 9.1 → 10.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/366711

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Maksim Malchuk (mmalchuk)
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/mitaka)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/366711
Reason: This doesn't help to fix the bug

Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: [ostf] RabbitMQ tests failed

I tested the patch and it doesn't help to mitigate the given corner case. Nor does a full stop of cluster nodes, manual mnesia files removal for all, then start from a scratch. The issue lives in another place, yet unknown.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Dmitry Mescheryakov (dmitrymex)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

As Bogdan pointed out, the following errors appear in RabbitMQ's log when it tries to join the cluster: http://paste.openstack.org/show/567595/
Deleting mnesia files does not help, the issue persists.

It turns out that starting from some version Mnesia schema lives outside of other Mnesia files in /var/lib/rabbitmq/Mnesia.rabbit@messaging-node-3/

Deleting that directory on the affected node resolves the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/366915

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/366915
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=44f57fdcd8b4513f0453e8c0d2330667e90a55f7
Submitter: Jenkins
Branch: stable/mitaka

commit 44f57fdcd8b4513f0453e8c0d2330667e90a55f7
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Sep 7 21:44:20 2016 +0300

    In RabbitMQ OCF script delete Mnesia schema on mnesia reset

    Not doing so leads to RabbitMQ node being half-stuck in cluster. As a
    result, it can't clearly join back and constantly fails.

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/946

    Change-Id: Ib03f1d5d7e56f25bf122471866e2b8d1736ec88a
    Closes-Bug: #1620649

tags: added: on-verification
Revision history for this message
Dmitriy Kruglov (dkruglov) wrote : Re: [ostf] RabbitMQ tests failed

Verified on MOS 9.x, snapshot #251.
The issue is fixed.

tags: removed: on-verification
summary: - [ostf] RabbitMQ tests failed
+ [ostf] RabbitMQ tests failed (Failed to merge schema: Bad cookie in
+ table definition mirrored_sup_childspec)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue https://bugs.launchpad.net/fuel/+bug/1648318 is completely unrelated to that bug. Hence dropping the duplicate link and returning the bug to 'fix released' state. See https://bugs.launchpad.net/fuel/+bug/1648318/comments/4 for RCA for that bug.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Fixed in 10.0 as our current version of OCF script is taken directly from upstream, which already has the fix.

Changed in fuel:
status: Confirmed → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.