[system-tests]Fix fuelweb_tests for RabbitMQ HA full cluster restart

Bug #1383247 reported by Andrey Sledzinskiy
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Dennis Dmitriev
5.1.x               Won't Fix     High        Fuel QA Team
6.0.x               Invalid      Undecided   Fuel QA Team
6.1.x               Fix Released  High        Dennis Dmitriev

Bug Description

{

    "build_id": "2014-10-18_00-35-45",
    "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
    "build_number": "32",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "b9792cb5bbecddfa9c5c3afb4d0f961a2a2776a7",
    "production": "docker",
    "fuelmain_sha": "7bac3edb9760449ccd2c43d9078a6150c0685590",
    "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
    "feature_groups": [
        "mirantis"
    ],
    "release": "5.1.1",
    "release_versions": {
        "2014.1.1-5.1.1": {
            "VERSION": {
                "build_id": "2014-10-18_00-35-45",
                "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346",
                "build_number": "32",
                "api": "1.0",
                "nailgun_sha": "b9792cb5bbecddfa9c5c3afb4d0f961a2a2776a7",
                "production": "docker",
                "fuelmain_sha": "7bac3edb9760449ccd2c43d9078a6150c0685590",
                "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "5.1.1",
                "fuellib_sha": "4f8414a08316a0c569bf74752b801be77169a9c5"
            }
        }
    },
    "fuellib_sha": "4f8414a08316a0c569bf74752b801be77169a9c5"

}

Steps:
1. Create the following cluster: CentOS, HA, Flat Nova-network, Ceph for volumes and images, 3 controller+ceph, 2 compute+ceph, 1 ceph node
2. Deploy cluster
3. Destroy 1 compute+ceph, 1 ceph node
4. Restart 3 controllers
5. Check cinder services

Expected - Cinder services come back up within 5 minutes
Actual - 5 minutes is not enough for the Cinder services to come up; after reverting the snapshot, the services were up after 5-10 minutes
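
For context, the failing step is a polling check with a 5-minute timeout. The sketch below is only an illustration, not the real wait_cinder_is_up() from fuel_web_client.py; it assumes it runs on a controller with OpenStack credentials sourced and that 'cinder service-list' marks healthy services with 'up'.

import subprocess
import time

def cinder_services_up():
    # Hypothetical check: parse 'cinder service-list' and require every
    # cinder-* service row to report state 'up'.
    output = subprocess.check_output(["cinder", "service-list"]).decode()
    rows = [line for line in output.splitlines() if "cinder-" in line]
    return bool(rows) and all("| up " in line for line in rows)

def wait_cinder_ready(timeout=5 * 60, interval=20):
    # Poll until the services are up or the (too short) 5-minute timeout hits.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if cinder_services_up():
            return
        time.sleep(interval)
    raise RuntimeError("Cinder services not ready.")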

ERROR: Deploy ceph with in HA mode
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/usr/lib/python2.7/dist-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/usr/lib/python2.7/dist-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.centos.thread_3/fuelweb_test/helpers/decorators.py", line 52, in wrapper
    return func(*args, **kwagrs)
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.centos.thread_3/fuelweb_test/tests/tests_strength/test_restart.py", line 149, in ceph_ha_restart
    self.fuel_web.wait_cinder_is_up(['slave-01'])
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.centos.thread_3/fuelweb_test/__init__.py", line 48, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/5.1_fuelmain.system_test.centos.thread_3/fuelweb_test/models/fuel_web_client.py", line 1006, in wait_cinder_is_up
    raise TimeoutError("Cinder services not ready.")
TimeoutError: Cinder services not ready.

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on CI test: http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_3/29/console

In fact, this issue is related to RabbitMQ taking a long time to start:

=============== node-1.test.domain.local/cinder-volume.log =======================
2014-10-22T18:19:02.878091+01:00 err: 2014-10-22 17:19:02.850 3403 ERROR oslo.messaging._drivers.impl_rabbit [req-e2d7f6ca-5b6a-4838-a4c7-43370fa25bde - -- - -] AMQP server on 127.0.0.1:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 30 seconds.
======================================

Unfortunately, there are no pacemaker logs in the diagnostic snapshot, so it is hard to investigate what happened.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

The same issue with RabbitMQ starting slowly was observed on CI test http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_5/28/console , test name 'deploy_ha_neutron'.

RabbitMQ started several minutes later than OSTF ran, so the test failed on the RabbitMQ OSTF check.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

RabbitMQ is assembled into a cluster by pacemaker in several separate stages ('start' to check Mnesia database consistency, then 'pre-promote', 'promote' and 'post-promote' to choose the master and join the other nodes to it).

Pacemaker runs each stage for the 'rabbitmq' resource together with the other resources ('heat' and 'mysql'), and moves to the next stage only when all resources have been processed, one by one, in the current stage.

We often face a broken Galera cluster that takes a long time to restore.

The script /usr/lib/ocf/resource.d/mirantis/mysql-wss on the controller takes about 7 minutes for every attempt to start Galera, preventing pacemaker from processing the other resources. This leads to roughly a seven-minute gap between processing the 'rabbitmq' stages.

Taking the other resources into account, we have to wait about 20 minutes before RabbitMQ becomes functional ('start' ... ~10 minutes waiting for the others ... 'promote to master' ... ~10 minutes waiting for the others ... 'join the cluster and allow access to rabbitmq').
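
To quantify the per-attempt delay, a single OCF 'start' action can be timed by hand; the sketch below is only an illustration (the agent path and OCF_ROOT value come from this comment, everything else is assumed).

import os
import subprocess
import time

def time_ocf_action(agent="/usr/lib/ocf/resource.d/mirantis/mysql-wss",
                    action="start"):
    # Run the resource agent manually, the same way as in the output below,
    # and measure how long the action takes.
    env = dict(os.environ, OCF_ROOT="/usr/lib/ocf/")
    started = time.time()
    process = subprocess.Popen([agent, action], env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
    output, _ = process.communicate()
    return time.time() - started, process.returncode, output

# elapsed, rc, output = time_ocf_action()
# print("mysql-wss 'start' took %.0f seconds (rc=%s)" % (elapsed, rc))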

Unfortunately, logging from 'mysql-wss' is broken, so here is the output of the mysql-wss script started manually:

================================================================================
[root@node-2 mirantis]# date
Thu Oct 23 18:24:23 UTC 2014

[root@node-2 mirantis]# OCF_ROOT=/usr/lib/ocf/ /usr/lib/ocf/resource.d/mirantis/mysql-wss start
INFO: mysql_status: ====================== i = 1 ; sleeptime = 5
INFO: PIDFile /var/run/mysql/mysqld.pid of MySQL server not found. Sleeping for 5 seconds. 0 retries left
INFO: MySQL is not running
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Checking if galera primary controller
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24134
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24114
INFO: GTID OK: 96cc782e-5aa0-11e4-b985-066f5a65a8fa:24209
INFO: Possible masters: node-4.test.domain.local
INFO: Choosed master: node-4.test.domain.local
date
INFO: Waiting for master. 300 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 270 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 240 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 210 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 180 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 150 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 120 seconds to go
Resource 'default' not found: No such device or address
Error performing operation: No such device or address
INFO: Waiting for master. 90 seconds to go
Resource 'default' not found: No such device or address
Err...


summary: - Cinder services are down after cold restart all controllers
+ RabbitMQ is started for a very long time in HA
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Revision history for this message
Tatyanka (tatyana-leontovich) wrote : Re: RabbitMQ is started for a very long time in HA
Changed in fuel:
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe the proper check for
5. Check cinder services
could be
5. Check fuel health --env X --check HA

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> Unfortunately, there are no pacemaker logs in the diagnostic snapshot, so it is hard to investigate what happened.

Check /var/log/remote/node*/rabbitmq-server.log for cluster reassembling events from pacemaker.
Other logs from corosync and pacemaker are located under /var/log/node*/ (crmd, lrmd, attrd, cibadmin, etc.).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

For the logs attached in #1, you can inspect ./node-{1,2,4}.test.domain.local/lrmd.log for rabbitmq reassembling events.
When it is done, there should be messages in the logs like 'INFO: p_rabbitmq-server: get_monitor(): rabbit app is running and is member of healthy cluster'

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It looks like the test case described in the bug was not performed correctly: http://pastebin.com/jZ5UM6qb (from the logs in #1)

As you can see, the combined full-reboot and cluster-reassembly verification period was less than 5 minutes, and the log snapshot was taken too early, before the cluster managed to reassemble.

The correct check should (a sketch follows this list):
1) measure time-to-reassemble from the moment the reboot has finished, instead of from the moment it was initiated;
2) for any given node, measure time-to-reassemble between the moment corosync started and the timestamp of the nearest 'rabbit app is running and is member of healthy cluster' event.
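
A rough sketch of how measurement 2) could be implemented is below. It is not fuel-qa code: it assumes lrmd.log lines start with an ISO-8601 timestamp like the excerpts in this bug, that all nodes log in the same timezone, and that the corosync start time is obtained separately.

import re
from datetime import datetime

HEALTHY_MSG = "rabbit app is running and is member of healthy cluster"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})")

def parse_timestamp(line):
    match = TS_RE.match(line)
    if match:
        return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return None

def time_to_reassemble(lrmd_log_path, corosync_started_at):
    # Seconds between corosync start and the nearest 'healthy cluster'
    # event that follows it; None means the cluster has not reassembled.
    with open(lrmd_log_path) as log:
        for line in log:
            stamp = parse_timestamp(line)
            if stamp and HEALTHY_MSG in line and stamp >= corosync_started_at:
                return (stamp - corosync_started_at).total_seconds()
    return None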

tags: added: to-be-covered-by-tests
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Please update the verification steps and re-submit a correctly taken log snapshot.

Changed in fuel:
status: Confirmed → Incomplete
importance: Medium → High
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

We will perform checks for the critical services in the following order (a rough sketch follows the list):

- Wait until MySQL Galera is up on some controller
- Wait until the RabbitMQ cluster is up and accepts connections
- Wait until the Cinder services are up on some controller
- Check Ceph status
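
A minimal sketch of this ordering, assuming hypothetical check callables (galera_is_up, rabbit_accepts_connections, cinder_services_up, ceph_health_ok) rather than the real fuelweb_test helpers; the timeouts are also guesses.

import time

def wait_for(check, timeout, interval=30):
    # Generic polling helper: call check() until it returns True or the
    # timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return
        time.sleep(interval)
    raise RuntimeError("%s did not succeed within %s seconds"
                       % (check.__name__, timeout))

# Intended order of checks after a full cluster restart:
# wait_for(galera_is_up, timeout=20 * 60)
# wait_for(rabbit_accepts_connections, timeout=20 * 60)
# wait_for(cinder_services_up, timeout=15 * 60)
# wait_for(ceph_health_ok, timeout=10 * 60)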

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks good, but please consider replacing
- Wait until MySQL Galera is up on some controller
- Wait until the RabbitMQ cluster is up and accepts connections
with the OSTF HA test group.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

As far as I can see, the RabbitMQ OSTF test performs only a "rabbitmqctl cluster_status" check. No functionality is checked by OSTF, so it never covers the pacemaker logic for assembling the RabbitMQ cluster.

There are situations where OSTF does not reflect the actual RabbitMQ status:
- "rabbitmqctl cluster_status" shows that all nodes are running, but the pacemaker OCF script has not yet opened port 5673 in iptables (or an extra iptables rule that blocks port 5673 is still in place);
- "rabbitmqctl cluster_status" shows that all nodes are running, but pacemaker is only checking whether rabbitmq starts (the 'start' phase) and is going to shut it down before performing further steps;
- "rabbitmqctl cluster_status" shows that all nodes are running, but RabbitMQ is still unreachable through haproxy because of a haproxy, vip__management, network or other issue; in this case it looks non-working to the other services.

We want to make sure that RabbitMQ is successfully assembled by pacemaker and ready to serve requests from other services.
A better check would be to create a queue and send some test messages, but that is not implemented in OSTF yet; a sketch of such a check is below.
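
For illustration, a functional round-trip check could look like the sketch below. It is not existing OSTF code, and the management VIP address, port, credentials and queue name are all assumptions.

from kombu import Connection

def rabbit_round_trip(host="10.108.0.2", port=5673,
                      user="nova", password="nova"):
    # Publish and consume one test message through the address other
    # services actually use (the haproxy-exposed endpoint).
    url = "amqp://%s:%s@%s:%d//" % (user, password, host, port)
    with Connection(url, connect_timeout=10) as connection:
        queue = connection.SimpleQueue("bug1383247_probe")
        try:
            queue.put("ping")
            message = queue.get(block=True, timeout=10)
            message.ack()
            return message.payload == "ping"
        finally:
            queue.close()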

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Submitted related bug https://bugs.launchpad.net/fuel/+bug/1387567

OK, please stay in touch with the OSTF team so they can reuse your code as well.

Changed in fuel:
status: Incomplete → In Progress
assignee: Fuel Library Team (fuel-library) → Dennis Dmitriev (ddmitriev)
summary: - RabbitMQ is started for a very long time in HA
+ Fix fuelweb_tests for RabbitMQ HA full cluster restart
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (master)

Change abandoned by Dennis Dmitriev (<email address hidden>) on branch: master
Review: https://review.openstack.org/131742
Reason: The RabbitMQ check in OSTF will be more powerful than this one, so additional custom checks are not necessary.

no longer affects: fuel/6.0.x
Revision history for this message
Tatyanka (tatyana-leontovich) wrote : Re: Fix fuelweb_tests for RabbitMQ HA full cluster restart

Moved to Incomplete for 6.0.x since it is currently unclear how to reproduce it.

summary: - Fix fuelweb_tests for RabbitMQ HA full cluster restart
+ [system-tests]Fix fuelweb_tests for RabbitMQ HA full cluster restart
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-ostf (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/178864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/178966

tags: added: non-release
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/178966
Committed: https://git.openstack.org/cgit/stackforge/fuel-qa/commit/?id=aa50833aaebc598de37fcc5d617d77f894b569e7
Submitter: Jenkins
Branch: master

commit aa50833aaebc598de37fcc5d617d77f894b569e7
Author: Dennis Dmitriev <email address hidden>
Date: Thu May 14 17:02:13 2015 +0300

    Add two methods to wait for cluster HA and OS services ready

    assert_ha_services_ready():
     OSTF 'HA' test group should be used to validate if a cluster
     in the operational state.
     There are rabbitmq and mysql checks, and will be added haproxy
     and pacemaker checks.

     Without these services the cluster can fail requests from tests.

    assert_os_services_ready():
     OSTF 'Sanity' test group to wait until OpenStack services are
     ready.

    Change-Id: Ie1bddc965719ca59a143f8f43c53546a4553b1b9
    Closes-Bug: #1383247
    Closes-Bug: #1455910

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Moved to Invalid as the issue for that version has not been updated for more than 3 weeks.

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

MOS 5.1 is no longer supported, moving to Won't Fix.
