Rabbit join race with OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed after reschedule router from primary controller and destroying it

Bug #1491306 reported by Dmitry Tyzhnenko
This bug affects 1 person
Affects             Status        Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Released  Critical    Bogdan Dobrelya
5.1.x               Won't Fix     Critical    Denis Meltsaykin
6.0.x               Won't Fix     Critical    Denis Meltsaykin
6.1.x               Fix Released  Critical    Denis Meltsaykin

Bug Description

Failed on CI - https://product-ci.infra.mirantis.net/job/7.0.system_test.ubuntu.ha_neutron_destructive_2/23/console

Scenario:
  1. Create cluster. HA, Neutron with VXLAN segmentation
  2. Add 3 nodes with controller roles
  3. Add 2 nodes with compute roles
  4. Add 1 node with cinder role
  5. Deploy the cluster
  6. Create an instance with a key pair
  7. Manually reschedule router from primary controller to another one
  8. Destroy controller with l3-agent
  9. Check l3-agent was rescheduled
  10. Check network connectivity from instance via dhcp namespace
  11. Run OSTF

Expected result:
  All steps pass

Actual result:
  Two OSTF tests fail at step 11: 'RabbitMQ availability' and 'RabbitMQ replication'

Fuel version 7.0-262
{
 "build_id": "262",
 "build_number": "262",
 "auth_required": true,
 "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
 "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
 "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
 "openstack_version": "2015.1.0-7.0",
 "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
 "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
 "api": "1.0",
 "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
 "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
 "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
 "feature_groups": [
  "mirantis"
 ],
 "release": "7.0",
 "release_versions": {
  "2015.1.0-7.0": {
   "VERSION": {
    "build_id": "262",
    "build_number": "262",
    "fuel-library_sha": "1556601b9b7503285714d7d1e02cc0807b1c68f0",
    "nailgun_sha": "b564ae20116297750bf6402b3a017e219bf4b468",
    "fuel-ostf_sha": "582a81ccaa1e439a3aec4b8b8f6994735de840f4",
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
    "api": "1.0",
    "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
    "astute_sha": "e63709d16bd4c1949bef820ac336c9393c040d25",
    "fuelmain_sha": "4dc6799370da4cddf06c04e4ecb7646102298535",
    "feature_groups": [
     "mirantis"
    ],
    "release": "7.0",
    "openstack_version": "2015.1.0-7.0",
    "production": "docker"
   }
  }
 },
 "production": "docker"
}

Tags: ha rabbitmq
Dmitry Tyzhnenko (dtyzhnenko) wrote :
Bogdan Dobrelya (bogdando) wrote :

How long did you wait for rabbit failover to complete after step 8?

Changed in fuel:
status: New → Incomplete
Nastya Urlapova (aurlapova) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

The test is incorrect. After you destroy any controller, you should wait for rabbit failover to finish, which is indicated by the OSTF HA tests passing.
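
The waiting described in this comment could be sketched as a generic polling helper. This is only an illustration: `wait_until` and the `check` callable are hypothetical names, not part of fuel-qa or OSTF, and the timeout values are arbitrary assumptions. The caller would supply a `check` that reports whether the HA health checks currently pass.

```python
import time


def wait_until(check, timeout=600.0, interval=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds elapse.

    `check` is a caller-supplied callable (hypothetical here), e.g. one
    that runs the HA health checks and returns True when they all pass.
    Returns True on success, False if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call something like `wait_until(ha_checks_pass, timeout=900)` after step 8 and only then run the full OSTF suite, where `ha_checks_pass` is whatever predicate the test framework provides.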

Changed in fuel:
status: Confirmed → Invalid
Dmitry Tyzhnenko (dtyzhnenko) wrote :
Changed in fuel:
status: Invalid → New
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
summary: OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed
- after reschedule router from primary controller
+ after reschedule router from primary controller and destroying it
tags: added: ha rabbitmq
Andrey Maximov (maximov) wrote : Re: OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed after reschedule router from primary controller and destroying it

Can you clarify whether this is a temporary service interruption or a permanent outage?

Dmitry Tyzhnenko (dtyzhnenko) wrote :
Bogdan Dobrelya (bogdando) wrote :

@Andrew, the failover got stuck and the downtime was permanent.
The issue is that a last-man-standing node should not be restarted by the "there are no nodes to join to" logic. Otherwise a race like this one might happen. Snippet http://pastebin.com/N36fX8CX

summary: - OSTF tests 'RabbitMQ availability' and 'RabbitMQ replication' are failed
- after reschedule router from primary controller and destroying it
+ Rabbit join race with OSTF tests 'RabbitMQ availability' and 'RabbitMQ
+ replication' are failed after reschedule router from primary controller
+ and destroying it
Changed in fuel:
importance: High → Critical
Bogdan Dobrelya (bogdando) wrote :

Raised to critical as this race condition is a major defect in the OCF agent logic

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/220103

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/220103
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=403b28c2aff8aa9f37125d3ff1ac09861990da7e
Submitter: Jenkins
Branch: master

commit 403b28c2aff8aa9f37125d3ff1ac09861990da7e
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Sep 3 13:53:13 2015 +0200

    Detect a last man standing for rabbit OCF agent

    W/o this patch, the race condition is possible
    when there is no running rabbit nodes/resource
    master. The rabbit nodes will start/stop in an
    endless loop as a result introducing full downtime
    for AMQP cluster and cloud control plane.

    The solution is:
    * On post-start/post-promote notify, do nothing, if
      either of the following is a true:
      - there is no rabbit resources running or no master
      - the list of rabbit resources being started/promoted
        reported empty
    * For such cases, do not report resource failure and delegate
      recovery, if needed, to the "running out of the cluster"
      monitor's logic.
    * Additionally, report about a last man standing when
      there is no running rabbit resources around.

    Closes-bug: #1491306

    Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f
    Signed-off-by: Bogdan Dobrelya <email address hidden>
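
The decision rule in the commit message above can be illustrated with a small sketch. The real agent is the RabbitMQ OCF script in fuel-library, not Python; the function names, arguments, and return values below are hypothetical and only mirror the conditions listed in the commit message.

```python
def post_notify_action(running_nodes, master, affected_nodes):
    """Decide what to do on a post-start/post-promote notify.

    running_nodes:  names of nodes with a running rabbit resource
    master:         name of the current resource master, or None
    affected_nodes: nodes reported as being started/promoted

    Returns "skip" (do nothing and do not report a failure, leaving
    recovery to the monitor) or "proceed" (handle the notify as usual).
    All names here are assumptions made for illustration.
    """
    # No rabbit resources running, or no master: do nothing.
    if not running_nodes or master is None:
        return "skip"
    # The list of resources being started/promoted is empty: do nothing.
    if not affected_nodes:
        return "skip"
    return "proceed"


def is_last_man_standing(running_nodes, this_node):
    """True when no rabbit resource is running on any other node."""
    return all(node == this_node for node in running_nodes)
```

For example, `post_notify_action(["node-1"], None, ["node-2"])` returns "skip" because there is no master, so the node is not restarted just because there is nothing to join.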

Changed in fuel:
status: In Progress → Fix Committed
Tatyanka (tatyana-leontovich) wrote :

Verified for 7.0:
{"build_id": "288", "build_number": "288", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "288", "build_number": "288", "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "121016a09b0e889994118aa3ea42fa67eabb8f25", "nailgun_sha": "93477f9b42c5a5e0506248659f40bebc9ac23943", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d", "production": "docker", "python-fuelclient_sha": "1ce8ecd8beb640f2f62f73435f4e18d1469979ac", "astute_sha": "a717657232721a7fafc67ff5e1c696c9dbeb0b95", "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045", "release": "7.0", "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"}

Changed in fuel:
status: Fix Committed → Fix Released
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide in the official product documentation.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/239448

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/239448
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=e052a2d5cf2d313853b472b971033f1c83c4d55c
Submitter: Jenkins
Branch: stable/6.1

commit e052a2d5cf2d313853b472b971033f1c83c4d55c
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Sep 3 13:53:13 2015 +0200

    Detect a last man standing for rabbit OCF agent

    W/o this patch, the race condition is possible
    when there is no running rabbit nodes/resource
    master. The rabbit nodes will start/stop in an
    endless loop as a result introducing full downtime
    for AMQP cluster and cloud control plane.

    The solution is:
    * On post-start/post-promote notify, do nothing, if
      either of the following is a true:
      - there is no rabbit resources running or no master
      - the list of rabbit resources being started/promoted
        reported empty
    * For such cases, do not report resource failure and delegate
      recovery, if needed, to the "running out of the cluster"
      monitor's logic.
    * Additionally, report about a last man standing when
      there is no running rabbit resources around.

    Closes-bug: #1491306

    Conflicts:
     files/fuel-ha-utils/ocf/rabbitmq

    Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f

tags: added: on-verification
tags: removed: on-verification