OSTF test RabbitMQ availability failure with timeout error

Bug #1486534 reported by Alexander Kurenyshev
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Artem Panchenko

Bug Description

BVT failed with an OSTF test timeout while checking RabbitMQ availability.

Steps to reproduce:
Scenario:
1. Create cluster with Neutron
2. Add 3 nodes with controller role
3. Add 3 nodes with compute and ceph-osd role
4. Deploy the cluster
5. Check ceph status
6. Run OSTF tests
7. Check that the radosgw daemon is started

Expected result:
All OSTF tests pass.

Actual result:
Test "RabbitMQ availability" failed with timeout error:
2015-08-19 04:53:08 ERROR (nose_storage_plugin) fuel_health.tests.ha.test_rabbit.RabbitSanityTest.test_002_rabbitmqctl_status_ubuntu
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 340, in run
    testMethod()
  File "/usr/lib/python2.6/site-packages/fuel_health/tests/ha/test_rabbit.py", line 89, in test_002_rabbitmqctl_status_ubuntu
    'Cannot retrieve crm status')
  File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 183, in verify
    " Please refer to OpenStack logs for more details.")
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
    raise self.failureException(msg)
AssertionError: Step 3 failed: Time limit exceeded while waiting for to finish. Please refer to OpenStack logs for more details.

The ostf.log shows which command timed out:
fuel_health.ha_base: INFO: Try to execute command <crm resource status master_p_rabbitmq-server>

At the same time, the pacemaker log shows:
Aug 19 12:35:57 [9213] node-1.test.domain.local crmd: notice: throttle_handle_load: High CPU load detected: 3.080000
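
To correlate OSTF timeouts with node overload, the "High CPU load detected" values can be pulled out of the pacemaker log. The snippet below is only a minimal sketch: the helper name extract_cpu_loads and the regular expression are illustrative assumptions, not part of fuel-ostf or pacemaker, and the sample lines are copied from this report.

```python
import re

# Assumed pattern: matches the crmd throttle message quoted in this report.
LOAD_RE = re.compile(r"High CPU load detected: (\d+(?:\.\d+)?)")


def extract_cpu_loads(lines):
    """Return the load values from lines containing the crmd message."""
    loads = []
    for line in lines:
        match = LOAD_RE.search(line)
        if match:
            loads.append(float(match.group(1)))
    return loads


sample = [
    "Aug 19 12:35:57 [9213] node-1.test.domain.local crmd: notice: "
    "throttle_handle_load: High CPU load detected: 3.080000",
    "fuel_health.ha_base: INFO: Try to execute command "
    "<crm resource status master_p_rabbitmq-server>",
]
print(extract_cpu_loads(sample))
```

Lines without the message are skipped, so the same helper can be run over a whole log file.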

Fuel:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "191"
  build_id: "2015-08-18_21-23-31"
  nailgun_sha: "d0b727cdd0d8e7ce5e17e6ea1306d835bfdfb5e7"
  python-fuelclient_sha: "6ad5e0eb4dbbca6cade3444554606927ecd3f16f"
  fuel-agent_sha: "57145b1d8804389304cd04322ba0fb3dc9d30327"
  fuel-nailgun-agent_sha: "e01693992d7a0304d926b922b43f3b747c35964c"
  astute_sha: "e24ca066bf6160bc1e419aaa5d486cad1aaa937d"
  fuel-library_sha: "7c80eed2119260cc15a700068b9eb20ccc773926"
  fuel-ostf_sha: "235f21b230fea15724d625b2dc44ade0464527e2"
  fuelmain_sha: "c9dad194e82a60bf33060eae635fff867116a9ce"

Alexander Kurenyshev (akurenyshev) wrote :
description: updated
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
assignee: Fuel Library Team (fuel-library) → Stanislaw Bogatkin (sbogatkin)
Changed in fuel:
assignee: Stanislaw Bogatkin (sbogatkin) → nobody
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Alex Schultz (alex-schultz)
Alex Schultz (alex-schultz) wrote :

Since the RabbitMQ availability test is an HA test and you only have a single controller, it's never going to work.

Changed in fuel:
status: Confirmed → Invalid
Artem Panchenko (apanchenko-8) wrote :

Actually, this issue affects environments with 1+ controllers; Alexander provided the wrong test scenario in the description (I updated it). Btw, you can confirm that the environment had 3 controllers by checking astute.yaml from the attached diagnostic snapshot.

description: updated
Changed in fuel:
status: Invalid → Confirmed
assignee: Alex Schultz (alex-schultz) → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
importance: High → Critical
Bogdan Dobrelya (bogdando) wrote :

Raised to critical as it has caused several BVT failures in a row.

tags: added: rabbitmq
Bogdan Dobrelya (bogdando) wrote :

This event http://pastebin.com/gu12QjVG looks like the root cause (RC).

Bogdan Dobrelya (bogdando) wrote :

Whatever that "epmd-starter-443584618" means, it looks unrelated and is not the RC of this issue. I checked my lab build from ISO #187, and this event is also logged there sometimes without any visible harm.

Bogdan Dobrelya (bogdando) wrote :

This appears to be an issue with the testing environment:
at the moment the OSTF HA test case failed, node-5 looked unresponsive:
2015-08-19T04:53:07.992424+00:00 node-5 crmd notice: notice: throttle_handle_load: High CPU load detected: 3.580000

OSTF log snippet:
paramiko.transport: DEBUG: Dropping user packet because connection is dead.
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: Dropping user packet because connection is dead.
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: Dropping user packet because connection is dead.
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: EOF in transport thread
paramiko.transport: DEBUG: EOF in transport thread
fuel_health.ha_base: INFO: ssh session to node node-6 was open
fuel_health.ha_base: INFO: Try to execute command <crm resource status master_p_rabbitmq-server>
fuel_health.common.test_mixins: INFO: Timeout 10s exceeded for

The 10s timeout was not enough to hand the "crm resource status master_p_rabbitmq-server" request over to the next node, node-6.
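
The hand-over problem described here can be modelled as one time budget shared by all candidate nodes. The snippet below is a simplified sketch, not the fuel_health implementation: the function name first_responding_node and the per-node durations (8s wasted on node-5, 4s for node-6 to answer) are hypothetical illustration values.

```python
def first_responding_node(attempts, budget_seconds):
    """Walk the nodes in order under one shared time budget.

    attempts: list of (node_name, seconds_spent, answered) tuples.
    The durations are made-up illustration values, not measurements.
    Returns the name of the node that answered, or None when the
    budget is used up before any node answers.
    """
    elapsed = 0.0
    for node_name, seconds_spent, answered in attempts:
        elapsed += seconds_spent
        if elapsed > budget_seconds:
            return None  # budget exhausted before this attempt finished
        if answered:
            return node_name
    return None


# Hypothetical timings: node-5 hangs for 8s, node-6 answers after 4s.
attempts = [("node-5", 8.0, False), ("node-6", 4.0, True)]
print(first_responding_node(attempts, 10))  # budget too small for hand-over
print(first_responding_node(attempts, 20))  # larger budget reaches node-6
```

Under these assumed timings, an unresponsive first node consumes most of a 10s budget, so the healthy node never gets a chance to answer, while a 20s budget leaves room for the hand-over.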

Changed in fuel:
status: Confirmed → Invalid
Bogdan Dobrelya (bogdando) wrote :

Perhaps this bug is still valid for the fuel-qa team: the timeout should be adjusted so that the request can be handed over to the next node.

Changed in fuel:
status: Invalid → Triaged
assignee: Bogdan Dobrelya (bogdando) → Fuel QA Team (fuel-qa)
tags: added: ostf
removed: rabbitmq
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-ostf (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/215131

Bogdan Dobrelya (bogdando) wrote :

The amount of guest vCPU and RAM should also be increased for CI slaves, I think to at least 3G of RAM and 3 vCPUs.

tags: added: system-tests
Bogdan Dobrelya (bogdando) wrote :

The BVT debug session has shown that there is only 1 CPU on CI slaves, which is not good.

Bogdan Dobrelya (bogdando) wrote :

I mean VM guests (controllers, for this specific bug)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-ostf (master)

Reviewed: https://review.openstack.org/215131
Committed: https://git.openstack.org/cgit/stackforge/fuel-ostf/commit/?id=16839cbf471b7142b04c0d2c2d94786bc486fefe
Submitter: Jenkins
Branch: master

commit 16839cbf471b7142b04c0d2c2d94786bc486fefe
Author: Artem Panchenko <email address hidden>
Date: Thu Aug 20 16:22:05 2015 +0300

    Increase timeouts for cmds in RabbitMQ tests

    When controller nodes are overloaded by OpenStack
    services, execution of commands in RabbitMQ availability
    test takes more than 10 seconds, so test fails. But
    actually RabbitMQ works fine.
    Increase timeout values 2 times to workaround the issue.

    Change-Id: I961beed3981ebd50685557b7d66ac24861a3dfa4
    Related-bug: #1486534
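
The effect of a per-command timeout, as described in the commit message above, can be reproduced locally. The snippet below is a rough stand-in only: OSTF runs its commands over SSH, whereas this sketch uses the standard subprocess module with a local sleeping process, and the helper name command_finishes_in_time and the chosen durations are assumptions for illustration.

```python
import subprocess
import sys


def command_finishes_in_time(argv, timeout_seconds):
    """Run argv locally and report whether it ended before the timeout."""
    try:
        subprocess.run(
            argv,
            timeout=timeout_seconds,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The command may be perfectly healthy, just slower than the limit.
        return False
    return True


# Stand-in for a command that is slow because the node is overloaded.
slow_command = [sys.executable, "-c", "import time; time.sleep(1.0)"]
print(command_finishes_in_time(slow_command, 0.3))
print(command_finishes_in_time(slow_command, 10))
```

The same command fails under the tight limit and passes under the generous one, which is why raising the timeout works around the symptom without changing the load on the node.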

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to fuel-infra/jenkins-jobs (master)

Fix proposed to branch: master
Change author: Artem Panchenko <email address hidden>
Review: https://review.fuel-infra.org/10616

Changed in fuel:
status: Triaged → In Progress
Artem Panchenko (apanchenko-8) wrote :

patch for increasing slaves RAM to 3GB:

https://review.fuel-infra.org/#/c/10619/

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Artem Panchenko (apanchenko-8)
Aleksandra Fedorova (bookwar) wrote :

Please be aware that our generic host has 8 CPUs and 32 GB of RAM. The BVT test uses 6 slaves, so by increasing slave parameters we overload the host with virtual machines.

We are going to try it out, but we need to consider other ways.

And we need much better planning and estimates for Fuel development in terms of the impact it has on our infrastructure.
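
The host sizing concern can be checked with simple arithmetic. The snippet below is a back-of-the-envelope sketch using the figures from this thread (8 host CPUs, 32 GB of host RAM, 6 slaves, 3 GB per slave, 2 or 3 vCPUs per slave). The function name overcommit is hypothetical, and the calculation ignores any other VMs or jobs running on the same host.

```python
def overcommit(host_cpus, host_ram_gb, slaves, vcpus_per_slave, ram_gb_per_slave):
    """Return total slave demand and its ratio to the host capacity."""
    total_vcpus = slaves * vcpus_per_slave
    total_ram_gb = slaves * ram_gb_per_slave
    return {
        "total_vcpus": total_vcpus,
        "vcpu_ratio": total_vcpus / float(host_cpus),
        "total_ram_gb": total_ram_gb,
        "ram_ratio": total_ram_gb / float(host_ram_gb),
    }


two_vcpu = overcommit(8, 32, 6, 2, 3)    # 2 vCPUs per slave
three_vcpu = overcommit(8, 32, 6, 3, 3)  # 3 vCPUs per slave
print(two_vcpu)
print(three_vcpu)
```

With 2 vCPUs per slave the six slaves request 12 vCPUs on an 8-CPU host (a ratio of 1.5), and with 3 vCPUs they request 18 (a ratio of 2.25). RAM at 3 GB per slave totals 18 GB of the 32 GB available.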

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to fuel-infra/jenkins-jobs (master)

Reviewed: https://review.fuel-infra.org/10616
Submitter: Aleksandra Fedorova <email address hidden>
Branch: master

Commit: 853469c344de5de960091ac6b005e9562cf2597f
Author: Artem Panchenko <email address hidden>
Date: Thu Aug 20 15:47:28 2015

Allocate 2 CPUs for slaves in BVT/Smoke jobs

Slave nodes (VMs for OS controllers) overload causes
BVT tests failure. So increasing number of virtual CPUs
for slaves to 2.

Change-Id: I7f65bef4a76befdf7236dc8537d259b8440c090e
Closes-bug: #1486534

Changed in fuel:
status: In Progress → Fix Committed
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jenkins-jobs (master)

Reviewed: https://review.fuel-infra.org/10619
Submitter: Aleksandra Fedorova <email address hidden>
Branch: master

Commit: 11776150dedd0999991b77e0e3791a390f4caf57
Author: Yuriy Shamray <email address hidden>
Date: Fri Aug 21 09:16:49 2015

Setup SLAVE_NODE_MEMORY=3072 for all system tests jobs

Changes setting for 7.0 swarm.
Related-Bug: #1486534
Change-Id: Iaea6d222d187471c7dfd46a3e4bd723b8876c05a

Tatyanka (tatyana-leontovich) wrote :

Verified on ISO 256.

Changed in fuel:
status: Fix Committed → Fix Released
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to fuel-infra/jenkins-jobs (master)

Related fix proposed to branch: master
Change author: Artem Panchenko <email address hidden>
Review: https://review.fuel-infra.org/11044

Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to fuel-infra/jenkins-jobs (master)

Reviewed: https://review.fuel-infra.org/11044
Submitter: Aleksandra Fedorova <email address hidden>
Branch: master

Commit: dd02ef0b7904cea1995f691b1644b69e03e153ea
Author: Artem Panchenko <email address hidden>
Date: Tue Sep 1 19:09:32 2015

Increase CPU/RAM resources for custom BVT

Slave nodes (VMs for OS controllers) overload causes
BVT tests failure. So increasing number of virtual CPUs
for slaves to 2 and allocate 3GB for RAM.

Change-Id: I93d1dc71c1d2c33e6031671e6a398ed27028fad9
Related-bug: #1486534

tags: added: area-ostf
removed: ostf