One of Rabbitmq service instances gracefully shutdown

Bug #1594337 reported by Leontii Istomin
This bug affects 1 person
Affects             Status        Importance  Assigned to    Milestone
Fuel for OpenStack  Invalid       Medium      MOS Oslo
StackLight          Fix Released  High        Swann Croiset

Bug Description

Detailed bug description:
 During the following Rally scenario (http://paste.openstack.org/show/520583/), which starts 2000 VMs with an attached 1 GB volume and security groups, using 5 threads, one of the RabbitMQ service instances goes down.
From the RabbitMQ logs on node-13:
=INFO REPORT==== 20-Jun-2016::09:58:59 ===
Stopped RabbitMQ application
--
=INFO REPORT==== 20-Jun-2016::09:59:01 ===
Stopped RabbitMQ application
--
=INFO REPORT==== 20-Jun-2016::10:00:34 ===
Stopped RabbitMQ application
--
=INFO REPORT==== 20-Jun-2016::10:03:14 ===
Stopped RabbitMQ application

from pacemaker.log: http://paste.openstack.org/show/520592/
Steps to reproduce:
 1. deploy Fuel 9.0-mos-495
 2. perform density rally tests
Expected results:
 tests passed
Actual result:
 tests failed due to RabbitMQ stopping
Reproducibility:
 each time
Workaround:
 Not yet
Impact:
 availability of all actions that depend on RabbitMQ
Description of the environment:
 Operating system: Ubuntu
 Versions of components: MOS-9.0
 Reference architecture: 3 controllers, 176 computes, 20 computes+Ceph, Ceph for all
 Network model: vxlan+dvr
 Related projects installed: LMA 10
Additional information:
 Diagnostic snapshot: mos-scale-share.mirantis.com/fuel-snapshot-2016-06-20_10-53-35.tar.gz

Tags: scale
Leontii Istomin (listomin) wrote :

The issue affects availability of OpenStack services. From rally.log:
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner [-] Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner Traceback (most recent call last):
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/rally/task/runner.py", line 64, in _run_scenario_once
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner getattr(scenario_inst, method_name)(**kwargs)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "rally_plugins/nova_density.py", line 54, in boot_attach_and_list_with_secgroups
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner rules_per_security_group)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/rally/plugins/openstack/scenarios/nova/utils.py", line 811, in _create_rules_for_security_group
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner cidr=cidr)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/v2/security_group_rules.py", line 75, in create
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner 'security_group_rule')
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/base.py", line 333, in _create
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner resp, body = self.api.client.post(url, body=body)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/client.py", line 484, in post
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner return self._cs_request(url, 'POST', **kwargs)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/client.py", line 459, in _cs_request
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner resp, body = self._time_request(url, method, **kwargs)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/client.py", line 432, in _time_request
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner resp, body = self.request(url, method, **kwargs)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner File "/data/rally/virtualenv/local/lib/python2.7/site-packages/novaclient/client.py", line 426, in request
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner raise exceptions.from_response(resp, body, url, method)
09:59:28 2016-06-20 09:59:28.198 175 ERROR rally.task.runner ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
09:59:29 2016-06-20 09:59:28.198 175 ERROR rally.task.runner <class 'keystoneauth1.exceptions.connection.ConnectTimeou...


description: updated
Alexey Lebedeff (alebedev-a) wrote :

`rabbitmqctl status` failed to contact rabbitmq at 2016-06-20T09:57:53 with diagnostics:

rabbit@messaging-node-13:
  * connected to epmd (port 4369) on messaging-node-13
  * epmd reports node 'rabbit' running on port 41055
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?
  * suggestion: is the Erlang distribution using TLS?

Interestingly enough, the subsequent `rabbitmqctl stop` succeeded and RabbitMQ was shut down gracefully. This indicates that some transient condition prevented `rabbitmqctl` from contacting the broker for at least net_ticktime (currently 10 seconds).

So my suspicions are:
1) The Erlang async thread pool size is set to 30 by fuel-library, which overrides the RabbitMQ-calculated value of 768.
2) atop shows that `rabbitmqctl` is sometimes called more than 600 times within 20 seconds.
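
An observation like the one in point 2 can be checked from any list of process start times. The following is a minimal sketch, assuming the start timestamps of `rabbitmqctl` processes have already been extracted (for example from atop's records) into a list of seconds. The function name and its parameters are hypothetical and not part of any tool mentioned in this report.

```python
from collections import deque


def peak_calls_in_window(start_times, window_seconds=20.0):
    """Return the largest number of process starts seen in any window.

    start_times: iterable of start timestamps in seconds (any epoch).
    window_seconds: window length; each window is half-open, (t - w, t].
    """
    window = deque()
    peak = 0
    for t in sorted(start_times):
        window.append(t)
        # Drop starts that fall outside the window ending at t.
        while window[0] <= t - window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak
```

With such a helper, a result above 600 for `peak_calls_in_window(times, 20.0)` would match the rate reported above.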

Leontii Istomin (listomin) wrote :

The issue went away after we stopped the LMA collectors and the collectd service on the controller nodes:
crm resource stop clone_metric_collector
crm resource stop clone_log_collector
crm resource restart master_p_rabbitmq-server
for i in 13 87 107; do ssh node-$i "service collectd stop"; done

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
Changed in lma-toolchain:
status: New → Triaged
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
Dina Belova (dbelova)
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
Simon Pasquier (simon-pasquier) wrote :

Right now, the LMA collector uses rabbitmqctl to collect RabbitMQ metrics. In practice, it calls rabbitmqctl 5 times at every sampling interval (10 seconds by default, but the interval grows if RabbitMQ is slow):
- status
- cluster_status
- list_connections
- list_exchanges
- list_queues

When RabbitMQ is already under heavy load, the collector adds to that load, and as a consequence Pacemaker might assume that the RabbitMQ service is down.

A better strategy for the collector would be to use the RabbitMQ admin API [1] for collecting data. In particular, the /api/overview and /api/nodes resources should provide all the information that we need in 2 calls, without having to aggregate the numbers. We should also drop support for collecting individual counters (currently the notification and ceilometer queues). Finally, we could also improve the monitoring scope using /api/aliveness-test/vhost.

[1] https://www.rabbitmq.com/management.html#http-api
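
The strategy proposed in this comment can be sketched roughly as follows. This is an illustration only, not the code merged for this bug: it assumes the management plugin is enabled and reachable over HTTP with basic auth, and the helper names (`api_url`, `fetch_json`, `summarize_overview`, `collect`) are hypothetical. The `object_totals` and `queue_totals` fields of the overview document are read defensively, since which counters are present can depend on the broker version and state.

```python
import base64
import json
import urllib.parse
import urllib.request


def api_url(base, *segments):
    """Build a management API URL, percent-encoding each path segment.

    The default vhost "/" has to be sent as "%2F", hence safe="".
    """
    path = "/".join(urllib.parse.quote(s, safe="") for s in segments)
    return base.rstrip("/") + "/api/" + path


def fetch_json(url, user, password, timeout=5.0):
    """GET one management API resource using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": "Basic " + token}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_overview(overview):
    """Pick cluster-wide counters out of an /api/overview document."""
    objects = overview.get("object_totals", {})
    queues = overview.get("queue_totals", {})
    return {
        "connections": objects.get("connections", 0),
        "channels": objects.get("channels", 0),
        "exchanges": objects.get("exchanges", 0),
        "queues": objects.get("queues", 0),
        "consumers": objects.get("consumers", 0),
        "messages": queues.get("messages", 0),
        "messages_ready": queues.get("messages_ready", 0),
        "messages_unacknowledged": queues.get("messages_unacknowledged", 0),
    }


def collect(base, user, password, vhost="/"):
    """Gather metrics with a few HTTP calls instead of rabbitmqctl runs."""
    overview = fetch_json(api_url(base, "overview"), user, password)
    nodes = fetch_json(api_url(base, "nodes"), user, password)
    alive = fetch_json(api_url(base, "aliveness-test", vhost), user, password)
    return {
        "totals": summarize_overview(overview),
        "node_count": len(nodes),
        "aliveness": alive,
    }
```

A call such as `collect("http://127.0.0.1:15672", user, password)` would then replace the five `rabbitmqctl` invocations per interval; the address and port here are assumptions about a default management plugin setup, and the credentials are placeholders.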

Dina Belova (dbelova)
Changed in fuel:
milestone: none → 9.0-updates
Swann Croiset (swann-w)
Changed in lma-toolchain:
milestone: none → 0.10.1
Swann Croiset (swann-w)
Changed in lma-toolchain:
importance: Undecided → High
Swann Croiset (swann-w)
Changed in lma-toolchain:
milestone: 0.10.1 → 0.11.0
Swann Croiset (swann-w)
Changed in lma-toolchain:
milestone: 0.11.0 → 0.10.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/340426

Changed in lma-toolchain:
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → Swann Croiset (swann-w)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-plugin-influxdb-grafana (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/340428

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/340426
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=1ae8829823a4df11248b191d96e130fe8e230504
Submitter: Jenkins
Branch: master

commit 1ae8829823a4df11248b191d96e130fe8e230504
Author: Swann Croiset <email address hidden>
Date: Mon Jul 11 16:25:03 2016 +0200

    Use RabbitMQ management API

    The patch uses the management API to retrieve metrics instead of
    executing rabbitmqctl command.

    A side effect is that all metrics per-queues are not collected anymore.

    Change-Id: I5dab785321e369ec0e1a69a79e0700b276810925
    Closes-bug: #1594337

Changed in lma-toolchain:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-plugin-influxdb-grafana (master)

Reviewed: https://review.openstack.org/340428
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-influxdb-grafana/commit/?id=3ac0cfc423e8570b3d3d2377b654403d5f22c79a
Submitter: Jenkins
Branch: master

commit 3ac0cfc423e8570b3d3d2377b654403d5f22c79a
Author: Swann Croiset <email address hidden>
Date: Mon Jul 11 17:04:16 2016 +0200

    Rework RabbitMQ dashboard

    * Remove single stat "Total Node", which was unrelevant
    * Add single stat "Channels"
    * Add missing environment filters

    Change-Id: Iab5876b99f1a49037d145f65c0f851417dcccfe6
    Depends-On: I5dab785321e369ec0e1a69a79e0700b276810925
    Related-Bug: #1594337

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-plugin-influxdb-grafana (stable/0.10)

Related fix proposed to branch: stable/0.10
Review: https://review.openstack.org/341610

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-lma-collector (stable/0.10)

Fix proposed to branch: stable/0.10
Review: https://review.openstack.org/341611

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-lma-collector (stable/0.10)

Reviewed: https://review.openstack.org/341611
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=ce524c0e62f2a7a82a5007857e246d42b3ceda5a
Submitter: Jenkins
Branch: stable/0.10

commit ce524c0e62f2a7a82a5007857e246d42b3ceda5a
Author: Swann Croiset <email address hidden>
Date: Mon Jul 11 16:25:03 2016 +0200

    Use RabbitMQ management API

    The patch uses the management API to retrieve metrics instead of
    executing rabbitmqctl command.

    A side effect is that all metrics per-queues are not collected anymore.

    Change-Id: I5dab785321e369ec0e1a69a79e0700b276810925
    Closes-bug: #1594337
    (cherry picked from commit 1ae8829823a4df11248b191d96e130fe8e230504)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-plugin-influxdb-grafana (stable/0.10)

Reviewed: https://review.openstack.org/341610
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-influxdb-grafana/commit/?id=49196b448feaf408497434255ab2ded4bd651c0a
Submitter: Jenkins
Branch: stable/0.10

commit 49196b448feaf408497434255ab2ded4bd651c0a
Author: Swann Croiset <email address hidden>
Date: Mon Jul 11 17:04:16 2016 +0200

    Rework RabbitMQ dashboard

    * Remove single stat "Total Node", which was unrelevant
    * Add single stat "Channels"
    * Add missing environment filters

    Change-Id: Iab5876b99f1a49037d145f65c0f851417dcccfe6
    Depends-On: I5dab785321e369ec0e1a69a79e0700b276810925
    Related-Bug: #1594337
    (cherry picked from commit 3ac0cfc423e8570b3d3d2377b654403d5f22c79a)

Changed in lma-toolchain:
status: Fix Committed → Fix Released
Changed in fuel:
status: Confirmed → Invalid