RabbitMQ ctl hangs and got restarted under load of Openstack services

Bug #1460762 reported by Aleksandr Shaposhnikov
This bug affects 6 people
Affects              Status        Importance  Assigned to        Milestone
Fuel for OpenStack   Fix Released  High        Bogdan Dobrelya
  5.1.x              Won't Fix     High        Denis Meltsaykin
  6.0.x              Won't Fix     High        Denis Meltsaykin
  6.1.x              Fix Released  High        Denis Meltsaykin
  7.0.x              Fix Released  High        Bogdan Dobrelya

Bug Description

Basically, our scale rally scenarios failed because the messaging cluster in MOS was restarted for unknown reasons.
There was NO tenant network load generated, so there was only regular load over the management network.

api: '1.0'
astute_sha: 5d570ae5e03909182db8e284fbe6e4468c0a4e3e
auth_required: true
build_id: 2015-05-28_20-55-26
build_number: '478'
feature_groups:
- mirantis
fuel-library_sha: 09822a44c2298437882062a43c8ab0bcef952524
fuel-ostf_sha: 6b4ddbfd3126b77f79759721e86d395bf106b177
fuelmain_sha: 6b5712a7197672d588801a1816f56f321cbceebd
nailgun_sha: 4344fe68b5c93d2e7f7254076aab62aa3a612e42
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: e19f1b65792f84c4a18b5a9473f85ef3ba172fce
release: '6.1'
release_versions:
  2014.2.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 5d570ae5e03909182db8e284fbe6e4468c0a4e3e
      build_id: 2015-05-28_20-55-26
      build_number: '478'
      feature_groups:
      - mirantis
      fuel-library_sha: 09822a44c2298437882062a43c8ab0bcef952524
      fuel-ostf_sha: 6b4ddbfd3126b77f79759721e86d395bf106b177
      fuelmain_sha: 6b5712a7197672d588801a1816f56f321cbceebd
      nailgun_sha: 4344fe68b5c93d2e7f7254076aab62aa3a612e42
      openstack_version: 2014.2.2-6.1
      production: docker
      python-fuelclient_sha: e19f1b65792f84c4a18b5a9473f85ef3ba172fce
      release: '6.1'

A diagnostic snapshot, or a link to it, will be attached once it has been downloaded.

Tags: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

The snapshot is ~2.7 GB, so it can't be uploaded here. I can provide it on request or attach any needed files from it.
(fuel-snapshot-2015-06-01_18-07-47.tar.xz)

Changed in mos:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in mos:
milestone: none → 6.1
status: New → Incomplete
importance: Undecided → High
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I requested the snapshot. Please also attach all lrmd.log and pengine.log files from /var/log/remote/node-*, as well as the rally logs.

tags: added: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
Changed in mos:
status: Incomplete → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thanks! I'm investigating the root cause.

Changed in mos:
status: New → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As far as I can see from the lrmd logs, there was no full AMQP cluster downtime; node-8 was always available for AMQP connections:
node-8: was running as the resource master, no failures, the rabbit app was available the whole time.
node-13: the rabbit app became unresponsive and was restarted:
2015-05-30T02:56:58.570538+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbitmqctl is not responding. The resource is failed.
node-15: the rabbit app became unresponsive and was restarted:
2015-05-30T02:56:58.403726+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbitmqctl is not responding. The resource is failed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

My mistake, node-8 was also affected by the hung rabbitmq app:
2015-05-30T02:56:38.802818+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-05-30T02:56:38.853965+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbitmqctl is not responding. The resource is failed.

So, indeed, there was a full AMQP cluster downtime. It looks like it could have been caused by high CPU load; I'll continue investigating.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This behavior looks valid from the RabbitMQ clustering point of view, as recent fixes force hung rabbit nodes to restart. A hung node cannot be "healed" in any other way; it is a completely broken state, and nothing can be done but a restart. The root cause of the rabbit nodes hanging is not clear, but the crmd logs contain many "notice: throttle_handle_load: High CPU load detected" entries on all controllers:

node-15, crmd.log:
2015-05-30T02:03:29.695240+00:00 notice: notice: throttle_handle_load: High CPU load detected: 21.230000
node-13, crmd.log:
2015-05-30T02:00:22.564736+00:00 notice: notice: throttle_handle_load: High CPU load detected: 19.639999
node-8, crmd.log:
2015-05-30T02:56:43.658271+00:00 notice: notice: throttle_handle_load: High CPU load detected: 35.240002

This could be the root cause of the hung rabbit apps as well.
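A quick way to correlate these two symptoms across controllers is a sketch like the following (it assumes crmd.log and lrmd.log sit directly under the /var/log/remote/node-* directories mentioned earlier in this thread):

# Sketch: count CPU-throttling notices and rabbitmqctl monitor failures per node.
# Assumes the /var/log/remote/node-*/ layout referenced above.
grep -c "throttle_handle_load: High CPU load detected" /var/log/remote/node-*/crmd.log
grep -c "rabbitmqctl is not responding" /var/log/remote/node-*/lrmd.log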

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I'd mark this issue as Invalid. We should investigate the high CPU load as a separate bug related to service isolation.

Changed in mos:
status: In Progress → Invalid
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote :

According to the lrmd logs from ticket https://bugs.launchpad.net/fuel/+bug/1461586:

2015-06-03T04:54:03.191168+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-06-03T04:54:03.197232+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbitmqctl is not responding. The resource is failed.
2015-06-03T04:54:03.873629+00:00 info: INFO: p_rabbitmq-server: demote: action begin.
2015-06-03T04:54:03.879774+00:00 info: INFO: p_rabbitmq-server: get_monitor(): CHECK LEVEL IS: 0
2015-06-03T04:54:04.121298+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
2015-06-03T04:54:04.126604+00:00 info: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
2015-06-03T04:54:04.378931+00:00 info: INFO: p_rabbitmq-server: get_monitor(): master attribute is 0
2015-06-03T04:54:04.613379+00:00 info: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
2015-06-03T04:54:04.618044+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. checking if we are the part of healthy cluster
2015-06-03T04:54:04.636400+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. looking for master on node-49.domain.tld
2015-06-03T04:54:04.655893+00:00 info: INFO: p_rabbitmq-server: get_monitor(): fetched master attribute for node-49.domain.tld. attr value is 1
2015-06-03T04:54:04.661757+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. looking for master on node-1.domain.tld
2015-06-03T04:54:04.681031+00:00 info: INFO: p_rabbitmq-server: get_monitor(): fetched master attribute for node-1.domain.tld. attr value is 0
2015-06-03T04:54:04.687028+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-1.domain.tld
2015-06-03T04:54:04.927305+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running and is member of healthy cluster
2015-06-03T04:54:04.932689+00:00 info: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
2015-06-03T04:54:04.974043+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (14477) with node-49.domain.tld (12530)
2015-06-03T04:54:05.001216+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (14477) with node-44.domain.tld (12549)
2015-06-03T04:54:05.016135+00:00 info: INFO: p_rabbitmq-server: get_monitor(): we are the oldest node

And here is the CPU usage on the node at that moment: http://paste.openstack.org/show/259735/
I think the root cause of this ticket is not high CPU load.

Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

Guys, please don't close this ticket or mark it Invalid just because you think the reason is high CPU load.

Changed in mos:
status: Invalid → New
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

I think the problem is CPU load. Alex's snapshot doesn't have atop logs for a long enough period, but here is (summarized) data from a similar environment:
65% nova-api
47% nova-conductor
36% beam.smp
35% httpd
28% cinder-api
22% neutron-server
13% mysqld
11% neutron-rootwr
5% cinder-volume

avg 5 minutes = 7.45
avg 15 minutes = 8.58

8.58 / 12 CPUs = 0.715 per core. That is a really HUGE load, and I think the whole problem stems from there.
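For reference, a minimal sketch of the per-core load arithmetic used above, run against a live node (the 12-core figure comes from the commenter's environment; nproc is used here instead):

# Sketch: print the 15-minute load average and its per-core value.
cores=$(nproc)
awk -v c="$cores" '{printf "load15=%.2f per-core=%.3f\n", $3, $3/c}' /proc/loadavg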

Dina Belova (dbelova)
affects: mos → fuel
Changed in fuel:
milestone: 6.1 → none
milestone: none → 6.1
Revision history for this message
Leontii Istomin (listomin) wrote :

reproduced again during rally tests:
node-1
=ERROR REPORT==== 4-Jun-2015::04:20:40 ===
AMQP connection <0.19953.13> (running), channel 0 - error:
{amqp_error,connection_forced,
            "broker forced connection closure with reason 'shutdown'",none}

atop -r /var/log/atop/atop_20150604 -b "04:20"
http://paste.openstack.org/show/263139/

At the time, two rally scenarios were running:
boot_and_delete_server_with_secgroups:
   iterations: 470
   concurrency: 20
create-and-delete-stack_with_neutron:
   iterations: 47
   concurrency: 5

diagnostic snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-04_13-09-28.tar.xz
rally logs: http://mos-scale-share.mirantis.com/rally-2015-06-04_13-09-28.tar.gz

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Folks, according to the CPU load data, this environment simply cannot handle this type of load. If you think this is a bug, you will need to get Nova and other OpenStack developers to fix their code so it does not consume so much CPU. I am marking this bug as Invalid unless you provide a really good rationale for why this is a bug in the Fuel deployment framework.

Changed in fuel:
status: New → Invalid
Changed in fuel:
status: Invalid → New
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

A loadavg of 0.7 per CPU core is not "HUGE"; it barely approaches 100% CPU utilization, so you can't use it as an excuse to dismiss this bug. Load around 1.0 may affect real-time processes such as video recording, but if a non-realtime process isn't getting scheduled enough CPU cycles at a load below 5.0 or even 10.0 per CPU core, the kernel is doing something wrong.

A quick Google search brought up this discussion of reasonable loadavg values:
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Folks, the high CPU root cause was only a suggestion, as there is no way to understand why rabbitmqctl may hang without an Erlang developer.
For example, this could be related: http://erlang.org/pipermail/erlang-bugs/2014-September/004622.html
Hence, I suggest trying to reproduce this issue on bare-metal controllers instead of qemu-kvm ones.

summary: - RabbitMQ cluster got restarted under load of Openstack services
+ RabbitMQ ctl hangs and got restarted under load of Openstack services
Changed in fuel:
status: New → Incomplete
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This could be related as well: https://groups.google.com/forum/#!topic/rabbitmq-users/Ja0iyfF0Szw
So, as recommended by Michael, we should try to reproduce this issue with a lower value for net_ticktime, perhaps 10-15 seconds.
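A minimal sketch of such an experiment, assuming the classic /etc/rabbitmq/rabbitmq.config format and an otherwise empty file (in a real deployment the kernel section must be merged into the existing config; the actual change landed later via the fuel-library patch referenced below):

# Sketch: lower the Erlang kernel net_ticktime to 10 seconds.
cat > /etc/rabbitmq/rabbitmq.config <<'EOF'
[
  {kernel, [{net_ticktime, 10}]}
].
EOF
# The broker has to be restarted (via pacemaker in this setup) for the change
# to take effect; then verify it with the same check used later in this thread:
rabbitmqctl eval "application:get_env(kernel, net_ticktime)."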

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And this is also from the mail thread above: "This looks like bug 26404, fixed in 3.4.0:

26404 prevent queue synchronisation from hanging if there is a very
       short partition just as it starts (since 3.1.0) "

So we should consider upgrading our RabbitMQ distribution as well.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Dmitry, do not forget that half of these CPUs are not real cores but HT threads. Moreover, if you look at Bogdan's comment, there were moments when the load average reached 35, which is about 3 times the CPU count.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

But as far as I can see from that mailing list, the issue with hanging rabbitmqctl is yet to be resolved in any version. So the only solution is to restart the affected node with the hung rabbit app, and that is exactly what the OCF monitor logic does now.
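For context, a minimal sketch of the kind of timeout-guarded check the OCF monitor performs (not the actual resource agent code; the 30-second timeout is an assumption, but exit code 137 = 128 + SIGKILL matches the su_rabbit_cmd log lines quoted earlier):

# Sketch of a timeout-guarded health check; a hang is converted into exit 137.
if ! timeout -s KILL 30 rabbitmqctl list_channels > /dev/null 2>&1; then
    echo "rabbitmqctl is not responding; the resource is failed" >&2
fi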

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Another mail thread, https://groups.google.com/forum/#!topic/rabbitmq-users/iZWokxvhlaU, indicates that short-time CPU load spikes may be the root cause of temporary partitions which RabbitMQ cannot detect and survive well, and thus hangs. There is also a debug script attached there which we could use to troubleshoot this issue and send reports to the RabbitMQ developers.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

We had a meeting about this issue with:
- Dan Hata
- Leontiy Istomin
- Dina Belova
- Andrew Maksimov
- Tomasz Napierala
and others. The proposal is to try to reproduce this issue again in the scale lab, and then get access to the lab to troubleshoot the live environment. Until that happens, we will have to keep this bug as Incomplete, since there is no engineering agreement yet on the root cause of the issue in the first place (the OpenStack control plane becomes inoperable).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Good, let's reproduce it then. Note that this bug is not about "OpenStack control plane becomes inoperable"; it is about a short downtime in the control plane caused by unexpected rabbit node hangs. This situation was correctly handled by pacemaker: all hung nodes were restarted. So, to be clear, no permanent control-plane inoperability was reported. When all rabbit nodes hang and get restarted, a short control plane downtime *should* be expected, and I believe the test/benchmark suite should be able to survive it instead of aborting execution.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to http://lists.openstack.org/pipermail/openstack-dev/2015-June/066148.html, we should try to reproduce this issue with net_ticktime adjusted to 10 seconds.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/189292

Changed in fuel:
status: Incomplete → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

How to verify whether the net_ticktime was adjusted by the patch:
# rabbitmqctl eval "application:get_env(kernel, net_ticktime)."
{ok,10}
...done.

Changed in fuel:
milestone: 6.1 → 7.0
Revision history for this message
Dina Belova (dbelova) wrote :

It looks like the bug was reproduced on 50 nodes with the 6.1 #521 ISO. Please take a look.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There was a single-node failover from 2015-06-08T23:47:01 to 2015-06-08T23:49:20 because partitions were detected and healed.
The failover events look correct (http://pastebin.com/44CjYmbA), so this does not seem to be a reproduction of the original bug.
In the originally reported bug, all nodes had list_channels hang and were restarted, and there were no autoheal attempts, which means rabbit failed to detect the partitions.
In this new case, I can see that partitions were detected and healed just fine, and there were no messages about hung rabbitmqctl in the lrmd logs.
There was only one restart, of rabbit@node-49, and that node was unavailable from 2015-06-08T23:47:01 to 2015-06-08T23:49:20; this was caused by autohealing the partitions.
I can also see numerous events like "Node 'rabbitmqctlxxxx@foo' not responding", but only one real partition case was detected and recovered, so this looks OK and these notices may be ignored.
@Dina: could you and Leontiy now clarify how you detect and define the failure criterion of your test case?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The related bug https://bugs.launchpad.net/fuel/+bug/1463433 from the scale lab has shown that net_ticktime = 10 allowed us to fix this issue: rabbitmqctl list_channels no longer seems to hang, and partitions are detected and recovered when high CPU load spikes hit a rabbit node.
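A quick way to confirm this behavior on a live rabbit node (standard rabbitmqctl commands; the partitions entry is part of the classic cluster_status output):

# Check whether the broker currently reports any network partitions;
# with autoheal the list should return to [] after recovery.
rabbitmqctl cluster_status | grep partitions
# Confirm the configured partition handling strategy:
rabbitmqctl eval "application:get_env(rabbit, cluster_partition_handling)."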

Revision history for this message
Leontii Istomin (listomin) wrote :

Has been reproduced with the following configuration:
Baremetal,Ubuntu,IBP,HA, Neutron-gre,Ceph-all,Nova-debug,Nova-quotas, 6.1-522
Controllers:3 Computes+Ceph_OSD:47

rabbitmq on node-7 was broken during rally tests:
from <email address hidden>:
=ERROR REPORT==== 18-Jun-2015::15:52:10 ===
AMQP connection <0.21623.582> (running), channel 0 - error:
{amqp_error,connection_forced,
           "broker forced connection closure with reason 'shutdown'",none}
from pacemaker.log:
2015-06-18T15:51:44.477300+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): rabbitmqctl is not responding. The resource is failed.
rabbitmqctl hung and the resource was restarted

atop -r /var/log/atop/atop_20150618-20150619 -b "15:52"
http://paste.openstack.org/show/304479/

The following rally scenarios were running at the time:
boot_server_with_network_in_single_tenant - a lot of VMs (2000) booting
create-and-delete-volume - cinder test
create_and_list_routers - neutron test
resize_server - nova test

From Bogdan Dobrelya
node-88 was elected as the new master, so it was also down from 18T15:52:15 to 18T15:56:35. Normally at least one node should stay alive, but it was restarted as well because ctl hung on two nodes, including the one that was supposed to stay alive. As a result the cluster recovered, but the application failed to survive the recovery procedure.

Diagnostic snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-19_10-52-44.tar.xz

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note: this reproduction was without the patch that is on review.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Closing for 7.0, as the related patch https://review.openstack.org/189292 was merged and should be enough to address this issue.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

This is still seen in 7.0 in the scale lab when RabbitMQ CPU consumption is over 400-500%.

We had to run pcs resource unmanage p_rabbitmq-server to get it to work stably.
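For reference, the commands Eugene refers to (standard pcs commands; while the resource is unmanaged, pacemaker stops monitoring and restarting RabbitMQ, so this is only a temporary workaround):

# Temporarily stop pacemaker from monitoring/restarting RabbitMQ:
pcs resource unmanage p_rabbitmq-server
# Hand control back once the investigation is done:
pcs resource manage p_rabbitmq-server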

Revision history for this message
Leontii Istomin (listomin) wrote :

Lowering the net_ticktime parameter helps a bit, but it does not address the root cause of the issue.
The root cause should be tracked in https://bugs.launchpad.net/fuel/+bug/1479815.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. Also, the possible solution of backporting the RabbitMQ OCF script is covered in detail in the Operations Guide of the official product documentation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/251760

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/251760
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=ff320db58993770880aee126cdc27dad9e41aa7f
Submitter: Jenkins
Branch: stable/6.1

commit ff320db58993770880aee126cdc27dad9e41aa7f
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jun 8 15:24:36 2015 +0200

    Adjust the RabbitMQ kernel net_ticktime parameter

    W/o this fix, the situation is possible when some
    arbitrary queue masters got stuck due to high
    short-time CPU load spikes ending up in the
    rabbitmqctl hanged and affected rabbit node erased
    and restarted by OCF monitor logic of the related
    pacemaker RA.

    This is an issue as it seems that Oslo.messaging yet
    to be in a good shape to survive this short-time
    outage of underlying AMQP layer without disrupting
    interrupted RPC calls being executed.

    The workaround is to reduce the net_ticktime kernel
    parameter from default 60 seconds to 10 seconds.
    What would allow the RabbitMQ cluster to better detect
    short-time partitions and autoheal them. And when the
    partition has been detected and healed, the rabbitmqctl
    would not hang hopefully as stucked queue masters will
    be recovered.

    Partial-bug: #1460762

    Change-Id: I3aa47b51ae080bb4a8b298c61a629ac8225d2abd
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 3ffae36b9de967747a1cc1045eae0923d6bd893c)
