[shaker] test failing when rabbitmq node raises memory alert

Bug #1463433 reported by Leontii Istomin
This bug affects 1 person
Affects              Status         Importance  Assigned to       Milestone
Fuel for OpenStack   Fix Committed  High        Bogdan Dobrelya
5.1.x                Won't Fix      High        Denis Meltsaykin
6.0.x                Won't Fix      High        Denis Meltsaykin
6.1.x                Fix Released   High        Michal Rostecki
7.0.x                Fix Released   High        Bogdan Dobrelya

Bug Description

Steps to reproduce:

1. Run Shaker tests. (For an explanation of what Shaker is, see the User Impact section below.)

First, RabbitMQ experienced a network partition. After it recovered from the partition, it started consuming RAM on one of the nodes. Once it had consumed tens of GB of RAM, it stopped accepting new messages from OpenStack services. As a result, OpenStack failed all incoming requests. The issue did not go away after the tests finished.

Conditions for reproduction:
The issue has been reproduced only once so far. Another run finished successfully, without RabbitMQ consuming a lot of memory or being stuck for a long time (see comment #22 below).

User impact:

Shaker creates VMs in batches of 8 at a time and tests network throughput between them. The generated traffic is rather close to the network throughput limit of the lab (a little less than 10G). When traffic is cross-network (i.e. it flows between VMs located in different Neutron networks), it always goes through the controllers. Sometimes this hits RabbitMQ and it switches to an inoperable state. The rate of messages hitting RabbitMQ is in the hundreds per second.

From the user's point of view, the cloud does not work until the issue is resolved.

Workaround:

Restart all RabbitMQ nodes. Once RabbitMQ is operable again (several minutes after the restart), the cloud should start working properly.

Current plan:

Reproduce the issue once more and investigate it more thoroughly.

Original description by Leontiy Istomin
=======================================

During a Shaker test (http://pyshaker.readthedocs.org/en/latest/index.html) we found that ksoftirqd consumes a lot of CPU:
http://paste.openstack.org/show/277774/
atop SHIFT+P ksoftirqd:
http://paste.openstack.org/show/277891/

At the time rabbitmq on this controller node (node-49) was down:
from node-1
=INFO REPORT==== 8-Jun-2015::23:48:07 ===
rabbit on node 'rabbit@node-49' down

from node-44
=INFO REPORT==== 8-Jun-2015::23:46:55 ===
rabbit on node 'rabbit@node-49' down

Configuration:
Baremetal,Centos,fedora_kernel,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
Controllers:3 Computes:47

The net_ticktime parameter has been added:
http://paste.openstack.org/show/278020/

api: '1.0'
astute_sha: 7766818f079881e2dbeedb34e1f67e517ed7d479
auth_required: true
build_id: 2015-06-08_06-13-27
build_number: '521'
feature_groups:
- mirantis
fuel-library_sha: f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d
fuel-ostf_sha: 7c938648a246e0311d05e2372ff43ef1eb2e2761
fuelmain_sha: bcc909ffc5dd5156ba54cae348b6a07c1b607b24
nailgun_sha: 4340d55c19029394cd5610b0e0f56d6cb8cb661b
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-09_09-55-58.tar.xz

Tags: scale 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As far as I can see, the timestamp of the first "timed out waiting for reply" error fits into the described failover scope, see
https://bugs.launchpad.net/fuel/+bug/1460762/comments/27.
So we should debug this case more deeply; hopefully some developer from the MOS Oslo team can help us here.
As for the underlying AMQP layer, as I mentioned earlier, I see no issues there besides a clean failover.
My guess is that we have a classic AP+/C- case here: at least two AMQP nodes were available (A+), the partition was recovered (P+), some reply_queues were lost (C-), but the app layer failed to survive such a loss.

description: updated
Changed in fuel:
status: New → Confirmed
summary: - rabbitmq was down on one of controllers
+ rabbitmq was down on one of controllers during shaker test
Changed in fuel:
milestone: none → 7.0
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Leontii Istomin (listomin) wrote : Re: rabbitmq was down on one of controllers during shaker test

This may be the same issue as https://bugs.launchpad.net/fuel/+bug/1460762

description: updated
summary: - rabbitmq was down on one of controllers during shaker test
+ rabbitmq was down on one of controllers during shaker test but there are
+ multiple "Timed out waiting for reply to ID" events logged by
+ Oslo.messaging
summary: rabbitmq was down on one of controllers during shaker test but there are
multiple "Timed out waiting for reply to ID" events logged by
- Oslo.messaging
+ Oslo.messaging after rabbitmq recovered from partitioning
description: updated
Revision history for this message
Dina Belova (dbelova) wrote : Re: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning

Is that true that we assign this to 7.0?

Revision history for this message
Leontii Istomin (listomin) wrote :

[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...
Error: unable to connect to node 'rabbit@node-1': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-1']

rabbit@node-1:
  * connected to epmd (port 4369) on node-1
  * node rabbit@node-1 up, 'rabbit' application running

current node details:
- node name: 'rabbitmqctl22420@node-1'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

You have new mail in /var/spool/mail/root
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-44','rabbit@node-49']}]},
 {running_nodes,['rabbit@node-49','rabbit@node-44','rabbit@node-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
...done.

description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

Roman Podoliaka has found that rabbitmq keeps 15G of memory on node-1:
http://xsnippet.org/360727/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raising to Critical and returning this to the 6.1 scope, as it is a blocker for the 6.1 scale lab certification.

Changed in fuel:
importance: High → Critical
milestone: 7.0 → 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAIK, the RabbitMQ virtual memory pool may grow uncontrollably, for example if many messages are being published and the consumers cannot keep up. We should investigate node-1 more closely.
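
For that kind of investigation, per-queue statistics usually point at the offender. Below is a minimal sketch (not part of any patch here) of listing the most memory-hungry queues with rabbitmqctl; it assumes rabbitmqctl is in PATH on the affected controller and that list_queues prints tab-separated columns, as it does in this RabbitMQ version.

#!/usr/bin/env python
# Sketch: list the queues holding the most memory on the local rabbit node.
# Assumes rabbitmqctl is in PATH and that the script runs as a user allowed
# to call it (e.g. root on a controller).
import subprocess

def top_queues_by_memory(limit=10):
    # name, messages, memory and consumers are standard list_queues info items
    out = subprocess.check_output(
        ['rabbitmqctl', 'list_queues', 'name', 'messages', 'memory', 'consumers'])
    rows = []
    for line in out.decode().splitlines():
        parts = line.split('\t')
        if len(parts) != 4:
            continue  # skip the "Listing queues ..." / "...done." banner lines
        name, messages, memory, consumers = parts
        rows.append((int(memory), int(messages), int(consumers), name))
    rows.sort(reverse=True)
    return rows[:limit]

if __name__ == '__main__':
    for memory, messages, consumers, name in top_queues_by_memory():
        print('%12d bytes  %8d msgs  %4d consumers  %s'
              % (memory, messages, consumers, name))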

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

"rabbitmqctl report" shows that total memory is higher than high watermark

 {memory,[{total,17770473600},
          {connection_procs,63922904},
          {queue_procs,14427954888},
          {plugins,0},
          {other_proc,0},
          {mnesia,2925304},
          {mgmt_db,0},
          {msg_index,2481160},
          {other_ets,6831936},
          {binary,674800824},
          {code,14650466},
          {atom,2478353},
          {other_system,3655403413}]},
 {alarms,[memory]},
 {listeners,[{clustering,41055,"::"},{amqp,5673,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13535861145},
 {disk_free_limit,50000000},
 {disk_free,52060409856},
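
To make these numbers concrete, here is the arithmetic behind the alarm, using the values from the report above (a quick illustrative check, not output from the node):

# Quick check of the memory alarm condition using the figures from the report above.
total_in_use = 17770473600        # {memory,[{total,...}]}      ~16.5 GiB used by rabbit
vm_memory_limit = 13535861145     # {vm_memory_limit,...}       ~12.6 GiB allowed
high_watermark = 0.4              # {vm_memory_high_watermark,0.4}

# The limit is the watermark fraction of the node's physical RAM, which implies
# roughly 13535861145 / 0.4 ~= 33.8 GB (~31.5 GiB) of RAM on node-1.
implied_ram = vm_memory_limit / high_watermark

print('in use        : %.1f GiB' % (total_in_use / 2.0 ** 30))
print('limit         : %.1f GiB' % (vm_memory_limit / 2.0 ** 30))
print('implied RAM   : %.1f GiB' % (implied_ram / 2.0 ** 30))
print('alarm expected:', total_in_use > vm_memory_limit)   # True -> publishers get blocked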

Log confirms that's the case
/<email address hidden>:memory resource limit alarm set on node 'rabbit@node-1'.
/<email address hidden>:*** Publishers will be blocked until this alarm clears ***

So connections are blocked per "rabbitmqctl report"

Connections:
pid name port peer_port host peer_host ssl peer_cert_subject peer_cert_issuer peer_cert_validity auth_mechanism ssl_protocol ssl_key_exchange ssl_cipher ssl_hash protocol user vhost timeout frame_max channel_max

client_properties recv_oct recv_cnt send_oct send_cnt send_pend state channels
<email address hidden> 127.0.0.1:35246 -> 127.0.0.1:5673 5673 35246 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 62891 1315 13122 1559 0 blocking 1
<email address hidden> 127.0.0.1:35438 -> 127.0.0.1:5673 5673 35438 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 34560 1835 1809297 2966 0 blocking 1
<email address hidden> 127.0.0.1:35948 -> 127.0.0.1:5673 5673 35948 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 69104 3505 3894287 3897 0 blocking 1
<email address hidden> 127.0.0.1:35951 -> 127.0.0.1:5673 5673 35951 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 69120 3509 3861730 3901 0 blocking 1

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looking at newer release notes, I see:
25850 prevent excessive binary memory use when accepting or delivering
      large messages at high speed (since 1.0.0)

https://www.rabbitmq.com/release-notes/README-3.4.0.txt - not suggesting an upgrade right now, but I guess we should move up in the next release

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

https://www.rabbitmq.com/alarms.html - shows that just one node hitting the limit blocks everything:

"When running RabbitMQ in a cluster, the memory and disk alarms are cluster-wide; if one node goes over the limit then all nodes will block connections."
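
On the client side this state is announced via the connection.blocked AMQP extension (the {"connection.blocked",true} capability visible in the connection listing above). Below is a minimal pika sketch of how a publisher could at least detect and log that it has been blocked; the host, port and credentials are placeholders, and the *args signature is used because the callback arguments differ slightly between pika versions.

# Sketch: detect broker-side publish blocking (memory/disk alarm) from a pika client.
# Host, port and credentials are placeholders; adjust them to the environment.
import pika

params = pika.ConnectionParameters(
    host='127.0.0.1', port=5673,
    credentials=pika.PlainCredentials('nova', 'password'))
connection = pika.BlockingConnection(params)

def on_blocked(*args):
    # Fired when the broker sends connection.blocked (an alarm has been set).
    print('broker has blocked publishing:', args)

def on_unblocked(*args):
    # Fired when the alarm clears and connection.unblocked arrives.
    print('broker has unblocked publishing:', args)

connection.add_on_connection_blocked_callback(on_blocked)
connection.add_on_connection_unblocked_callback(on_unblocked)

channel = connection.channel()
channel.queue_declare(queue='blocked-demo')
channel.basic_publish(exchange='', routing_key='blocked-demo', body=b'ping')
connection.sleep(5)   # let the I/O loop deliver any blocked/unblocked notifications
connection.close()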

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

http://stackoverflow.com/questions/27552462/rabbitmq-consumes-memory-and-shuts
"The issue is most likely that one of the OpenStack services are not consuming messages properly and staking them up endlessly"

This could be the root cause, but we cannot confirm it, since the failed tests did not provide publishing/consuming statistics.

summary: - rabbitmq was down on one of controllers during shaker test but there are
- multiple "Timed out waiting for reply to ID" events logged by
- Oslo.messaging after rabbitmq recovered from partitioning
+ [shaker] test failing due to multiple "Timed out waiting for reply to
+ ID" events logged by Oslo.messaging after rabbitmq recovered from
+ partitioning and kept running with some connections got blocked
summary: [shaker] test failing due to multiple "Timed out waiting for reply to
ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with some connections got blocked
+ partitioning and kept running with some connections got blocked because
+ virt memory got exhausted by publishers
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers

Some questions remain unanswered:
1) Why did the swapped virtual memory on node-1 grow from 13.5G to 19G overnight if many connections were blocked? Were more and more connections and messages coming in from unaware OpenStack services, or are we looking at a memory leak in the beam process? How do we debug this?
2) Why do the blocked connections persist instead of being closed?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Deeper investigation has shown that this is a classic deadlock: the rabbit node has no
free resources left and the cluster blocks *all* publishing, by design. The
app thinks "let's wait until the publish block is lifted" and cannot
recover, so it keeps running, continuously reporting "timed out waiting
for reply" errors and keeping the blocked connections open forever.

As we discussed with Roman Podolyaka today:
- we *can* apply a w/a to the AMQP cluster monitoring control plane
(OCF), which amounts to "monitor and restart if something looks really
bad"
- we *cannot* apply a w/a or fix to the app layer, as it is not clear what
Oslo should do when the rabbit cluster has blocked all publishing
because some node(s) have exhausted their memory. Roman thinks that the app
should just wait for the unblock, as it does now, and I almost agree
- we also *cannot* be sure why exactly the rabbit node's memory high
watermark is exceeded: either due to some app-side issue or due to a
rabbit-side memory leak. The good news is that this is *not* important if we
only apply the w/a to the control plane, which is to additionally monitor
whether 'rabbitmqctl list_queues' hangs and restart the rabbit node the same
way we already do when rabbitmqctl list_channels hangs (a rough sketch of
this extra check follows below).
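
A rough sketch of that extra monitor check, written in Python for readability (the actual change goes into the shell-based OCF resource agent); the timeout value and the use of the coreutils timeout wrapper are assumptions for illustration.

# Sketch of the extra monitor logic: treat a hanging 'rabbitmqctl list_queues'
# the same way a hanging list_channels is treated, i.e. as a reason for the
# control plane to restart the local rabbit node.
import os
import subprocess

LIST_QUEUES_TIMEOUT = 180  # seconds; assumption for illustration

def check_list_queues():
    """Return 'ok', 'hang' or 'error' for the local rabbit node."""
    with open(os.devnull, 'wb') as devnull:
        rc = subprocess.call(
            ['timeout', str(LIST_QUEUES_TIMEOUT),
             'rabbitmqctl', 'list_queues', 'name'],
            stdout=devnull, stderr=devnull)
    if rc == 0:
        return 'ok'
    if rc == 124:      # coreutils timeout exits with 124 when it had to kill the command
        return 'hang'
    return 'error'     # e.g. nodedown; also treated as a monitor failure

if __name__ == '__main__':
    state = check_list_queues()
    if state == 'ok':
        print('list_queues responded; node looks healthy')
    else:
        print('list_queues %s; report a failure so Pacemaker restarts the rabbit node' % state)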

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Bogdan Dobrelya (bogdando)
status: Confirmed → Triaged
summary: [shaker] test failing due to multiple "Timed out waiting for reply to
ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with some connections got blocked because
- virt memory got exhausted by publishers
+ partitioning and kept running with AMQP publish got blocked because virt
+ memory got exhausted at rabbit node
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/190137

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node

The resource limit can be checked from the CRM script with this command:
root@node-15:~# rabbitmqctl eval 'rabbit_alarm:get_alarms().'
[{{resource_limit,memory,'rabbit@node-15'},[]}]
...done.
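
The same check is easy to script; here is a small sketch (assuming rabbitmqctl is available locally) that treats any memory resource_limit alarm as a monitor failure.

# Sketch: detect a raised memory alarm on the local rabbit node using the
# 'rabbit_alarm:get_alarms().' eval shown above. Assumes rabbitmqctl is in PATH.
import subprocess

def memory_alarm_raised():
    out = subprocess.check_output(
        ['rabbitmqctl', 'eval', 'rabbit_alarm:get_alarms().'])
    # A raised alarm looks like [{{resource_limit,memory,'rabbit@node-15'},[]}];
    # an empty list ([]) means no alarms are currently set.
    return 'resource_limit,memory' in out.decode()

if __name__ == '__main__':
    if memory_alarm_raised():
        print('memory alarm set; stop the node so Pacemaker can restart it')
    else:
        print('no memory alarm on this node')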

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thank you @Alexander, I added this check to the monitor logic.

So, the current status of the fix is to detect a memory alert set on the node and release it with an immediate stop. The pros: a memory alarm set on one node will no longer block the whole cluster, and that node will hopefully never swap up to 20G, because it will be restarted by OCF.

The cons: if the root cause is not some mysterious RMQ memory leak but pressure from the OpenStack apps side, bringing one node down on a memory alert would likely bring the rest of the nodes down as well, in a cascade, via the same watermark issue.

So someone has to ensure that the app side (Oslo.messaging) can somehow react to the blocked-publish state by temporarily reducing the "AMQP pressure" on the rabbit nodes. I'm adding the MOS Oslo project here as well.

Changed in mos:
milestone: none → 6.1
milestone: 6.1 → 7.0
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → High
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

That was the patch... Here is the bug https://bugs.launchpad.net/oslo.messaging/+bug/1454449

tags: added: non-release
tags: removed: non-release
tags: added: non-release
tags: removed: non-release
tags: added: 6.1-rc2
tags: added: 6.1rc2
removed: 6.1-rc2
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Folks, I tested how exactly blocked publishing works when a rabbit node exceeds its high memory watermark.
Here are the scripts: http://pastebin.com/49JisHRP . If you want to give it a try, make sure you provide the correct IP addresses and nova creds. The sample generator itself is https://github.com/bogdando/ceilometer/raw/rmq_bench/tools/sample-generator.py ;
it requires pika and python-ceilometer to be installed. The test output is at http://pastebin.com/EgYtzYuY

As you can see from the test results, consuming never got blocked on a raised RabbitMQ memory alert; only publishing is blocked. So the uncontrolled swap growth of the beam.smp process is likely related to the OpenStack apps continuing to push on the rabbit cluster after it has declared blocked publishing. The memory pressure may come, for example, from new connections being opened.

The recent patch should give us memory growth dynamics for queues, connections, etc.
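
This asymmetry is easy to check on a throwaway node: force a memory alarm and observe that consuming is still served while publishers end up in the blocking/blocked state shown in the connection listing earlier. A hedged sketch follows; the watermark values, queue name and connection parameters are placeholders, and it should not be run against a production cluster, since the alarm blocks publishing cluster-wide.

# Sketch: force a memory alarm on the local rabbit node and show that consuming
# (basic_get) is still served while the alarm is active. The watermark values,
# queue name and connection parameters are placeholders.
import subprocess
import pika

# Trip the alarm immediately by setting an absurdly low watermark.
subprocess.check_call(['rabbitmqctl', 'set_vm_memory_high_watermark', '0.001'])
try:
    params = pika.ConnectionParameters(
        host='127.0.0.1', port=5673,
        credentials=pika.PlainCredentials('nova', 'password'))
    conn = pika.BlockingConnection(params)
    ch = conn.channel()

    # basic_get keeps returning messages from a pre-filled queue even though the
    # node is far above its (artificially low) memory watermark.
    method, properties, body = ch.basic_get('some_prefilled_queue')
    print('consumed during alarm:', body)
    if method is not None:
        ch.basic_ack(method.delivery_tag)
    conn.close()
finally:
    # Restore the default watermark so the cluster unblocks publishers again.
    subprocess.check_call(['rabbitmqctl', 'set_vm_memory_high_watermark', '0.4'])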

Changed in mos:
importance: High → Critical
milestone: 7.0 → 6.1
Changed in mos:
status: New → Confirmed
description: updated
Revision history for this message
Eugene Bogdanov (ebogdanov) wrote :

Talked with Dan Hata. Based on his input, we have no idea how the product will behave once we apply the existing patch.

This bug is serious, but if there is any chance that this patch makes it even worse, it is not rational to apply it, especially keeping in mind that we are badly time-constrained now.

Based on that, and keeping in mind the QA, MOS, and Fuel teams' buy-in to build the RC2 ISO without this patch, the plan is:

1. Remove this bug from 6.1 RC2 at this point, so I am removing the 6.1rc2 tag.
2. Continue the investigation in parallel. If we have a tangible solution with no risk of regression before the release date, we will consider building a new RC (especially keeping in mind that the fix is very specific and has no dependencies).

tags: added: 6.1
removed: 6.1rc2
tags: removed: 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the previous comments, this bug should not be addressed in the scope of the 6.1 release.

Changed in mos:
importance: Critical → High
milestone: 6.1 → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The latest reproduction attempt with the patches https://review.openstack.org/190137 and https://review.openstack.org/189292 applied brought the following results.

Shaker run : from 22:06:00 to 02:23:05
light rally: from 02:23:05 to 02:53:16
full rally : from 02:53:16 until the end of the logs, ~09:00

- no partitions were detected due to network or CPU spike loads
- no memory alerts were set on the rabbit nodes
- rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF considers a resource failure, so it initiates a restart of the rabbit node
- the complete list of single rabbit node failures during the test runs is at http://pastebin.com/y12DDEx6

Some logs and additional rabbit stats collected by this script http://pastebin.com/sX3DPyRG are attached.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote :
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Re: [Bug 1463433] Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node

Alexander, Bogdan,

Is this the cause of the issue?
https://github.com/rabbitmq/rabbitmq-server/issues/62

On Tue, Jun 16, 2015 at 9:39 AM, Alexander Nevenchannyy
<email address hidden> wrote:
> This mail thread looks like our problem:
> https://groups.google.com/forum/#!topic/rabbitmq-users/Ja0iyfF0Szw

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

rabbitmqctl has never been seen "hanging" since we added the net_ticktime=10 patch to the lab under test (https://review.openstack.org/189292). If you look through the comments of its related bug, you will find many threads resembling the one you mentioned.

Revision history for this message
Dan Hata (dhata) wrote :

Requested by Eugene Bogdanov.

Clear steps to reproduce and expected result vs actual result:
During a Shaker network test, where we test VM-to-VM network throughput on all hosts, the RabbitMQ process's memory usage grows and it eventually blocks all incoming requests.

Running Shaker on a 50-node CentOS environment has reproduced this problem once.

Rough estimate of the probability of a user facing the issue:
We have seen this problem once. We issued a fix, but the root cause did not manifest again. So we have seen it in 1 out of 2 runs.

What is the real user-facing impact / severity, and is there a workaround available?

IMPACT: data plane and control plane requests will fail.
WORKAROUND: restart RabbitMQ.

Can we deliver the fix later and apply it easily on a running env?
Yes, the logic to restart RabbitMQ is already in place.

However, we have now seen RabbitMQ die for a different reason:

rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF considers a resource failure, so it initiates a restart of the rabbit node.
- the complete list of single rabbit node failures during the test runs is at http://pastebin.com/y12DDEx6

Some logs and additional rabbit stats collected by this script http://pastebin.com/sX3DPyRG are attached.

The fix for this is unclear and under investigation.

description: updated
Dan Hata (dhata)
description: updated
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And here are some logs, in addition to comment #22, from the app side showing how it behaves after the AMQP reconnect: http://paste.openstack.org/show/wKE646lb0CD6TJZXkRm8/.

summary: [shaker] test failing due to multiple "Timed out waiting for reply to
- ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with AMQP publish got blocked because virt
- memory got exhausted at rabbit node
+ ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging
+ after AMQP reconnect
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The latest two reproduction attempts have shown nothing more than a duplicate of https://bugs.launchpad.net/mos/+bug/1463802

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Let me summarize the status.
According to the test results for the patches mentioned in https://bugs.launchpad.net/fuel/+bug/1463433/comments/22, this bug is fixed on the underlying AMQP side. This bug may be closed once those patches are merged, as the remaining issues belong to the scope of https://bugs.launchpad.net/mos/+bug/1465757 and https://bugs.launchpad.net/mos/+bug/1463802 and should be tracked there.

no longer affects: mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Removed the MOS project target, as there are other bugs (see above) to track this on the MOS side.

summary: - [shaker] test failing due to multiple "Timed out waiting for reply to
- ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging
- after AMQP reconnect
+ [shaker] test failing when rabbitmq node raises memory alert
tags: added: 6.1-mu-1
tags: added: 6.1
removed: 6.1-mu-1
Revision history for this message
Leontii Istomin (listomin) wrote :

This has been reproduced during the nova boot_and_delete_server_with_secgroups rally scenario. The test failed on iteration 914:
http://paste.openstack.org/show/324559/
Before the tests we ran the "pcs resource unmanage" command for the rabbitmq pacemaker resource.
rabbitmqctl list_queues hangs.

node-288:
from log:
=INFO REPORT==== 29-Jun-2015::00:56:57 ===
rabbit on node 'rabbit@node-404' down
[root@node-288 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-288' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-403','rabbit@node-288']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.
node-403:
from log: http://paste.openstack.org/show/324531/
[root@node-403 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-403' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-404','rabbit@node-288','rabbit@node-403']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.
node-404:
from log:
=INFO REPORT==== 29-Jun-2015::00:56:57 ===
rabbit on node 'rabbit@node-288' down
[root@node-404 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-404' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-403','rabbit@node-404']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.

atop output from the controller nodes at the time RabbitMQ was broken:
atop on node-288: http://paste.openstack.org/show/325652/
atop on node-403: http://paste.openstack.org/show/325653/
atop on node-404: http://paste.openstack.org/show/325654/

** Reason for termination ==
** {function_clause,
       [{rabbit_mirror_queue_slave,forget_sender,[down_from_ch,down_from_ch]},
        {rabbit_mirror_queue_slave,maybe_forget_sender,3},
        {rabbit_mirror_queue_slave,handle_info,2},
        {gen_server2,handle_msg,2},
        {proc_lib,wake_up,3}]}

rabbitmq_statistics is attached
diagnostic snapshot here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-29_15-38-26.tar.xz

Revision history for this message
Leontii Istomin (listomin) wrote :

Reproduced again. Found that the boot_and_delete_server_with_secgroups rally scenario failed.

from rally.log: http://paste.openstack.org/show/331694/
root@node-76:~# awk '$11==500' /var/log/messages | grep "30 16:09:47"
http://paste.openstack.org/show/331697/
from nova-api.log at that time: http://paste.openstack.org/show/331686/

from <email address hidden>:
=INFO REPORT==== 30-Jun-2015::16:09:46 ===
Stopping RabbitMQ

=INFO REPORT==== 30-Jun-2015::16:09:46 ===
stopped TCP Listener on [::]:5673

root@node-76:~# less /var/log/pacemaker.log:
Jun 30 16:09:45 [15257] node-76.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: unknown error (node=node-76.domain.tld, call=171, rc=1, cib-update=267, confirmed=false)
Jun 30 16:09:45 [15257] node-76.domain.tld crmd: notice: process_lrm_event: node-76.domain.tld-p_rabbitmq-server_monitor_30000:171 [ Killed\n ]
...
Jun 30 16:10:00 [15254] node-76.domain.tld pacemaker_remoted: info: log_finished: finished - rsc:p_rabbitmq-server action:stop call_id:271 pid:27592 exit-code:0 exec-time:14222ms queue-time:0ms

atop from node-74 at 16:09:42: http://paste.openstack.org/show/331865/
atop from node-76 at 16:09:42: http://paste.openstack.org/show/331864/
atop from node-81 at 16:09:42: http://paste.openstack.org/show/331866/

Diagnostic snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-01_09-31-23.tar.xz

Revision history for this message
Leontii Istomin (listomin) wrote :

rabbitmq state at 16:08

Revision history for this message
Leontii Istomin (listomin) wrote :

One more update to the previous two comments:
there were two rally tests running on the env at the same time:
boot_and_delete_server_with_secgroups
create-and-upload-volume-to-image

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/190137
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=bf604f80d72f69e771152b153973fa38fa83afd8
Submitter: Jenkins
Branch: master

commit bf604f80d72f69e771152b153973fa38fa83afd8
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 10 13:44:53 2015 +0200

    Restart rabbit if can't list queues or found memory alert

    W/o this fix the dead end situation is possible
    when the rabbit node have no free memory resources left
    and the cluster blocks all publishing, by design.
    But the app thinks "let's wait for the publish block have
    lifted" and cannot recover.

    The workaround is to monitor results
    of crucial rabbitmqctl commands and restart the rabbit node,
    if queues/channels/alarms cannot be listed or if there are
    memory alarms found.
    This is the similar logic as we have for the cases when
    rabbitmqctl list_channels hangs. But the channels check is also
    fixed to verify if the exit code>0 when the rabbit app is
    running.

    Additional checks added to the monitor also require extending
    the timeout window for the monitor action from 60 to 180 seconds.

    Besides that, this patch makes the monitor action to gather the
    rabbit status and runtime stats, like consumed memory by all
    queues of total Mem+Swap, total messages in all queues and
    average queue consumer utilization. This info should help to
    troubleshoot failures better.

    DocImpact: ops guide. If any rabbitmq node exceeded its memory
    threshold the publish became blocked cluster-wide, by design.
    For such cases, this rabbit node would be recovered from the
    raised memory alert and immediately stopped to be restarted
    later by the pacemaker. Otherwise, this blocked publishing state
    might never have been lifted, if the pressure persists from the
    OpenStack apps side.

    Closes-bug: #1463433

    Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/222608

Revision history for this message
Leontii Istomin (listomin) wrote :

Has not been reproduced on 7.0-301 (RC4) and later builds.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/222608
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=599961e60858f182811dd5bc166b4d76e3b3de36
Submitter: Jenkins
Branch: stable/6.1

commit 599961e60858f182811dd5bc166b4d76e3b3de36
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 10 13:44:53 2015 +0200

    Restart rabbit if can't list queues or found memory alert

    W/o this fix the dead end situation is possible
    when the rabbit node have no free memory resources left
    and the cluster blocks all publishing, by design.
    But the app thinks "let's wait for the publish block have
    lifted" and cannot recover.

    The workaround is to monitor results
    of crucial rabbitmqctl commands and restart the rabbit node,
    if queues/channels/alarms cannot be listed or if there are
    memory alarms found.
    This is the similar logic as we have for the cases when
    rabbitmqctl list_channels hangs. But the channels check is also
    fixed to verify if the exit code>0 when the rabbit app is
    running.

    Additional checks added to the monitor also require extending
    the timeout window for the monitor action from 60 to 180 seconds.

    Besides that, this patch makes the monitor action to gather the
    rabbit status and runtime stats, like consumed memory by all
    queues of total Mem+Swap, total messages in all queues and
    average queue consumer utilization. This info should help to
    troubleshoot failures better.

    DocImpact: ops guide. If any rabbitmq node exceeded its memory
    threshold the publish became blocked cluster-wide, by design.
    For such cases, this rabbit node would be recovered from the
    raised memory alert and immediately stopped to be restarted
    later by the pacemaker. Otherwise, this blocked publishing state
    might never have been lifted, if the pressure persists from the
    OpenStack apps side.

    Closes-bug: #1463433

    Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit bf604f80d72f69e771152b153973fa38fa83afd8)

Revision history for this message
Oleksandr Martsyniuk (omartsyniuk) wrote :

The reported problem blocks the performance testing of the Fuel Contrail plugin for 6.1 on the scale lab.
Shaker tests on a 200-node env were affected; Heat returned a 504 error while applying HOT templates due to RabbitMQ failures.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

The fix for this issue is included in 6.1-mu-3, which will be published on 10/26 - please try again after the 6.1-mu-3 release.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this to Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered in the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide of the official product documentation.
