[shaker] test failing when rabbitmq node raises memory alert

Bug #1463433 reported by Leontii Istomin
This bug affects 1 person
Affects              Status         Importance  Assigned to       Milestone
Fuel for OpenStack   Fix Committed  High        Bogdan Dobrelya
5.1.x                Won't Fix      High        Denis Meltsaykin
6.0.x                Won't Fix      High        Denis Meltsaykin
6.1.x                Fix Released   High        Michal Rostecki
7.0.x                Fix Released   High        Bogdan Dobrelya

Bug Description

Steps to reproduce:

1. Run Shaker tests. (For an explanation of what Shaker is, see the User Impact section below.)

First, RabbitMQ experienced a network partition. After it recovered from the partition, it started consuming RAM on one of the nodes. Once it had consumed tens of GB of RAM, it stopped accepting new messages from OpenStack services. As a result, OpenStack failed all incoming requests. The issue did not go away after the tests finished.

Conditions for reproduction:
The issue has been reproduced only once so far. Another run finished successfully, without RabbitMQ consuming a lot of memory or being stuck for a long time (see comment #22 below).

User impact:

Shaker creates VMs in batches of 8 at a time and tests network throughput between them. The generated traffic is rather close to the network throughput limit of the lab (a little less than 10G). When traffic is cross-network (i.e. it flows between VMs located in different Neutron networks), it always goes through the controllers. Sometimes this hits RabbitMQ and it switches to an inoperable state. The rate of messages hitting RabbitMQ is in the hundreds per second.

From the user's point of view, the cloud does not work until the issue is resolved.

Workaround:

Restart all RabbitMQ nodes. Once RabbitMQ is operable again (several minutes after the restart), the cloud should start working properly.

Current plan:

Reproduce the issue once more and investigate it more thoroughly.

Original description by Leontiy Istomin
=======================================

During a Shaker test (http://pyshaker.readthedocs.org/en/latest/index.html) we found that ksoftirqd consumes a lot of CPU:
http://paste.openstack.org/show/277774/
atop SHIFT+P ksoftirqd:
http://paste.openstack.org/show/277891/

At the time rabbitmq on this controller node (node-49) was down:
from node-1
=INFO REPORT==== 8-Jun-2015::23:48:07 ===
rabbit on node 'rabbit@node-49' down

from node-44
=INFO REPORT==== 8-Jun-2015::23:46:55 ===
rabbit on node 'rabbit@node-49' down

Configuration:
Baremetal,Centos,fedora_kernel,IBP,HA, Neutron-vlan,Ceph-all,Nova-debug,Nova-quotas, 6.1-521
Controllers:3 Computes:47

The net_ticktime parameter has been added:
http://paste.openstack.org/show/278020/

api: '1.0'
astute_sha: 7766818f079881e2dbeedb34e1f67e517ed7d479
auth_required: true
build_id: 2015-06-08_06-13-27
build_number: '521'
feature_groups:
- mirantis
fuel-library_sha: f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d
fuel-ostf_sha: 7c938648a246e0311d05e2372ff43ef1eb2e2761
fuelmain_sha: bcc909ffc5dd5156ba54cae348b6a07c1b607b24
nailgun_sha: 4340d55c19029394cd5610b0e0f56d6cb8cb661b
openstack_version: 2014.2.2-6.1
production: docker
python-fuelclient_sha: 4fc55db0265bbf39c369df398b9dc7d6469ba13b
release: '6.1'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-09_09-55-58.tar.xz

Tags: scale 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

As far as I can see, the timestamp of the first "timed out waiting for reply" error fits into the described failover scope, see
https://bugs.launchpad.net/fuel/+bug/1460762/comments/27.
So we should debug this case more deeply; hopefully some developer from the MOS Oslo team can help us here.
As for the underlying AMQP layer, as I mentioned earlier, I see no issues there besides a clean failover.
My guess is that we have a classic AP+/C- case here: at least two AMQP nodes were available (A+), the partition was recovered (P+), some reply_queues were lost (C-), but the app layer failed to survive such a loss.

description: updated
Changed in fuel:
status: New → Confirmed
summary: - rabbitmq was down on one of controllers
+ rabbitmq was down on one of controllers during shaker test
Changed in fuel:
milestone: none → 7.0
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Leontii Istomin (listomin) wrote : Re: rabbitmq was down on one of controllers during shaker test

This may be the same issue as https://bugs.launchpad.net/fuel/+bug/1460762

description: updated
summary: - rabbitmq was down on one of controllers during shaker test
+ rabbitmq was down on one of controllers during shaker test but there are
+ multiple "Timed out waiting for reply to ID" events logged by
+ Oslo.messaging
summary: rabbitmq was down on one of controllers during shaker test but there are
multiple "Timed out waiting for reply to ID" events logged by
- Oslo.messaging
+ Oslo.messaging after rabbitmq recovered from partitioning
description: updated
Revision history for this message
Dina Belova (dbelova) wrote : Re: rabbitmq was down on one of controllers during shaker test but there are multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning

Is that true that we assign this to 7.0?

Revision history for this message
Leontii Istomin (listomin) wrote :

[root@node-1 ~]# rabbitmqctl list_queues
Listing queues ...
Error: unable to connect to node 'rabbit@node-1': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@node-1']

rabbit@node-1:
  * connected to epmd (port 4369) on node-1
  * node rabbit@node-1 up, 'rabbit' application running

current node details:
- node name: 'rabbitmqctl22420@node-1'
- home dir: /var/lib/rabbitmq
- cookie hash: soeIWU2jk2YNseTyDSlsEA==

You have new mail in /var/spool/mail/root
[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-44','rabbit@node-49']}]},
 {running_nodes,['rabbit@node-49','rabbit@node-44','rabbit@node-1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
...done.

description: updated
Revision history for this message
Leontii Istomin (listomin) wrote :

Roman Podoliaka has found that rabbitmq keeps 15G of memory on node-1:
http://xsnippet.org/360727/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raising to Critical and returning this to the 6.1 scope, as it is a blocker for the 6.1 scale lab certification.

Changed in fuel:
importance: High → Critical
milestone: 7.0 → 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAIK, the RabbitMQ virtual memory pool may grow uncontrollably, for example if many messages are being published and the consumers cannot keep up. We should investigate node-1 more closely.
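
For that kind of investigation, per-queue statistics usually point at the offender. Below is a minimal sketch (not part of any patch here) of listing the most memory-hungry queues with rabbitmqctl; it assumes rabbitmqctl is in PATH on the affected controller and that list_queues prints tab-separated columns, as it does in this RabbitMQ version.

#!/usr/bin/env python
# Sketch: list the queues holding the most memory on the local rabbit node.
# Assumes rabbitmqctl is in PATH and that the script runs as a user allowed
# to call it (e.g. root on a controller).
import subprocess

def top_queues_by_memory(limit=10):
    # name, messages, memory and consumers are standard list_queues info items
    out = subprocess.check_output(
        ['rabbitmqctl', 'list_queues', 'name', 'messages', 'memory', 'consumers'])
    rows = []
    for line in out.decode().splitlines():
        parts = line.split('\t')
        if len(parts) != 4:
            continue  # skip the "Listing queues ..." / "...done." banner lines
        name, messages, memory, consumers = parts
        rows.append((int(memory), int(messages), int(consumers), name))
    rows.sort(reverse=True)
    return rows[:limit]

if __name__ == '__main__':
    for memory, messages, consumers, name in top_queues_by_memory():
        print('%12d bytes  %8d msgs  %4d consumers  %s'
              % (memory, messages, consumers, name))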

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

"rabbitmqctl report" shows that total memory is higher than high watermark

 {memory,[{total,17770473600},
          {connection_procs,63922904},
          {queue_procs,14427954888},
          {plugins,0},
          {other_proc,0},
          {mnesia,2925304},
          {mgmt_db,0},
          {msg_index,2481160},
          {other_ets,6831936},
          {binary,674800824},
          {code,14650466},
          {atom,2478353},
          {other_system,3655403413}]},
 {alarms,[memory]},
 {listeners,[{clustering,41055,"::"},{amqp,5673,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,13535861145},
 {disk_free_limit,50000000},
 {disk_free,52060409856},
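
To make these numbers concrete, here is the arithmetic behind the alarm, using the values from the report above (a quick illustrative check, not output from the node):

# Quick check of the memory alarm condition using the figures from the report above.
total_in_use = 17770473600        # {memory,[{total,...}]}      ~16.5 GiB used by rabbit
vm_memory_limit = 13535861145     # {vm_memory_limit,...}       ~12.6 GiB allowed
high_watermark = 0.4              # {vm_memory_high_watermark,0.4}

# The limit is the watermark fraction of the node's physical RAM, which implies
# roughly 13535861145 / 0.4 ~= 33.8 GB (~31.5 GiB) of RAM on node-1.
implied_ram = vm_memory_limit / high_watermark

print('in use        : %.1f GiB' % (total_in_use / 2.0 ** 30))
print('limit         : %.1f GiB' % (vm_memory_limit / 2.0 ** 30))
print('implied RAM   : %.1f GiB' % (implied_ram / 2.0 ** 30))
print('alarm expected:', total_in_use > vm_memory_limit)   # True -> publishers get blocked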

Log confirms that's the case
/<email address hidden>:memory resource limit alarm set on node 'rabbit@node-1'.
/<email address hidden>:*** Publishers will be blocked until this alarm clears ***

So connections are blocked per "rabbitmqctl report"

Connections:
pid name port peer_port host peer_host ssl peer_cert_subject peer_cert_issuer peer_cert_validity auth_mechanism ssl_protocol ssl_key_exchange ssl_cipher ssl_hash protocol user vhost timeout frame_max channel_max

client_properties recv_oct recv_cnt send_oct send_cnt send_pend state channels
<email address hidden> 127.0.0.1:35246 -> 127.0.0.1:5673 5673 35246 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 62891 1315 13122 1559 0 blocking 1
<email address hidden> 127.0.0.1:35438 -> 127.0.0.1:5673 5673 35438 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 34560 1835 1809297 2966 0 blocking 1
<email address hidden> 127.0.0.1:35948 -> 127.0.0.1:5673 5673 35948 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 69104 3505 3894287 3897 0 blocking 1
<email address hidden> 127.0.0.1:35951 -> 127.0.0.1:5673 5673 35951 127.0.0.1 127.0.0.1 false AMQPLAIN {0,9,1} nova / 60 131072 65535 [{"product","py-amqp"},{"product_version","1.4.5"},{"capabilities",[{"connection.blocked",true},{"consumer_cancel_notify",true}]}] 69120 3509 3861730 3901 0 blocking 1

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looking at newer release notes, I see:
25850 prevent excessive binary memory use when accepting or delivering
      large messages at high speed (since 1.0.0)

https://www.rabbitmq.com/release-notes/README-3.4.0.txt - not suggesting an upgrade right now, but I guess we should move up in the next release

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

https://www.rabbitmq.com/alarms.html - shows that just one node hitting the limit blocks everything:

"When running RabbitMQ in a cluster, the memory and disk alarms are cluster-wide; if one node goes over the limit then all nodes will block connections."
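
On the client side this state is announced via the connection.blocked AMQP extension (the {"connection.blocked",true} capability visible in the connection listing above). Below is a minimal pika sketch of how a publisher could at least detect and log that it has been blocked; the host, port and credentials are placeholders, and the *args signature is used because the callback arguments differ slightly between pika versions.

# Sketch: detect broker-side publish blocking (memory/disk alarm) from a pika client.
# Host, port and credentials are placeholders; adjust them to the environment.
import pika

params = pika.ConnectionParameters(
    host='127.0.0.1', port=5673,
    credentials=pika.PlainCredentials('nova', 'password'))
connection = pika.BlockingConnection(params)

def on_blocked(*args):
    # Fired when the broker sends connection.blocked (an alarm has been set).
    print('broker has blocked publishing:', args)

def on_unblocked(*args):
    # Fired when the alarm clears and connection.unblocked arrives.
    print('broker has unblocked publishing:', args)

connection.add_on_connection_blocked_callback(on_blocked)
connection.add_on_connection_unblocked_callback(on_unblocked)

channel = connection.channel()
channel.queue_declare(queue='blocked-demo')
channel.basic_publish(exchange='', routing_key='blocked-demo', body=b'ping')
connection.sleep(5)   # let the I/O loop deliver any blocked/unblocked notifications
connection.close()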

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

http://stackoverflow.com/questions/27552462/rabbitmq-consumes-memory-and-shuts
"The issue is most likely that one of the OpenStack services are not consuming messages properly and staking them up endlessly"

This could be the root cause, but we cannot confirm it, since the failed tests did not provide publishing/consuming statistics.

summary: - rabbitmq was down on one of controllers during shaker test but there are
- multiple "Timed out waiting for reply to ID" events logged by
- Oslo.messaging after rabbitmq recovered from partitioning
+ [shaker] test failing due to multiple "Timed out waiting for reply to
+ ID" events logged by Oslo.messaging after rabbitmq recovered from
+ partitioning and kept running with some connections got blocked
summary: [shaker] test failing due to multiple "Timed out waiting for reply to
ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with some connections got blocked
+ partitioning and kept running with some connections got blocked because
+ virt memory got exhausted by publishers
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with some connections got blocked because virt memory got exhausted by publishers

Some questions remain unanswered:
1) Why did the swapped virtual memory on node-1 grow from 13.5G to 19G overnight if many connections were blocked? Were more and more connections and messages coming in from unaware OpenStack services, or are we looking at a memory leak in the beam process? How do we debug this?
2) Why do the blocked connections persist instead of being closed?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Deeper investigation has shown that this is a classic deadlock: the rabbit node has no
free resources left and the cluster blocks *all* publishing, by design. The
app thinks "let's wait until the publish block is lifted" and cannot
recover, so it keeps running, continuously reporting "timed out waiting
for reply" errors and keeping the blocked connections open forever.

As we discussed with Roman Podolyaka today:
- we *can* apply a w/a to the AMQP cluster monitoring control plane
(OCF), which amounts to "monitor and restart if something looks really
bad"
- we *cannot* apply a w/a or fix to the app layer, as it is not clear what
Oslo should do when the rabbit cluster has blocked all publishing
because some node(s) have exhausted their memory. Roman thinks that the app
should just wait for the unblock, as it does now, and I almost agree
- we also *cannot* be sure why exactly the rabbit node's memory high
watermark is exceeded: either due to some app-side issue or due to a
rabbit-side memory leak. The good news is that this is *not* important if we
only apply the w/a to the control plane, which is to additionally monitor
whether 'rabbitmqctl list_queues' hangs and restart the rabbit node the same
way we already do when rabbitmqctl list_channels hangs (a rough sketch of
this extra check follows below).
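
A rough sketch of that extra monitor check, written in Python for readability (the actual change goes into the shell-based OCF resource agent); the timeout value and the use of the coreutils timeout wrapper are assumptions for illustration.

# Sketch of the extra monitor logic: treat a hanging 'rabbitmqctl list_queues'
# the same way a hanging list_channels is treated, i.e. as a reason for the
# control plane to restart the local rabbit node.
import os
import subprocess

LIST_QUEUES_TIMEOUT = 180  # seconds; assumption for illustration

def check_list_queues():
    """Return 'ok', 'hang' or 'error' for the local rabbit node."""
    with open(os.devnull, 'wb') as devnull:
        rc = subprocess.call(
            ['timeout', str(LIST_QUEUES_TIMEOUT),
             'rabbitmqctl', 'list_queues', 'name'],
            stdout=devnull, stderr=devnull)
    if rc == 0:
        return 'ok'
    if rc == 124:      # coreutils timeout exits with 124 when it had to kill the command
        return 'hang'
    return 'error'     # e.g. nodedown; also treated as a monitor failure

if __name__ == '__main__':
    state = check_list_queues()
    if state == 'ok':
        print('list_queues responded; node looks healthy')
    else:
        print('list_queues %s; report a failure so Pacemaker restarts the rabbit node' % state)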

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → Bogdan Dobrelya (bogdando)
status: Confirmed → Triaged
summary: [shaker] test failing due to multiple "Timed out waiting for reply to
ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with some connections got blocked because
- virt memory got exhausted by publishers
+ partitioning and kept running with AMQP publish got blocked because virt
+ memory got exhausted at rabbit node
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/190137

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote : Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node

The resource limit can be checked from the CRM script with this command:
root@node-15:~# rabbitmqctl eval 'rabbit_alarm:get_alarms().'
[{{resource_limit,memory,'rabbit@node-15'},[]}]
...done.
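
The same check is easy to script; here is a small sketch (assuming rabbitmqctl is available locally) that treats any memory resource_limit alarm as a monitor failure.

# Sketch: detect a raised memory alarm on the local rabbit node using the
# 'rabbit_alarm:get_alarms().' eval shown above. Assumes rabbitmqctl is in PATH.
import subprocess

def memory_alarm_raised():
    out = subprocess.check_output(
        ['rabbitmqctl', 'eval', 'rabbit_alarm:get_alarms().'])
    # A raised alarm looks like [{{resource_limit,memory,'rabbit@node-15'},[]}];
    # an empty list ([]) means no alarms are currently set.
    return 'resource_limit,memory' in out.decode()

if __name__ == '__main__':
    if memory_alarm_raised():
        print('memory alarm set; stop the node so Pacemaker can restart it')
    else:
        print('no memory alarm on this node')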

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Thank you @Alexander, I added this check to the monitor logic.

So, the current status of the fix is to detect a memory alert set on the node and release it with an immediate stop. The pros: a memory alarm set on one node will no longer block the whole cluster, and that node will hopefully never swap up to 20G, because it will be restarted by OCF.

The cons: if the root cause is not some mysterious RMQ memory leak but pressure from the OpenStack apps side, bringing one node down on a memory alert would likely bring the rest of the nodes down as well, in a cascade, via the same watermark issue.

So someone has to ensure that the app side (Oslo.messaging) can somehow react to the blocked-publish state by temporarily reducing the "AMQP pressure" on the rabbit nodes. I'm adding the MOS Oslo project here as well.

Changed in mos:
milestone: none → 6.1
milestone: 6.1 → 7.0
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → High
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

That was the patch... Here is the bug https://bugs.launchpad.net/oslo.messaging/+bug/1454449

tags: added: non-release
tags: removed: non-release
tags: added: non-release
tags: removed: non-release
tags: added: 6.1-rc2
tags: added: 6.1rc2
removed: 6.1-rc2
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Folks, I tested how exactly blocked publishing works when a rabbit node exceeds its high memory watermark.
Here are the scripts: http://pastebin.com/49JisHRP . If you want to give it a try, make sure you provide the correct IP addresses and nova creds. The sample generator itself is https://github.com/bogdando/ceilometer/raw/rmq_bench/tools/sample-generator.py ;
it requires pika and python-ceilometer to be installed. The test output is at http://pastebin.com/EgYtzYuY

As you can see from the test results, consuming never got blocked on a raised RabbitMQ memory alert; only publishing is blocked. So the uncontrolled swap growth of the beam.smp process is likely related to the OpenStack apps continuing to push on the rabbit cluster after it has declared blocked publishing. The memory pressure may come, for example, from new connections being opened.

The recent patch should give us memory growth dynamics for queues, connections, etc.
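
This asymmetry is easy to check on a throwaway node: force a memory alarm and observe that consuming is still served while publishers end up in the blocking/blocked state shown in the connection listing earlier. A hedged sketch follows; the watermark values, queue name and connection parameters are placeholders, and it should not be run against a production cluster, since the alarm blocks publishing cluster-wide.

# Sketch: force a memory alarm on the local rabbit node and show that consuming
# (basic_get) is still served while the alarm is active. The watermark values,
# queue name and connection parameters are placeholders.
import subprocess
import pika

# Trip the alarm immediately by setting an absurdly low watermark.
subprocess.check_call(['rabbitmqctl', 'set_vm_memory_high_watermark', '0.001'])
try:
    params = pika.ConnectionParameters(
        host='127.0.0.1', port=5673,
        credentials=pika.PlainCredentials('nova', 'password'))
    conn = pika.BlockingConnection(params)
    ch = conn.channel()

    # basic_get keeps returning messages from a pre-filled queue even though the
    # node is far above its (artificially low) memory watermark.
    method, properties, body = ch.basic_get('some_prefilled_queue')
    print('consumed during alarm:', body)
    if method is not None:
        ch.basic_ack(method.delivery_tag)
    conn.close()
finally:
    # Restore the default watermark so the cluster unblocks publishers again.
    subprocess.check_call(['rabbitmqctl', 'set_vm_memory_high_watermark', '0.4'])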

Changed in mos:
importance: High → Critical
milestone: 7.0 → 6.1
Changed in mos:
status: New → Confirmed
description: updated
Revision history for this message
Eugene Bogdanov (ebogdanov) wrote :

Talked with Dan Hata. Based on his input, we have no idea how the product will behave once we apply the existing patch.

This bug is serious, but if there is any chance that this patch makes it even worse, it is not rational to apply it, especially keeping in mind that we are badly time-constrained now.

Based on that, and keeping in mind the QA, MOS, and Fuel teams' buy-in to build the RC2 ISO without this patch, the plan is:

1. Remove this bug from 6.1 RC2 at this point, so I am removing the 6.1rc2 tag.
2. Continue the investigation in parallel. If we have a tangible solution with no risk of regression before the release date, we will consider building a new RC (especially keeping in mind that the fix is very specific and has no dependencies).

tags: added: 6.1
removed: 6.1rc2
tags: removed: 6.1
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

According to the previous comments, this bug should not be addressed in the scope of the 6.1 release.

Changed in mos:
importance: Critical → High
milestone: 6.1 → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The latest reproduction attempt with the patches https://review.openstack.org/190137 and https://review.openstack.org/189292 applied brought the following results.

Shaker run : from 22:06:00 to 02:23:05
light rally: from 02:23:05 to 02:53:16
full rally : from 02:53:16 until the end of the logs, ~09:00

- no partitions were detected due to network or CPU spike loads
- no memory alerts were set on the rabbit nodes
- rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF considers a resource failure, so it initiates a restart of the rabbit node
- the complete list of single rabbit node failures during the test runs is at http://pastebin.com/y12DDEx6

Some logs and additional rabbit stats collected by this script http://pastebin.com/sX3DPyRG are attached.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Alexander Nevenchannyy (anevenchannyy) wrote :
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Re: [Bug 1463433] Re: [shaker] test failing due to multiple "Timed out waiting for reply to ID" events logged by Oslo.messaging after rabbitmq recovered from partitioning and kept running with AMQP publish got blocked because virt memory got exhausted at rabbit node

Alexander, Bogdan,

Is this the cause of the issue?
https://github.com/rabbitmq/rabbitmq-server/issues/62

On Tue, Jun 16, 2015 at 9:39 AM, Alexander Nevenchannyy
<email address hidden> wrote:
> This mail thread looks like our problem:
> https://groups.google.com/forum/#!topic/rabbitmq-users/Ja0iyfF0Szw

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

rabbitmqctl has never been seen "hanging" since we added the net_ticktime=10 patch to the lab under test (https://review.openstack.org/189292). If you look through the comments of its related bug, you will find many threads resembling the one you mentioned.

Revision history for this message
Dan Hata (dhata) wrote :

Requested by Eugene Bogdanov.

Clear steps to reproduce and expected result vs actual result:
During a Shaker network test, where we test VM-to-VM network throughput on all hosts, the RabbitMQ process's memory usage grows and it eventually blocks all incoming requests.

Running Shaker on a 50-node CentOS environment has reproduced this problem once.

Rough estimate of the probability of a user facing the issue:
We have seen this problem once. We issued a fix, but the root cause did not manifest again. So we have seen it in 1 out of 2 runs.

What is the real user-facing impact / severity, and is there a workaround available?

IMPACT: data plane and control plane requests will fail.
WORKAROUND: restart RabbitMQ.

Can we deliver the fix later and apply it easily on a running env?
Yes, the logic to restart RabbitMQ is already in place.

However, we have now seen RabbitMQ die for a different reason:

rabbitmqctl never hung on the nodes, but it reported several nodedown errors, which OCF considers a resource failure, so it initiates a restart of the rabbit node.
- the complete list of single rabbit node failures during the test runs is at http://pastebin.com/y12DDEx6

Some logs and additional rabbit stats collected by this script http://pastebin.com/sX3DPyRG are attached.

The fix for this is unclear and under investigation.

description: updated
Dan Hata (dhata)
description: updated
description: updated
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And here are some logs, in addition to comment #22, from the app side showing how it behaves after the AMQP reconnect: http://paste.openstack.org/show/wKE646lb0CD6TJZXkRm8/.

summary: [shaker] test failing due to multiple "Timed out waiting for reply to
- ID" events logged by Oslo.messaging after rabbitmq recovered from
- partitioning and kept running with AMQP publish got blocked because virt
- memory got exhausted at rabbit node
+ ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging
+ after AMQP reconnect
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The latest two reproduction attempts have shown nothing more than a duplicate of https://bugs.launchpad.net/mos/+bug/1463802

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Let me summarize the status.
According to the test results for the patches mentioned in https://bugs.launchpad.net/fuel/+bug/1463433/comments/22, this bug is fixed on the underlying AMQP side. This bug may be closed once those patches are merged, as the remaining issues belong to the scope of https://bugs.launchpad.net/mos/+bug/1465757 and https://bugs.launchpad.net/mos/+bug/1463802 and should be tracked there.

no longer affects: mos
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Removed the MOS project target, as there are other bugs (see above) to track this on the MOS side.

summary: - [shaker] test failing due to multiple "Timed out waiting for reply to
- ID"/"Queue not found:Basic.consume" events logged by Oslo.messaging
- after AMQP reconnect
+ [shaker] test failing when rabbitmq node raises memory alert
tags: added: 6.1-mu-1
tags: added: 6.1
removed: 6.1-mu-1
Revision history for this message
Leontii Istomin (listomin) wrote :

This has been reproduced during the nova boot_and_delete_server_with_secgroups rally scenario. The test failed on iteration 914:
http://paste.openstack.org/show/324559/
Before the tests we ran the "pcs resource unmanage" command for the rabbitmq pacemaker resource.
rabbitmqctl list_queues hangs.

node-288:
from log:
=INFO REPORT==== 29-Jun-2015::00:56:57 ===
rabbit on node 'rabbit@node-404' down
[root@node-288 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-288' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-403','rabbit@node-288']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.
node-403:
from log: http://paste.openstack.org/show/324531/
[root@node-403 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-403' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-404','rabbit@node-288','rabbit@node-403']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.
node-404:
from log:
=INFO REPORT==== 29-Jun-2015::00:56:57 ===
rabbit on node 'rabbit@node-288' down
[root@node-404 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-404' ...
[{nodes,[{disc,['rabbit@node-288','rabbit@node-403','rabbit@node-404']}]},
{running_nodes,['rabbit@node-403','rabbit@node-404']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.

atop output from the controller nodes at the time RabbitMQ was broken:
atop on node-288: http://paste.openstack.org/show/325652/
atop on node-403: http://paste.openstack.org/show/325653/
atop on node-404: http://paste.openstack.org/show/325654/

** Reason for termination ==
** {function_clause,
       [{rabbit_mirror_queue_slave,forget_sender,[down_from_ch,down_from_ch]},
        {rabbit_mirror_queue_slave,maybe_forget_sender,3},
        {rabbit_mirror_queue_slave,handle_info,2},
        {gen_server2,handle_msg,2},
        {proc_lib,wake_up,3}]}

rabbitmq_statistics is attached
diagnostic snapshot here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-29_15-38-26.tar.xz

Revision history for this message
Leontii Istomin (listomin) wrote :

Reproduced again. Found that the boot_and_delete_server_with_secgroups rally scenario failed.

from rally.log: http://paste.openstack.org/show/331694/
root@node-76:~# awk '$11==500' /var/log/messages | grep "30 16:09:47"
http://paste.openstack.org/show/331697/
from nova-api.log at that time: http://paste.openstack.org/show/331686/

from <email address hidden>:
=INFO REPORT==== 30-Jun-2015::16:09:46 ===
Stopping RabbitMQ

=INFO REPORT==== 30-Jun-2015::16:09:46 ===
stopped TCP Listener on [::]:5673

root@node-76:~# less /var/log/pacemaker.log:
Jun 30 16:09:45 [15257] node-76.domain.tld crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: unknown error (node=node-76.domain.tld, call=171, rc=1, cib-update=267, confirmed=false)
Jun 30 16:09:45 [15257] node-76.domain.tld crmd: notice: process_lrm_event: node-76.domain.tld-p_rabbitmq-server_monitor_30000:171 [ Killed\n ]
...
Jun 30 16:10:00 [15254] node-76.domain.tld pacemaker_remoted: info: log_finished: finished - rsc:p_rabbitmq-server action:stop call_id:271 pid:27592 exit-code:0 exec-time:14222ms queue-time:0ms

atop from node-74 at 16:09:42: http://paste.openstack.org/show/331865/
atop from node-76 at 16:09:42: http://paste.openstack.org/show/331864/
atop from node-81 at 16:09:42: http://paste.openstack.org/show/331866/

Diagnostic snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-07-01_09-31-23.tar.xz

Revision history for this message
Leontii Istomin (listomin) wrote :

rabbitmq state at 16:08

Revision history for this message
Leontii Istomin (listomin) wrote :

One more update to the previous two comments:
there were two rally tests running on the env at the same time:
boot_and_delete_server_with_secgroups
create-and-upload-volume-to-image

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/190137
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=bf604f80d72f69e771152b153973fa38fa83afd8
Submitter: Jenkins
Branch: master

commit bf604f80d72f69e771152b153973fa38fa83afd8
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 10 13:44:53 2015 +0200

    Restart rabbit if can't list queues or found memory alert

    W/o this fix the dead end situation is possible
    when the rabbit node have no free memory resources left
    and the cluster blocks all publishing, by design.
    But the app thinks "let's wait for the publish block have
    lifted" and cannot recover.

    The workaround is to monitor results
    of crucial rabbitmqctl commands and restart the rabbit node,
    if queues/channels/alarms cannot be listed or if there are
    memory alarms found.
    This is the similar logic as we have for the cases when
    rabbitmqctl list_channels hangs. But the channels check is also
    fixed to verify if the exit code>0 when the rabbit app is
    running.

    Additional checks added to the monitor also require extending
    the timeout window for the monitor action from 60 to 180 seconds.

    Besides that, this patch makes the monitor action to gather the
    rabbit status and runtime stats, like consumed memory by all
    queues of total Mem+Swap, total messages in all queues and
    average queue consumer utilization. This info should help to
    troubleshoot failures better.

    DocImpact: ops guide. If any rabbitmq node exceeded its memory
    threshold the publish became blocked cluster-wide, by design.
    For such cases, this rabbit node would be recovered from the
    raised memory alert and immediately stopped to be restarted
    later by the pacemaker. Otherwise, this blocked publishing state
    might never have been lifted, if the pressure persists from the
    OpenStack apps side.

    Closes-bug: #1463433

    Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/222608

Revision history for this message
Leontii Istomin (listomin) wrote :

Has not been reproduced on 7.0-301 (RC4) and later builds.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/222608
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=599961e60858f182811dd5bc166b4d76e3b3de36
Submitter: Jenkins
Branch: stable/6.1

commit 599961e60858f182811dd5bc166b4d76e3b3de36
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 10 13:44:53 2015 +0200

    Restart rabbit if can't list queues or found memory alert

    W/o this fix the dead end situation is possible
    when the rabbit node have no free memory resources left
    and the cluster blocks all publishing, by design.
    But the app thinks "let's wait for the publish block have
    lifted" and cannot recover.

    The workaround is to monitor results
    of crucial rabbitmqctl commands and restart the rabbit node,
    if queues/channels/alarms cannot be listed or if there are
    memory alarms found.
    This is the similar logic as we have for the cases when
    rabbitmqctl list_channels hangs. But the channels check is also
    fixed to verify if the exit code>0 when the rabbit app is
    running.

    Additional checks added to the monitor also require extending
    the timeout window for the monitor action from 60 to 180 seconds.

    Besides that, this patch makes the monitor action to gather the
    rabbit status and runtime stats, like consumed memory by all
    queues of total Mem+Swap, total messages in all queues and
    average queue consumer utilization. This info should help to
    troubleshoot failures better.

    DocImpact: ops guide. If any rabbitmq node exceeded its memory
    threshold the publish became blocked cluster-wide, by design.
    For such cases, this rabbit node would be recovered from the
    raised memory alert and immediately stopped to be restarted
    later by the pacemaker. Otherwise, this blocked publishing state
    might never have been lifted, if the pressure persists from the
    OpenStack apps side.

    Closes-bug: #1463433

    Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit bf604f80d72f69e771152b153973fa38fa83afd8)

Revision history for this message
Oleksandr Martsyniuk (omartsyniuk) wrote :

The reported problem blocks the performance testing of the Fuel Contrail plugin for 6.1 on the scale lab.
Shaker tests on a 200-node env were affected; Heat returned a 504 error while applying HOT templates due to RabbitMQ failures.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

The fix for this issue is included in 6.1-mu-3, which will be published on 10/26 - please try again after the 6.1-mu-3 release.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this to Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered in the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide of the official product documentation.
