[shaker] test failing when rabbitmq node raises memory alert
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
5.1.x | Won't Fix | High | Denis Meltsaykin |
6.0.x | Won't Fix | High | Denis Meltsaykin |
6.1.x | Fix Released | High | Michal Rostecki |
7.0.x | Fix Released | High | Bogdan Dobrelya |
Bug Description
Steps to reproduce:
1. Run Shaker tests. (For an explanation of what Shaker is, see the User impact section below.)
First, RabbitMQ experienced a network partition. After it recovered from the partition, it started to consume RAM on one of the nodes. Once it had consumed tens of gigabytes of RAM, it stopped accepting new messages from OpenStack services. As a result, OpenStack failed all incoming requests. The issue did not go away after the tests finished.
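The behaviour described above is consistent with RabbitMQ's memory alarm: once a node crosses its memory high watermark, it blocks connections that publish messages until memory is freed. As a rough illustration only, the sketch below polls the management HTTP API for that alarm state; the node name, port and credentials are assumptions about this lab, not values taken from the report, and the management plugin is assumed to be enabled.

```python
#!/usr/bin/env python
# Hedged sketch: report RabbitMQ nodes whose memory alarm has fired
# (the state in which publishing connections get blocked).
# Assumes the management plugin listens on port 15672 and that the
# credentials below are valid for this deployment.
import requests

MGMT_URL = "http://node-49:15672/api/nodes"  # hypothetical controller name
AUTH = ("guest", "guest")                    # placeholder credentials

def check_memory_alarms():
    nodes = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
    for node in nodes:
        print("%s: mem_used=%s mem_limit=%s mem_alarm=%s" % (
            node.get("name"), node.get("mem_used"),
            node.get("mem_limit"), node.get("mem_alarm")))

if __name__ == "__main__":
    check_memory_alarms()
```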
Conditions for reproduction:
The issue has been reproduced only once so far. Another run finished successfully, without RabbitMQ consuming a lot of memory or getting stuck for a long time (see comment #22 below).
User impact:
Shaker creates VMs in batches of 8 at a time and tests network throughput between them. The traffic created is rather close to the network throughput limit of the lab (a little less than 10G). When traffic is cross-network (i.e. it flows between VMs located in different Neutron networks), it always goes through the controllers. Sometimes that hits RabbitMQ and switches it into an inoperable state. The rate of messages hitting RabbitMQ is in the hundreds per second.
From the user's point of view, the cloud is not working until the issue is healed.
Workaround:
Restart all RabbitMQ nodes. Once RabbitMQ is operable again (several minutes after the restart), the cloud should start working properly.
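A minimal sketch of that workaround, assuming passwordless SSH to the controllers and hypothetical node names, is below. On Fuel deployments RabbitMQ is usually managed by Pacemaker, so restarting it through the cluster manager may be preferable to calling rabbitmqctl directly.

```python
#!/usr/bin/env python
# Hedged sketch of the workaround: restart the RabbitMQ application on each
# controller over SSH. Hostnames are placeholders; the commands assume
# rabbitmqctl is on PATH and that nothing else restarts the nodes meanwhile.
import subprocess

CONTROLLERS = ["node-1", "node-44", "node-49"]  # hypothetical hostnames

def restart_rabbitmq(host):
    for cmd in ("rabbitmqctl stop_app", "rabbitmqctl start_app"):
        subprocess.check_call(["ssh", host, cmd])
        print("%s: '%s' done" % (host, cmd))

if __name__ == "__main__":
    for controller in CONTROLLERS:
        restart_rabbitmq(controller)
```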
Current plan:
Reproduce the issue once more and investigate it more thoroughly.
Original description by Leontiy Istomin
=======
During a Shaker test (http://
http://
atop SHIFT+P ksoftirqd:
http://
At the time rabbitmq on this controller node (node-49) was down:
from node-1
=INFO REPORT==== 8-Jun-2015:
rabbit on node 'rabbit@node-49' down
from node-44
=INFO REPORT==== 8-Jun-2015:
rabbit on node 'rabbit@node-49' down
Configuration:
Baremetal,
Controllers:3 Computes:47
net_ticktime parameter has been added:
http://
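For reference, the effective net_ticktime can be read back from a running node. The sketch below is only an assumption about how one might verify the parameter (it uses rabbitmqctl eval, which evaluates an Erlang expression on the broker node); it is not taken from this report.

```python
#!/usr/bin/env python
# Hedged sketch: query the running RabbitMQ node for the Erlang kernel
# net_ticktime value mentioned above, via `rabbitmqctl eval`.
import subprocess

def current_net_ticktime():
    out = subprocess.check_output(
        ["rabbitmqctl", "eval", "net_kernel:get_net_ticktime()."])
    return out.decode().strip()

if __name__ == "__main__":
    print("net_ticktime: %s" % current_net_ticktime())
```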
api: '1.0'
astute_sha: 7766818f079881e
auth_required: true
build_id: 2015-06-08_06-13-27
build_number: '521'
feature_groups:
- mirantis
fuel-library_sha: f43c2ae1af3b493
fuel-ostf_sha: 7c938648a246e03
fuelmain_sha: bcc909ffc5dd515
nailgun_sha: 4340d55c1902939
openstack_version: 2014.2.2-6.1
production: docker
python-
release: '6.1'
Diagnostic Snapshot: http://
As far as I can see, the timestamp of the first "timed out waiting for reply" error fits into the described failover scope, see https://bugs.launchpad.net/fuel/+bug/1460762/comments/27.
So, we should debug this case deeply. Hopefully, some developer from the MOS Oslo team could help us here.
As for the underlying AMQP layer, as I mentioned earlier, I see no issues other than a clean failover.
My guess is that we have a classic A+ P+ C- case here: at least two AMQP nodes were available (A+), the partition was recovered (P+), and some reply_queues were lost (C-), but the application layer failed to survive such a loss.
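To make the symptom concrete: when the reply queue of an in-flight RPC call is lost, the caller never receives an answer and Oslo.messaging eventually raises a timeout, which shows up in the logs as the "Timed out waiting for reply to ID" errors mentioned above. A minimal, hypothetical client sketch is below; the transport URL, topic and method name are placeholders, not values from this bug.

```python
#!/usr/bin/env python
# Hedged sketch: a bare oslo.messaging RPC call that surfaces the reply
# timeout discussed above when its reply queue is lost on the broker side.
import oslo_messaging
from oslo_config import cfg

def ping_service():
    transport = oslo_messaging.get_transport(
        cfg.CONF, url="rabbit://guest:guest@node-49:5672/")  # placeholder URL
    target = oslo_messaging.Target(topic="demo_topic")       # placeholder topic
    client = oslo_messaging.RPCClient(transport, target, timeout=60)
    try:
        return client.call({}, "ping")  # hypothetical method name
    except oslo_messaging.MessagingTimeout:
        # Corresponds to the "Timed out waiting for reply to ID" log errors.
        print("RPC reply timed out")

if __name__ == "__main__":
    ping_service()
```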