MQ downtime after killing master RabbitMQ
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Invalid | High | Dmitry Mescheryakov | |
Mitaka | Fix Released | High | Dmitry Mescheryakov | |
Newton | Invalid | High | Dmitry Mescheryakov | |
Bug Description
Killing the master RabbitMQ server results in ~30 seconds of downtime for the whole cluster.
Steps to reproduce:
0) Install the latest oslo.messaging simulator on one of the nodes (preferably a compute node). Get the connection parameters (username and password) from the service config files.
1) Find which node runs the master by running `pcs resource`.
2) Start simulator server, e.g. "python simulator.py --url rabbit:
3) Start simulator client, e.g. "python simulator.py --url rabbit:
4) After several seconds, kill the RabbitMQ process (beam) on the master node.
Killing the process itself does not affect throughput much, but the "after-shock" actions do. In the following experiment the downtime was about 30 seconds.
The experiment
~~~~~~~~~~~~~
Controllers: node-123 (master), node-111 and node-58.
Simulator Server is connected to node-123, the client to node-111.
The moment of killing master RabbitMQ process:
client
2016-03-18 13:20:15,488 INFO root client-0 : seq: 13 count: 588 bytes: 1885040
2016-03-18 13:20:16,488 INFO root client-0 : seq: 14 count: 170 bytes: 530547
2016-03-18 13:20:17,488 INFO root client-0 : seq: 15 count: 471 bytes: 1483060
server
2016-03-18 13:20:15,399 INFO root server : seq: 25 count: 586 bytes: 1888334 latency: 0.002 min: 0.002 max: 0.012
2016-03-18 13:20:15,784 ERROR oslo.messaging.
2016-03-18 13:20:16,399 INFO root server : seq: 26 count: 223 bytes: 702102 latency: 0.002 min: 0.002 max: 0.003
2016-03-18 13:20:16,789 ERROR oslo.messaging.
2016-03-18 13:20:17,399 INFO root server : seq: 27 count: 0 bytes: 0
2016-03-18 13:20:17,802 INFO oslo.messaging.
2016-03-18 13:20:18,402 INFO root server : seq: 28 count: 840 bytes: 2702260 latency: 0.648 min: 0.064 max: 2.061
So the overall downtime from the server's point of view is 2 seconds.
The issues start about 1 minute later:
server
2016-03-18 13:21:19,422 INFO root server : seq: 89 count: 0 bytes: 0
2016-03-18 13:21:20,410 ERROR oslo.messaging.
2016-03-18 13:21:26,381 ERROR oslo.messaging.
2016-03-18 13:21:20,422 INFO root server : seq: 90 count: 10 bytes: 34170 latency: 0.002 min: 0.002 max: 0.002
2016-03-18 13:21:21,425 INFO root server : seq: 91 count: 895 bytes: 2881936 latency: 0.002 min: 0.002 max: 0.010
2016-03-18 13:21:22,426 INFO root server : seq: 92 count: 1015 bytes: 3241000 latency: 0.002 min: 0.001 max: 0.010
2016-03-18 13:21:23,426 INFO root server : seq: 93 count: 1022 bytes: 3269220 latency: 0.002 min: 0.001 max: 0.006
2016-03-18 13:21:24,427 INFO root server : seq: 94 count: 949 bytes: 3050223 latency: 0.002 min: 0.002 max: 0.011
ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2016-03-18 13:21:25,427 INFO root server : seq: 95 count: 899 bytes: 2865040 latency: 0.002 min: 0.001 max: 0.009
2016-03-18 13:21:26,426 INFO root server : seq: 96 count: 0 bytes: 0
------ // -------
2016-03-18 13:21:54,448 INFO root server : seq: 124 count: 0 bytes: 0
2016-03-18 13:21:54,460 INFO oslo.messaging.
2016-03-18 13:21:55,449 INFO root server : seq: 125 count: 979 bytes: 3117822 latency: 0.002 min: 0.001 max: 0.011
The overall downtime is 30 seconds, with attempts to reconnect to different endpoints (see logs attached).
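The downtime figures above can be reproduced from the simulator output. The following is a minimal sketch, not part of the simulator itself: the names `LINE_RE`, `parse_samples` and `downtime_windows` are hypothetical, and it assumes that a per-second sample with `count: 0` means no messages were processed. A window is measured from the last non-zero sample before the gap to the first non-zero sample after it, so with one sample per second the result is an upper bound on the real outage.

```python
import re
from datetime import datetime

# Matches the per-second statistics lines printed by the simulator, e.g.
# "2016-03-18 13:20:17,399 INFO root server : seq: 27 count: 0 bytes: 0"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO root "
    r"(?:server|client-\d+)\s*: seq: \d+ count: (?P<count>\d+)"
)


def parse_samples(lines):
    """Return (timestamp, count) pairs; non-statistics lines are skipped."""
    samples = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ERROR/INFO lines from oslo.messaging, separators, etc.
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        samples.append((ts, int(match.group("count"))))
    return samples


def downtime_windows(samples):
    """Return (last_ok, recovered, seconds) for each run of zero-count samples."""
    windows = []
    last_ok = None
    in_gap = False
    for ts, count in samples:
        if count == 0:
            in_gap = True
            continue
        if in_gap and last_ok is not None:
            windows.append((last_ok, ts, (ts - last_ok).total_seconds()))
        in_gap = False
        last_ok = ts
    return windows


if __name__ == "__main__":
    # Server samples taken from the second excerpt in this report.
    excerpt = [
        "2016-03-18 13:21:25,427 INFO root server : seq: 95 count: 899 bytes: 2865040 latency: 0.002 min: 0.001 max: 0.009",
        "2016-03-18 13:21:26,426 INFO root server : seq: 96 count: 0 bytes: 0",
        "2016-03-18 13:21:54,448 INFO root server : seq: 124 count: 0 bytes: 0",
        "2016-03-18 13:21:55,449 INFO root server : seq: 125 count: 979 bytes: 3117822 latency: 0.002 min: 0.001 max: 0.011",
    ]
    for start, end, seconds in downtime_windows(parse_samples(excerpt)):
        print(f"{start} -> {end}: {seconds:.3f} s")
```

Running the same functions over the full server log should also report the shorter gap right after the kill, since each run of zero-count samples produces its own window.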
Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 9.0
tags: added: 10.0-reviewed
tags: added: rabbitmq
VERSION:
feature_groups:
- mirantis
production: "docker"
release: "8.0"
api: "1.0"
build_number: "570"
build_id: "570"
fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"