Can't failover when rabbit_hosts is configured as 3 hosts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Pike | Fix Released | High | Unassigned |
oslo.messaging | Fix Released | Undecided | Vincent Untz |
python-oslo.messaging (Ubuntu) | Invalid | Undecided | Unassigned |
Artful | Fix Released | High | Felipe Reyes |
Bug Description
[Impact]
When the heartbeat connection times out, the timeout is not treated as a recoverable error, and no reconnect is attempted by calling ensure_
[Test Case]
* deploy openstack
bzr branch lp:openstack-charm-testing
cd openstack-
juju deployer -c default.yaml -d -v artful-pike
juju add-unit rabbitmq-server
* Force timeout using iptables in a rabbitmq-server node
sudo iptables -I INPUT -p tcp --dport 5672 -j DROP
Expected result:
once the timeout happens, the heartbeat thread reconnects (picking a new rabbit host if needed).
Actual result:
the heartbeat thread is left in a loop (connect, socket closed, retry, connect...)
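To illustrate the expected result, here is a minimal sketch of a heartbeat that treats a timeout as recoverable and moves on to the next configured host. It is not the oslo.messaging implementation: the HeartbeatConnection class, the fake_check helper and the rabbit1/rabbit2/rabbit3 host names are all hypothetical and exist only for this example.

```python
import itertools
import socket


class HeartbeatConnection:
    """Toy stand-in for a broker connection (hypothetical, not oslo.messaging)."""

    def __init__(self, hosts, check):
        # hosts: list of "host:port" strings.
        # check: callable that raises socket.timeout when a heartbeat
        #        to the given host times out.
        self._hosts = itertools.cycle(hosts)
        self._check = check
        self.current = next(self._hosts)

    def heartbeat(self, max_attempts=10):
        """Send one heartbeat, failing over to the next host on timeout."""
        for _ in range(max_attempts):
            try:
                self._check(self.current)
                return self.current
            except socket.timeout:
                # Treat the timeout as recoverable: pick the next host
                # instead of retrying the same one forever.
                self.current = next(self._hosts)
        raise ConnectionError("no reachable host after %d attempts" % max_attempts)


def fake_check(host):
    # Assumption for the demo: only the first host is unreachable.
    if host == "rabbit1:5672":
        raise socket.timeout("heartbeat timed out")


conn = HeartbeatConnection(
    ["rabbit1:5672", "rabbit2:5672", "rabbit3:5672"], fake_check
)
print(conn.heartbeat())  # prints "rabbit2:5672"
```

With the behaviour described under "Actual result", the loop would instead keep calling the check against rabbit1:5672 and never reach the other hosts.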
[Regression Potential]
Without this patch, when the heartbeat connection times out, the library does not attempt to connect to the next configured rabbit host. The risk is that, in situations where daemons using this library currently manage to reconnect to the same host (e.g. the host is unreachable for only a few seconds), with this change they will reconnect to the next host instead, so users may see connections flapping between two (or more) rabbit hosts.
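The flapping risk can be made concrete with a small simulation. This is only a sketch under stated assumptions: the count_host_switches function, the tick-based outage list and the host names are hypothetical, and real reconnect timing in the library is more involved.

```python
def count_host_switches(outages, hosts, rotate_on_timeout):
    """Count how often a client changes host.

    outages: one entry per heartbeat tick, naming the host that times out
             on that tick (None means every host is healthy).
    rotate_on_timeout: True models the patched behaviour (move to the next
             host); False models retrying the same host.
    """
    current = 0
    switches = 0
    for down in outages:
        if down == hosts[current] and rotate_on_timeout:
            current = (current + 1) % len(hosts)
            switches += 1
    return switches


hosts = ["rabbit1", "rabbit2"]
# Short blips: rabbit1, then rabbit2, then rabbit1 again.
outages = ["rabbit1", None, "rabbit2", None, "rabbit1"]

print(count_host_switches(outages, hosts, rotate_on_timeout=False))  # prints 0
print(count_host_switches(outages, hosts, rotate_on_timeout=True))   # prints 3
```

In this toy model each brief outage moves the client to another host once rotation is enabled, which is the flapping described above; the benefit is that a host that stays down no longer traps the client.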
[Other Info]
I have a rabbitmq cluster of 3 nodes
root@47704165d2
Cluster status of node rabbit@47704165d2bb ...
[{nodes,
{running_
{cluster_
{partitions,[]},
{alarms,
root@47704165d2
Listing policies ...
/ ha-all all ^ha\\. {"ha-mode":"all"} 0
My oslo.messaging client configuration:
[oslo_messaging
rabbit_
rabbit_userid=cloud
rabbit_
rabbit_
rabbit_
rabbit_
rabbit_
rabbit_
When I run "service rabbitmq-server stop" on one node to simulate a failure, I get the following error logs, and the consumer can't fail over from the bad node. It keeps reconnecting to the failed node forever instead of trying the other nodes. "kombu_
2009-01-13 18:32:42.785 17 ERROR oslo.messaging.
2009-01-13 18:32:43.819 17 ERROR oslo.messaging.
2009-01-13 18:32:43.819 17 WARNING oslo.messaging.
2009-01-13 18:32:58.874 17 ERROR oslo.messaging.
2009-01-13 18:32:59.907 17 ERROR oslo.messaging.
2009-01-13 18:32:59.907 17 WARNING oslo.messaging.
Can anyone help? Thanks!
Changed in cloud-archive: | |
status: | New → Invalid |
Changed in python-oslo.messaging (Ubuntu): | |
status: | New → Invalid |
Changed in python-oslo.messaging (Ubuntu Artful): | |
status: | New → Triaged |
importance: | Undecided → High |
description: | updated |
Fix proposed to branch: master
Review: https://review.openstack.org/519701