Cannot reconnect to rabbitmq-server after power-off 1-node on 3-node clustering controllers
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| oslo.messaging | Fix Released | Undecided | Herve Beraud | |
Bug Description
On a 3-node clustered controllers environment, force power off controller-1.
After the messaging heartbeat connection times out, the client is not able to reconnect to another messaging node.
The warning message "Unexpected error during heartbeart thread processing, retrying..." was repeated for a long time:
$ grep "Unexpected error during heartbeart thread processing" nova-scheduler.log | grep "13 10:[4-5]"
2018-12-13 10:40:08.587 1 WARNING oslo.messaging.
2018-12-13 10:40:08.588 1 WARNING oslo.messaging.
2018-12-13 10:40:53.682 1 WARNING oslo.messaging.
2018-12-13 10:40:53.682 1 WARNING oslo.messaging.
2018-12-13 10:41:38.773 1 WARNING oslo.messaging.
2018-12-13 10:41:38.774 1 WARNING oslo.messaging.
2018-12-13 10:42:23.870 1 WARNING oslo.messaging.
2018-12-13 10:42:23.870 1 WARNING oslo.messaging.
2018-12-13 10:43:08.961 1 WARNING oslo.messaging.
2018-12-13 10:43:08.962 1 WARNING oslo.messaging.
2018-12-13 10:43:54.055 1 WARNING oslo.messaging.
2018-12-13 10:43:54.056 1 WARNING oslo.messaging.
2018-12-13 10:44:39.149 1 WARNING oslo.messaging.
2018-12-13 10:44:39.150 1 WARNING oslo.messaging.
2018-12-13 10:45:24.243 1 WARNING oslo.messaging.
2018-12-13 10:46:09.340 1 WARNING oslo.messaging.
Versions:
oslo-messaging-
kombu-4.0.2
amqp-2.3.2
How to reproduce:
Step 1. Deploy a 3-node controller cluster environment.
Step 2. Confirm that booting a VM is possible.
Step 3. Power off controller-1 via IPMI, and check that its status has changed to OFFLINE using the pcs status command.
i) Actual results:
The following messages are recorded repeatedly in nova-scheduler.log, and the connection destination did not change for a long time.
2018-12-13 10:40:08.587 1 WARNING oslo.messaging.
j) Expected results:
ensure_connection() is executed and the client quickly reconnects to another messaging server.
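The expected failover behavior can be sketched as follows. All names below are illustrative, not the oslo.messaging API: when the current broker is unreachable, the client should try the remaining cluster members instead of retrying the dead node forever.

```python
# Sketch of the expected failover: iterate over the cluster members and
# return the first connection that succeeds. Hypothetical helper names.

def reconnect(hosts, connect):
    """Try each candidate host; return the first successful connection."""
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except OSError as exc:  # covers ConnectionRefusedError, "No route to host"
            last_error = exc
    raise last_error

# Simulate a 3-node cluster where controller-1 has been powered off.
def fake_connect(host):
    if host == "controller-1":
        raise ConnectionRefusedError("connection refused")
    return "connected to " + host

print(reconnect(["controller-1", "controller-2", "controller-3"], fake_connect))
# -> connected to controller-2
```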
k) Additional information
It seems that the exception is not caught correctly in _heartbeat_
It is expected that the exception raised in (*1) will be caught in (*2).
In that case, ensure_connection() (*3) is executed, so the connection destination would be switched.
However, in fact the exception was caught in (*4), so the connection destination was not switched and the warning log (*5) continued to be recorded.
oslo_
def _heartbeat_
"""Thread that maintains inactive connections
"""
while not self._heartbeat
with self._connectio
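The intended fix can be sketched with a minimal, simplified model (one tick per call rather than the real thread loop; assuming a connection object with heartbeat_check() and ensure_connection() methods, as kombu connections provide): socket-level errors raised by the heartbeat check must be caught so that ensure_connection() can switch brokers.

```python
# Simplified sketch of the heartbeat fix; not the actual oslo.messaging code.

class HeartbeatLoop:
    """One heartbeat tick; the real code runs this in a dedicated thread."""

    def __init__(self, connection):
        self._connection = connection

    def step(self):
        try:
            self._connection.heartbeat_check()
            return True
        except (ConnectionRefusedError, OSError):
            # Without this clause the error escapes to a generic handler, is
            # logged as "Unexpected error ...", and no failover happens.
            self._connection.ensure_connection()
            return False


class FakeConnection:
    """Stand-in broker connection: fails twice, then recovers."""

    def __init__(self):
        self.failures = 2
        self.ensured = 0

    def heartbeat_check(self):
        if self.failures:
            self.failures -= 1
            raise ConnectionRefusedError("node gone")

    def ensure_connection(self):
        self.ensured += 1


conn = FakeConnection()
loop = HeartbeatLoop(conn)
results = [loop.step() for _ in range(4)]
print(results, conn.ensured)  # -> [False, False, True, True] 2
```

The key point is that ensure_connection() runs once per failed heartbeat, giving the driver a chance to pick another cluster node.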
Changed in oslo.messaging:
assignee: nobody → Herve Beraud (herveberaud)
status: New → In Progress
Reviewed: https://review.opendev.org/656902
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Submitter: Zuul
Branch: master
commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200
Fix switch connection destination when a rabbitmq cluster node disappear
In a clustered rabbitmq, when a node disappears we get a ConnectionRefusedError
because the socket gets disconnected.
The socket access yields an OSError because the heartbeat
tries to reach an unreachable host (No route to host).
Catch these exceptions to ensure that we call ensure_connection for switching
the connection destination.
POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc
Example:
$ git clone <email address hidden>
$ cd rabbitmq-oslo_messging-error-poc
$ python -m virtualenv .
$ source bin/activate
$ pip install -r requirements.txt
$ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
$ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')
And in parallel, in another shell/tmux:
$ podman stop rabbit
$ # observe the output of the poc.py script we now call ensure_connection
Now you can observe output showing the connection error being caught, which did not happen before these changes.
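A single except clause can cover both failure modes named in the commit message because ConnectionRefusedError sits inside Python's OSError hierarchy; a minimal stdlib-only demonstration (no rabbitmq required):

```python
import errno

# ConnectionRefusedError (node down, port closed) and "No route to host"
# (EHOSTUNREACH) are both OSError instances, so one `except OSError` in the
# heartbeat loop handles both.
assert issubclass(ConnectionRefusedError, ConnectionError)
assert issubclass(ConnectionError, OSError)

no_route = OSError(errno.EHOSTUNREACH, "No route to host")
refused = ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")

caught = []
for exc in (no_route, refused):
    try:
        raise exc
    except OSError as e:
        caught.append(type(e).__name__)

print(caught)  # -> ['OSError', 'ConnectionRefusedError']
```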
Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399
Closes-Bug: #1828841
Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3