Cannot reconnect to rabbitmq-server after powering off one node of a 3-node controller cluster

Bug #1828841 reported by Herve Beraud
This bug affects 2 people
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Herve Beraud

Bug Description

On a 3-node clustered controller environment, controller-1 was forcibly powered off.
   After the messaging heartbeat connection timed out, the client could not reconnect to another messaging node.
   The "Unexpected error during heartbeart thread processing, retrying..." warning message was repeated for a long time.

     $ grep "Unexpected error during heartbeart thread processing" nova-scheduler.log | grep "13 10:[4-5]"
     2018-12-13 10:40:08.587 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:08.588 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:53.682 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:53.682 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:41:38.773 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:41:38.774 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:42:23.870 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:42:23.870 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:08.961 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:08.962 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:54.055 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:54.056 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:44:39.149 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:44:39.150 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:45:24.243 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:46:09.340 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH

Versions:
oslo-messaging-5.35.3
kombu-4.0.2
amqp-2.3.2

How to reproduce:

   Step 1. Deploy a clustered environment with 3 controller nodes.

   Step 2. Confirm that booting a VM is possible.

   Step 3. Power off controller-1 via IPMI, and check that its status changes to OFFLINE in the output of the pcs status command.

i) Actual results:

   The following message was recorded repeatedly in nova-scheduler.log, and the connection destination was not changed for a long time.

     2018-12-13 10:40:08.587 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH

j) Expected results:

   The driver should ensure the connection and reconnect to another messaging server quickly.
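
   For context, the rabbit driver builds on kombu, and kombu can be handed several broker URLs to fail over between when a connection is (re-)established. The snippet below is an illustration only; the host names and credentials are placeholders, not values from this report.

      # Illustrative only: kombu failing over to an alternate broker URL when
      # ensure_connection() has to re-establish the link. Hosts and
      # credentials are placeholders.
      import kombu

      conn = kombu.Connection(
          'amqp://guest:guest@controller-1:5672//;'
          'amqp://guest:guest@controller-2:5672//;'
          'amqp://guest:guest@controller-3:5672//',
          failover_strategy='round-robin')

      # If controller-1 is unreachable, failed attempts move on to the next
      # URL instead of retrying the dead node forever.
      conn.ensure_connection(max_retries=5)
      print(conn.as_uri())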

k) Additional information

   It seems that the exception is not caught correctly in _heartbeat_thread_job().
   The exception raised at (*1) is expected to be caught at (*2).
   In that case, ensure_connection() (*3) would be executed, so the connection destination would be switched.
   However, the exception is actually caught at (*4), so the connection destination is not switched and the warning log (*5) keeps being recorded.

   oslo_messaging/_drivers/impl_rabbit.py:

      def _heartbeat_thread_job(self):
          """Thread that maintains inactive connections
          """
          while not self._heartbeat_exit_event.is_set():
              with self._connection_lock.for_heartbeat():

                  try:
                      try:
                          self._heartbeat_check()  # (*1)
                          # NOTE(sileht): We need to drain event to receive
                          # heartbeat from the broker but don't hold the
                          # connection too much times. In amqpdriver a connection
                          # is used exclusively for read or for write, so we have
                          # to do this for connection used for write drain_events
                          # already do that for other connection
                          try:
                              self.connection.drain_events(timeout=0.001)
                          except socket.timeout:
                              pass
                      except (socket.timeout,  # (*2)
                              kombu.exceptions.OperationalError) as exc:
                          LOG.info(_LI("A recoverable connection/channel error "
                                       "occurred, trying to reconnect: %s"), exc)
                          self.ensure_connection()  # (*3)
                  except Exception:  # (*4)
                      LOG.warning(_LW("Unexpected error during heartbeart "
                                      "thread processing, retrying..."))  # (*5)
                      LOG.debug('Exception', exc_info=True)

              self._heartbeat_exit_event.wait(
                  timeout=self._heartbeat_wait_timeout)
          self._heartbeat_exit_event.clear()
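
   The fix that eventually merged (see the commit below) takes the approach of treating these errors as recoverable so that ensure_connection() runs. The following is a minimal, self-contained sketch of that behaviour; the exception list is approximated from the commit message rather than copied from the patch, and the helper names are illustrative, not oslo.messaging internals.

      # Sketch only: show EHOSTUNREACH-style errors reaching the "recoverable"
      # handler instead of the generic one. Helper names are illustrative.
      import errno
      import socket

      import kombu.exceptions

      def check_heartbeat():
          # Stand-in for self._heartbeat_check(): the current node vanished.
          raise OSError(errno.EHOSTUNREACH, 'No route to host')

      def reconnect():
          # Stand-in for self.ensure_connection(): would pick another node.
          print('ensure_connection() called, switching broker')

      def heartbeat_iteration():
          try:
              try:
                  check_heartbeat()
              except (socket.timeout,
                      ConnectionRefusedError,   # node shut down, socket closed
                      OSError,                  # e.g. [Errno 113] EHOSTUNREACH
                      kombu.exceptions.OperationalError):
                  reconnect()                   # (*3) now actually runs
          except Exception:
              # Before the fix, OSError fell through to here (*4) and only a
              # warning (*5) was logged, so the destination never changed.
              print('Unexpected error during heartbeat thread processing')

      heartbeat_iteration()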

Changed in oslo.messaging:
assignee: nobody → Herve Beraud (herveberaud)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/656902
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Submitter: Zuul
Branch: master

commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
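
The poc.py script mentioned in the commit above lives in the linked repository and is not reproduced here. The stand-in below only illustrates the same idea with plain kombu; the broker URL, credentials and retry handling are assumptions, not taken from the actual POC.

      # Illustrative stand-in only (not the actual poc.py from the repository):
      # keep a heartbeat running and call ensure_connection() once the broker
      # disappears. Credentials and timings are assumptions.
      import socket
      import sys
      import time

      import kombu

      def main(host):
          conn = kombu.Connection('amqp://guest:guest@%s:5672//' % host,
                                  heartbeat=10)
          conn.ensure_connection(max_retries=3)
          while True:
              try:
                  conn.heartbeat_check()
                  conn.drain_events(timeout=1)
              except socket.timeout:
                  pass
              except conn.recoverable_connection_errors as exc:
                  # Covers OSError (e.g. EHOSTUNREACH) and AMQP connection
                  # errors once "podman stop rabbit" kills the node.
                  print('broker lost (%s); calling ensure_connection()' % exc)
                  conn.ensure_connection(
                      errback=lambda exc, interval:
                          print('retrying in %ss: %s' % (interval, exc)))
              time.sleep(2)

      if __name__ == '__main__':
          main(sys.argv[1])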

Changed in oslo.messaging:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/660277

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/660278

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/660280

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.7.0

This issue was fixed in the openstack/oslo.messaging 9.7.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/stein)

Reviewed: https://review.opendev.org/660277
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=7b8fd6370c8b04c3836ee4f9d06eaa90c7be5197
Submitter: Zuul
Branch: stable/stein

commit 7b8fd6370c8b04c3836ee4f9d06eaa90c7be5197
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

tags: added: in-stable-stein
tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/rocky)

Reviewed: https://review.opendev.org/660278
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=26fccea84371a3690a4d7fe970f8d6c5ba9c154b
Submitter: Zuul
Branch: stable/rocky

commit 26fccea84371a3690a4d7fe970f8d6c5ba9c154b
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/queens)

Reviewed: https://review.opendev.org/660280
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a053b1db0faa636cc226fca506a1180694b05e55
Submitter: Zuul
Branch: stable/queens

commit a053b1db0faa636cc226fca506a1180694b05e55
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.35.6

This issue was fixed in the openstack/oslo.messaging 5.35.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 8.1.4

This issue was fixed in the openstack/oslo.messaging 8.1.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.5.1

This issue was fixed in the openstack/oslo.messaging 9.5.1 release.
