Cannot reconnect to rabbitmq-server after powering off one node of a 3-node controller cluster

Bug #1828841 reported by Herve Beraud
This bug affects 2 people
Affects: oslo.messaging
Status: Fix Released
Importance: Undecided
Assigned to: Herve Beraud

Bug Description

On a 3-node clustered controller environment, controller-1 was forcibly powered off.
   After the messaging heartbeat connection timed out, the client could not reconnect to another messaging node.
   The "Unexpected error during heartbeart thread processing, retrying..." warning message was repeated for a long time.

     $ grep "Unexpected error during heartbeart thread processing" nova-scheduler.log | grep "13 10:[4-5]"
     2018-12-13 10:40:08.587 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:08.588 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:53.682 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:40:53.682 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:41:38.773 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:41:38.774 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:42:23.870 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:42:23.870 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:08.961 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:08.962 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:54.055 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:43:54.056 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:44:39.149 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:44:39.150 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:45:24.243 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH
     2018-12-13 10:46:09.340 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH

Versions:
oslo-messaging-5.35.3
kombu-4.0.2
amqp-2.3.2

How to reproduce:

   Step 1. Deploy a clustered environment with 3 controller nodes.

   Step 2. Confirm that booting a VM is possible.

   Step 3. Power off controller-1 via IPMI, and check that its status changes to OFFLINE in the output of the pcs status command.

i) Actual results:

   The following message was recorded repeatedly in nova-scheduler.log, and the connection destination was not changed for a long time.

     2018-12-13 10:40:08.587 1 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 113] EHOSTUNREACH

j) Expected results:

   The driver should ensure the connection and reconnect to another messaging server quickly.
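
   For context, the rabbit driver builds on kombu, and kombu can be handed several broker URLs to fail over between when a connection is (re-)established. The snippet below is an illustration only; the host names and credentials are placeholders, not values from this report.

      # Illustrative only: kombu failing over to an alternate broker URL when
      # ensure_connection() has to re-establish the link. Hosts and
      # credentials are placeholders.
      import kombu

      conn = kombu.Connection(
          'amqp://guest:guest@controller-1:5672//;'
          'amqp://guest:guest@controller-2:5672//;'
          'amqp://guest:guest@controller-3:5672//',
          failover_strategy='round-robin')

      # If controller-1 is unreachable, failed attempts move on to the next
      # URL instead of retrying the dead node forever.
      conn.ensure_connection(max_retries=5)
      print(conn.as_uri())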

k) Additional information

   It seems that the exception is not caught correctly in _heartbeat_thread_job().
   The exception raised at (*1) is expected to be caught at (*2).
   In that case, ensure_connection() (*3) would be executed, so the connection destination would be switched.
   However, the exception is actually caught at (*4), so the connection destination is not switched and the warning log (*5) keeps being recorded.

   oslo_messaging/_drivers/impl_rabbit.py:

      def _heartbeat_thread_job(self):
          """Thread that maintains inactive connections
          """
          while not self._heartbeat_exit_event.is_set():
              with self._connection_lock.for_heartbeat():

                  try:
                      try:
                          self._heartbeat_check()  # (*1)
                          # NOTE(sileht): We need to drain event to receive
                          # heartbeat from the broker but don't hold the
                          # connection too much times. In amqpdriver a connection
                          # is used exclusively for read or for write, so we have
                          # to do this for connection used for write drain_events
                          # already do that for other connection
                          try:
                              self.connection.drain_events(timeout=0.001)
                          except socket.timeout:
                              pass
                      except (socket.timeout,  # (*2)
                              kombu.exceptions.OperationalError) as exc:
                          LOG.info(_LI("A recoverable connection/channel error "
                                       "occurred, trying to reconnect: %s"), exc)
                          self.ensure_connection()  # (*3)
                  except Exception:  # (*4)
                      LOG.warning(_LW("Unexpected error during heartbeart "
                                      "thread processing, retrying..."))  # (*5)
                      LOG.debug('Exception', exc_info=True)

              self._heartbeat_exit_event.wait(
                  timeout=self._heartbeat_wait_timeout)
          self._heartbeat_exit_event.clear()
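
   The fix that eventually merged (see the commit below) takes the approach of treating these errors as recoverable so that ensure_connection() runs. The following is a minimal, self-contained sketch of that behaviour; the exception list is approximated from the commit message rather than copied from the patch, and the helper names are illustrative, not oslo.messaging internals.

      # Sketch only: show EHOSTUNREACH-style errors reaching the "recoverable"
      # handler instead of the generic one. Helper names are illustrative.
      import errno
      import socket

      import kombu.exceptions

      def check_heartbeat():
          # Stand-in for self._heartbeat_check(): the current node vanished.
          raise OSError(errno.EHOSTUNREACH, 'No route to host')

      def reconnect():
          # Stand-in for self.ensure_connection(): would pick another node.
          print('ensure_connection() called, switching broker')

      def heartbeat_iteration():
          try:
              try:
                  check_heartbeat()
              except (socket.timeout,
                      ConnectionRefusedError,   # node shut down, socket closed
                      OSError,                  # e.g. [Errno 113] EHOSTUNREACH
                      kombu.exceptions.OperationalError):
                  reconnect()                   # (*3) now actually runs
          except Exception:
              # Before the fix, OSError fell through to here (*4) and only a
              # warning (*5) was logged, so the destination never changed.
              print('Unexpected error during heartbeat thread processing')

      heartbeat_iteration()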

Changed in oslo.messaging:
assignee: nobody → Herve Beraud (herveberaud)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.opendev.org/656902
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Submitter: Zuul
Branch: master

commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
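
The poc.py script mentioned in the commit above lives in the linked repository and is not reproduced here. The stand-in below only illustrates the same idea with plain kombu; the broker URL, credentials and retry handling are assumptions, not taken from the actual POC.

      # Illustrative stand-in only (not the actual poc.py from the repository):
      # keep a heartbeat running and call ensure_connection() once the broker
      # disappears. Credentials and timings are assumptions.
      import socket
      import sys
      import time

      import kombu

      def main(host):
          conn = kombu.Connection('amqp://guest:guest@%s:5672//' % host,
                                  heartbeat=10)
          conn.ensure_connection(max_retries=3)
          while True:
              try:
                  conn.heartbeat_check()
                  conn.drain_events(timeout=1)
              except socket.timeout:
                  pass
              except conn.recoverable_connection_errors as exc:
                  # Covers OSError (e.g. EHOSTUNREACH) and AMQP connection
                  # errors once "podman stop rabbit" kills the node.
                  print('broker lost (%s); calling ensure_connection()' % exc)
                  conn.ensure_connection(
                      errback=lambda exc, interval:
                          print('retrying in %ss: %s' % (interval, exc)))
              time.sleep(2)

      if __name__ == '__main__':
          main(sys.argv[1])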

Changed in oslo.messaging:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/660277

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/660278

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/660280

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.7.0

This issue was fixed in the openstack/oslo.messaging 9.7.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/stein)

Reviewed: https://review.opendev.org/660277
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=7b8fd6370c8b04c3836ee4f9d06eaa90c7be5197
Submitter: Zuul
Branch: stable/stein

commit 7b8fd6370c8b04c3836ee4f9d06eaa90c7be5197
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

tags: added: in-stable-stein
tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/rocky)

Reviewed: https://review.opendev.org/660278
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=26fccea84371a3690a4d7fe970f8d6c5ba9c154b
Submitter: Zuul
Branch: stable/rocky

commit 26fccea84371a3690a4d7fe970f8d6c5ba9c154b
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (stable/queens)

Reviewed: https://review.opendev.org/660280
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=a053b1db0faa636cc226fca506a1180694b05e55
Submitter: Zuul
Branch: stable/queens

commit a053b1db0faa636cc226fca506a1180694b05e55
Author: Hervé Beraud <email address hidden>
Date: Fri May 3 00:55:56 2019 +0200

    Fix switch connection destination when a rabbitmq cluster node disappear

    In a clustered RabbitMQ deployment, when a node disappears we get a
    ConnectionRefusedError because the socket gets disconnected.

    The socket access yields an OSError because the heartbeat
    tries to reach an unreachable host (No route to host).

    Catch these exceptions to ensure that we call ensure_connection for
    switching the connection destination.

    A POC is available at github.com:4383/rabbitmq-oslo_messging-error-poc

    Example:
        $ git clone git@github.com:4383/rabbitmq-oslo_messging-error-poc
        $ cd rabbitmq-oslo_messging-error-poc
        $ python -m virtualenv .
        $ source bin/activate
        $ pip install -r requirements.txt
        $ sudo podman run -d --hostname my-rabbit --name rabbit rabbitmq:3
        $ python poc.py $(sudo podman inspect rabbit | niet '.[0].NetworkSettings.IPAddress')

    And in parallel, in another shell or tmux pane:
        $ podman stop rabbit
        $ # observe the output of the poc.py script: we now call ensure_connection

    You can now observe output showing the connection being switched; before
    these changes the error was not caught.

    Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1665399

    Closes-Bug: #1828841

    Change-Id: I9dc1644cac0e39eb11bf05f57bde77dcf6d42ed3
    (cherry picked from commit 9d8b1430e5c081b081c0e3c0b5f12f744dc7809d)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.35.6

This issue was fixed in the openstack/oslo.messaging 5.35.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 8.1.4

This issue was fixed in the openstack/oslo.messaging 8.1.4 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 9.5.1

This issue was fixed in the openstack/oslo.messaging 9.5.1 release.
