# Versions
- oslo.messaging 12.9.1
- rabbitmq 3.9.8
- ubuntu 20.04
Hi,
We are seeing services fail to recover after they lose connectivity to the RabbitMQ cluster. We have seen this with Nova, Neutron and Cinder services in particular, across all of our deployments. When it occurs, the following greenlet-related traceback always appears in the service logs, after a number of reconnection messages (example from nova-compute):
Feb 18 08:42:33 compute102 nova-compute[1402787]: 2022-02-18 08:42:33.514 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:34 compute102 nova-compute[1402787]: 2022-02-18 08:42:34.517 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.050 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.520 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.052 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.521 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:37 compute102 nova-compute[1402787]: 2022-02-18 08:42:37.053 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.055 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.524 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:39 compute102 nova-compute[1402787]: 2022-02-18 08:42:39.526 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.058 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.527 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:41 compute102 nova-compute[1402787]: 2022-02-18 08:42:41.060 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:42 compute102 nova-compute[1402787]: 2022-02-18 08:42:42.062 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:44 compute102 nova-compute[1402787]: 2022-02-18 08:42:44.532 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:45 compute102 nova-compute[1402787]: 2022-02-18 08:42:45.534 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.067 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.536 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:47 compute102 nova-compute[1402787]: 2022-02-18 08:42:47.068 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:48 compute102 nova-compute[1402787]: 2022-02-18 08:42:48.070 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:52 compute102 nova-compute[1402787]: 2022-02-18 08:42:52.543 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:53 compute102 nova-compute[1402787]: 2022-02-18 08:42:53.545 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.077 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.546 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:55 compute102 nova-compute[1402787]: 2022-02-18 08:42:55.079 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:56 compute102 nova-compute[1402787]: 2022-02-18 08:42:56.080 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.700 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.701 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.702 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 101] ENETUNREACH (retrying in 0 seconds): OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: Traceback (most recent call last):
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 18 08:42:58 compute102 nova-compute[1402787]: timer()
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 18 08:42:58 compute102 nova-compute[1402787]: cb(*args, **kw)
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 18 08:42:58 compute102 nova-compute[1402787]: waiter.switch()
Feb 18 08:42:58 compute102 nova-compute[1402787]: greenlet.error: cannot switch to a different thread
Typically, if the RabbitMQ cluster is taken down, this affects ~5% of the services in the deployment, all of which must be restarted in order to recover. Similar recovery issues have been seen if the host's network interface is taken down and brought back up (which is how the above traceback was generated).
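To find which services need a restart after an outage, a rough sketch along these lines may help. It scans journal-style lines like the ones above for the greenlet error and counts hits per host, unit and PID. The function name and the regular expression are illustrative assumptions based only on the log format shown here, not part of oslo.messaging or any OpenStack tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the logs above:
#   "Feb 18 08:42:58 compute102 nova-compute[1402787]: <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+"
    r"(?P<unit>[^\s\[]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)
MARKER = "greenlet.error: cannot switch to a different thread"


def find_stuck_services(lines):
    """Count occurrences of the greenlet error per (host, unit, pid)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and MARKER in match.group("msg"):
            key = (match.group("host"), match.group("unit"), int(match.group("pid")))
            hits[key] += 1
    return hits


# Two lines copied from the excerpt above; only the second contains the marker.
sample = [
    "Feb 18 08:42:58 compute102 nova-compute[1402787]: waiter.switch()",
    "Feb 18 08:42:58 compute102 nova-compute[1402787]: "
    "greenlet.error: cannot switch to a different thread",
]
for (host, unit, pid), count in find_stuck_services(sample).items():
    print(host, unit, pid, count)
```

In practice the lines could come from a saved journal export or from a log file opened with `open()`. Whether a process that logged this error is actually stuck still has to be confirmed separately.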
As far as we can tell, this started to occur at about the same time as https://bugs.launchpad.net/oslo.messaging/+bug/1949964, so around the time of the Wallaby OpenStack release. It also coincided with a switch from TLSv1.0/v1.1 to TLSv1.2 for our RabbitMQ connections, plus a switch to a full PKI with certificate validation, rather than ignoring certificate errors.
Any suggestions for diagnosing this further would be appreciated.
Thanks
After some further testing I suspect this and the unresolved portion of https://bugs.launchpad.net/oslo.messaging/+bug/1949964 have the same root cause as https://bugs.launchpad.net/oslo.messaging/+bug/1934937. I've switched the 'heartbeat_in_pthread' parameter to False on a sample host and will report back next week on whether this resolves the issues.
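For anyone wanting to check the same setting on their own hosts, here is a minimal sketch that reads a service config and reports the value of 'heartbeat_in_pthread'. It assumes the option lives in the `[oslo_messaging_rabbit]` section of an INI-style file such as nova.conf. The helper name is hypothetical, and the path in the usage note is an assumption about a typical install.

```python
import configparser


def heartbeat_in_pthread(conf_text):
    """Return the configured value as a string, or None if the option is unset.

    None means the file does not set the option, so the library default applies.
    """
    # interpolation=None keeps '%' characters in other options from raising;
    # strict=False tolerates duplicate sections or keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get("oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=None)


# Inline example config; a real file would be read from disk instead.
example_conf = """
[DEFAULT]
debug = false

[oslo_messaging_rabbit]
heartbeat_in_pthread = False
"""
print(heartbeat_in_pthread(example_conf))
```

On a real host the text could be loaded with something like `open("/etc/nova/nova.conf").read()`, adjusting the path to the deployment. The service has to be restarted for a changed value to take effect.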