File descriptor leak in RabbitMQ connection pool

Bug #1949964 reported by Andrew Bonney
This bug affects 7 people
Affects: oslo.messaging
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

# Versions
- oslo.messaging 12.7.1
- rabbitmq 3.8.14
- ubuntu 20.04

Hi,
We're observing a file descriptor leak which we've managed to track down as far as the oslo.messaging connection pool. I'm not sure this is quite where the issue lies, but I'd appreciate any suggestions for how to debug it further.

We originally noticed nova-compute services running out of file descriptors. After turning up debug logging and watching for times when the FD count increased or decreased, we've now established a link between those changes and the following log messages:

# Drops are associated with:
Nov 05 08:57:04 **** nova-compute[2708296]: 2021-11-05 08:57:04.574 2708296 DEBUG oslo.messaging._drivers.pool [-] Idle connection has expired and been closed. Pool size: 2 _on_expire /openstack/venvs/nova-23.1.1/lib/python3.8/site-packages/oslo_messaging/_drivers/pool.py:141

# Increases are associated with:
Nov 05 09:17:33 **** nova-compute[2708296]: 2021-11-05 09:17:33.333 2708296 DEBUG oslo.messaging._drivers.pool [-] Pool creating new connection create /openstack/venvs/nova-23.1.1/lib/python3.8/site-packages/oslo_messaging/_drivers/pool.py:145

The services in question appear to create a new connection (presumably taking the pool count up to 3 - we're using the defaults), before later closing that connection and returning the pool size to 2. When the connection count increases, we see file descriptor counts jump by approximately 30, but when connections are closed they only drop by approximately 5. As a result, over time this count continues to increase until we have to restart services.
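
For anyone wanting to reproduce the correlation, a minimal sketch of watching a process's FD count alongside the debug logs (the PID and polling interval below are just placeholders, and you need to run it as the same user or root):

import os
import time

PID = 2708296    # placeholder: the nova-compute PID from the log lines above
INTERVAL = 10    # seconds between samples

prev = None
while True:
    # Every entry in /proc/<pid>/fd is one open file descriptor
    count = len(os.listdir(f"/proc/{PID}/fd"))
    if prev is not None and count != prev:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), f"FD count {prev} -> {count}")
    prev = count
    time.sleep(INTERVAL)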

Whilst we haven't previously monitored the connection pool behaviour, this leak only seems to have started after upgrading from OpenStack Victoria to Wallaby (oslo.messaging 12.5.2 -> 12.7.1), which also involved changing the ssl_version used for RabbitMQ from TLS v1.0 to v1.2.

If there is anything you could suggest to diagnose this further I'd appreciate it.

Thanks

Revision history for this message
Hang Yang (hangyang) wrote :

Hi Andrew,

We're facing a similar issue in our deployment. We also recently upgraded (from Train to Wallaby) and started using RabbitMQ TLS v1.2. We've noticed that neutron-server (rpc_worker) is leaking file descriptors, and we have to restart the service occasionally to avoid reaching the FD limit on the controller node. Most of the leaked FDs are of the a_inode (eventpoll) and sock (tcp) types, but the root cause is still unknown. May I know if you've made any progress on the troubleshooting or found any workaround? Thanks
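
The same breakdown by FD type can also be pulled straight from /proc rather than lsof, roughly like this (a rough sketch; the PID is a placeholder and the readlink targets correspond to the lsof TYPE column):

import os
from collections import Counter

PID = 12345  # placeholder: a neutron-server rpc_worker PID
fd_dir = f"/proc/{PID}/fd"

counts = Counter()
for fd in os.listdir(fd_dir):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue  # the fd was closed between listdir() and readlink()
    # Targets look like "socket:[123]", "anon_inode:[eventpoll]", "pipe:[456]" or a file path
    if target.startswith("anon_inode:"):
        counts["a_inode " + target.split(":", 1)[1]] += 1
    elif target.startswith("socket:"):
        counts["sock"] += 1
    elif target.startswith("pipe:"):
        counts["pipe"] += 1
    else:
        counts["file"] += 1

for kind, n in counts.most_common():
    print(f"{n:6d}  {kind}")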

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

I believe this is fixed in amqp v5.0.8 via https://github.com/celery/py-amqp/commit/f4fd4f952dded9ea1006e82642f2c15008bb68d3. I suspect that an exception is thrown during connection cleanup for TLS connections (in sock.unwrap(), which is why we didn't see it before with non-TLS), leaving a dangling socket. We've not seen any logging of this exception, so it must have been caught elsewhere.
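
To illustrate the pattern I mean (a simplified sketch, not the actual py-amqp code): if the TLS teardown looks like the first function below, any exception from unwrap() skips the close() and the descriptor lingers; the fix is to make sure the close happens regardless.

import ssl

def close_tls_leaky(sock: ssl.SSLSocket) -> None:
    # Sketch of the suspected problem: unwrap() performs the TLS shutdown
    # handshake and can raise (e.g. if the peer has already gone away),
    # so close() is never reached and the descriptor is left dangling.
    sock = sock.unwrap()
    sock.close()

def close_tls_safe(sock: ssl.SSLSocket) -> None:
    # Sketch of the defensive pattern: always close the socket, even when
    # the TLS shutdown handshake fails.
    try:
        sock = sock.unwrap()
    except OSError:  # ssl.SSLError is a subclass of OSError
        pass
    finally:
        sock.close()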

Upper constraints suggest the fixed version should be available from Yoga onwards, but the upper pin for amqp in Xena and earlier means you could hit this issue there if using TLS.
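
To check whether a given service environment already has the fix, something like this run with the service's Python (e.g. inside its venv) is enough:

from importlib.metadata import version  # Python 3.8+

installed = version("amqp")
print("amqp", installed, "(the sock.unwrap() fix landed in 5.0.8)")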

A workaround should be to increase 'conn_pool_min_size' to be closer to (or equal to) 'rpc_conn_pool_size' so that connections are never dynamically closed in normal operation, although this has the downside of requiring a lot more connections to RabbitMQ by default.
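
For example (illustrative values; if I remember right these options live in [DEFAULT], with rpc_conn_pool_size defaulting to 30 and conn_pool_min_size to 2):

[DEFAULT]
# Keep the pool at its maximum size so idle connections are never
# expired and re-created
rpc_conn_pool_size = 30
conn_pool_min_size = 30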

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Resolved via requirements updates in https://review.opendev.org/c/openstack/requirements/+/823104 (Xena) and https://review.opendev.org/c/openstack/requirements/+/823350 (Wallaby). This may also impact users of TLS in Victoria and earlier, but I don't have a deployment to verify this.

Revision history for this message
Hang Yang (hangyang) wrote :

Thanks a lot, Andrew. I have tried amqp v5.0.8 in my deployment (Wallaby) and it did fix the socket leak issue. However, monitoring the neutron-server rpc-worker processes with lsof, I can see that the a_inode (eventpoll) file descriptors are still leaking. I'm wondering if you are seeing a similar problem? Not sure if something else has been missed.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Yes, I can still see those increasing. I haven't yet been able to determine whether this is a leak or a genuine requirement given the rate at which they're appearing.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

The remaining 'eventpoll' FD issue appears to be resolved by adding the following configuration to the affected services. This has the same cause as https://bugs.launchpad.net/oslo.messaging/+bug/1934937.

[oslo_messaging_rabbit]
heartbeat_in_pthread = False

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Thanks for this troubleshooting! Saved me some hours of headache :)

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Just for posterity: on Xena, with the newer oslo.messaging version where heartbeat_in_pthread defaults to true, we changed it back to false and we don't have the issue anymore.
