File descriptor leak in RabbitMQ connection pool

Bug #1949964 reported by Andrew Bonney
This bug affects 7 people
Affects: oslo.messaging
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

# Versions
- oslo.messaging 12.7.1
- rabbitmq 3.8.14
- ubuntu 20.04

Hi,
We're observing a file descriptor leak which we've managed to track down as far as the oslo.messaging connection pool. I'm not sure this is quite where the issue lies, but I'd appreciate any suggestions for how to debug it further.

We originally noticed nova-compute services running out of file descriptors. After turning up debug logging and watching for times when the FD count increased or decreased, we've now established a link between those changes and the following log messages:

# Drops are associated with:
Nov 05 08:57:04 **** nova-compute[2708296]: 2021-11-05 08:57:04.574 2708296 DEBUG oslo.messaging._drivers.pool [-] Idle connection has expired and been closed. Pool size: 2 _on_expire /openstack/venvs/nova-23.1.1/lib/python3.8/site-packages/oslo_messaging/_drivers/pool.py:141

# Increases are associated with:
Nov 05 09:17:33 **** nova-compute[2708296]: 2021-11-05 09:17:33.333 2708296 DEBUG oslo.messaging._drivers.pool [-] Pool creating new connection create /openstack/venvs/nova-23.1.1/lib/python3.8/site-packages/oslo_messaging/_drivers/pool.py:145

The services in question appear to create a new connection (presumably taking the pool count up to 3 - we're using the defaults), before later closing that connection and returning the pool size to 2. When the connection count increases, we see file descriptor counts jump by approximately 30, but when connections are closed they only drop by approximately 5. As a result, over time this count continues to increase until we have to restart services.
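
For anyone wanting to reproduce the correlation, a minimal sketch of watching a process's FD count alongside the debug logs (the PID and polling interval below are just placeholders, and you need to run it as the same user or root):

import os
import time

PID = 2708296    # placeholder: the nova-compute PID from the log lines above
INTERVAL = 10    # seconds between samples

prev = None
while True:
    # Every entry in /proc/<pid>/fd is one open file descriptor
    count = len(os.listdir(f"/proc/{PID}/fd"))
    if prev is not None and count != prev:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), f"FD count {prev} -> {count}")
    prev = count
    time.sleep(INTERVAL)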

Whilst we haven't previously monitored the connection pool behaviour, this leak only seems to have started after upgrading from OpenStack Victoria to Wallaby (oslo.messaging 12.5.2 -> 12.7.1), which also involved changing the ssl_version used for RabbitMQ from TLS v1.0 to v1.2.

If there is anything you could suggest to diagnose this further I'd appreciate it.

Thanks

Revision history for this message
Hang Yang (hangyang) wrote :

Hi Andrew,

We're facing a similar issue in our deployment. We also recently upgraded (from Train to Wallaby) and started using RabbitMQ TLS v1.2. We've noticed that neutron-server (rpc_worker) is leaking file descriptors, and we have to restart the service occasionally to avoid reaching the FD limit on the controller node. Most of the leaked FDs are of the a_inode (eventpoll) and sock (tcp) types, but the root cause is still unknown. May I know if you've made any progress on the troubleshooting or found any workaround? Thanks
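
The same breakdown by FD type can also be pulled straight from /proc rather than lsof, roughly like this (a rough sketch; the PID is a placeholder and the readlink targets correspond to the lsof TYPE column):

import os
from collections import Counter

PID = 12345  # placeholder: a neutron-server rpc_worker PID
fd_dir = f"/proc/{PID}/fd"

counts = Counter()
for fd in os.listdir(fd_dir):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue  # the fd was closed between listdir() and readlink()
    # Targets look like "socket:[123]", "anon_inode:[eventpoll]", "pipe:[456]" or a file path
    if target.startswith("anon_inode:"):
        counts["a_inode " + target.split(":", 1)[1]] += 1
    elif target.startswith("socket:"):
        counts["sock"] += 1
    elif target.startswith("pipe:"):
        counts["pipe"] += 1
    else:
        counts["file"] += 1

for kind, n in counts.most_common():
    print(f"{n:6d}  {kind}")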

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

I believe this is fixed in amqp v5.0.8 via https://github.com/celery/py-amqp/commit/f4fd4f952dded9ea1006e82642f2c15008bb68d3. I suspect that an exception is thrown during connection cleanup for TLS connections (in sock.unwrap(), which is why we didn't see it before with non-TLS), leaving a dangling socket. We've not seen any logging of this exception, so it must have been caught elsewhere.
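
To illustrate the pattern I mean (a simplified sketch, not the actual py-amqp code): if the TLS teardown looks like the first function below, any exception from unwrap() skips the close() and the descriptor lingers; the fix is to make sure the close happens regardless.

import ssl

def close_tls_leaky(sock: ssl.SSLSocket) -> None:
    # Sketch of the suspected problem: unwrap() performs the TLS shutdown
    # handshake and can raise (e.g. if the peer has already gone away),
    # so close() is never reached and the descriptor is left dangling.
    sock = sock.unwrap()
    sock.close()

def close_tls_safe(sock: ssl.SSLSocket) -> None:
    # Sketch of the defensive pattern: always close the socket, even when
    # the TLS shutdown handshake fails.
    try:
        sock = sock.unwrap()
    except OSError:  # ssl.SSLError is a subclass of OSError
        pass
    finally:
        sock.close()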

Upper constraints suggest the fixed version should be available from Yoga onwards, but the upper pin for amqp in Xena and earlier means you could hit this issue there if using TLS.
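
To check whether a given service environment already has the fix, something like this run with the service's Python (e.g. inside its venv) is enough:

from importlib.metadata import version  # Python 3.8+

installed = version("amqp")
print("amqp", installed, "(the sock.unwrap() fix landed in 5.0.8)")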

A workaround should be to increase 'conn_pool_min_size' to be closer to (or equal to) 'rpc_conn_pool_size' so that connections are never dynamically closed in normal operation, although this has the downside of requiring a lot more connections to RabbitMQ by default.
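
For example (illustrative values; if I remember right these options live in [DEFAULT], with rpc_conn_pool_size defaulting to 30 and conn_pool_min_size to 2):

[DEFAULT]
# Keep the pool at its maximum size so idle connections are never
# expired and re-created
rpc_conn_pool_size = 30
conn_pool_min_size = 30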

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Resolved via requirements updates in https://review.opendev.org/c/openstack/requirements/+/823104 (Xena) and https://review.opendev.org/c/openstack/requirements/+/823350 (Wallaby). This may also impact users of TLS in Victoria and earlier, but I don't have a deployment to verify this.

Revision history for this message
Hang Yang (hangyang) wrote :

Thanks a lot, Andrew. I have tried amqp v5.0.8 in my deployment (Wallaby) and it did fix the socket leak issue. However, monitoring the neutron-server rpc-worker processes with lsof, I can see that the a_inode (eventpoll) file descriptors are still leaking. I'm wondering if you are seeing a similar problem? Not sure if something else has been missed.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Yes, I can still see those increasing. I haven't yet been able to determine whether this is a leak or a genuine requirement given the rate at which they're appearing.

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

The remaining 'eventpoll' FD issue appears to be resolved by adding the following configuration to the affected services. This has the same cause as https://bugs.launchpad.net/oslo.messaging/+bug/1934937.

[oslo_messaging_rabbit]
heartbeat_in_pthread = False

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Thanks for this troubleshooting! Saved me some hours of headache :)

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Just for posterity: on Xena, with the newer oslo.messaging version where heartbeat_in_pthread defaults to true, we changed it back to false and we don't have the issue anymore.
