File descriptor leak in RabbitMQ connection pool
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
oslo.messaging | New | Undecided | Unassigned |
Bug Description
# Versions
- oslo.messaging 12.7.1
- rabbitmq 3.8.14
- ubuntu 20.04
Hi,
We're observing a file descriptor leak which we've managed to track down as far as the oslo.messaging connection pool. I'm not sure this is quite where the issue lies, but I'd appreciate any suggestions for how to debug it further.
We originally noticed nova-compute services running out of file descriptors. After enabling debug logging and watching for times when the FD count increased or decreased, we've established a link between those changes and the following log messages:
# Drops are associated with:
Nov 05 08:57:04 **** nova-compute[
# Increases are associated with:
Nov 05 09:17:33 **** nova-compute[
The services in question appear to create a new connection (presumably taking the pool count up to 3; we're using the defaults) before later closing that connection and returning the pool size to 2. When the connection count increases, we see the file descriptor count jump by approximately 30, but when a connection is closed it only drops by approximately 5, leaving roughly 25 descriptors behind each time. As a result, the count keeps growing over time until we have to restart the services.
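For anyone wanting to reproduce the measurement, the jumps described above can be captured by sampling the process's FD count and recording only the changes, which can then be lined up against log timestamps. This is a minimal sketch, assuming Linux with `/proc` available and a known PID; the function names are hypothetical helpers and not part of oslo.messaging or nova.

```python
import os
import time


def count_fds(pid):
    """Return the number of open file descriptors for pid (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_deltas(samples):
    """Given [(timestamp, count), ...], return [(timestamp, change), ...].

    Only samples where the count differs from the previous sample are kept.
    """
    deltas = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur != prev:
            deltas.append((ts, cur - prev))
    return deltas


def watch(pid, interval=5.0, iterations=12):
    """Sample the FD count every `interval` seconds and return the changes."""
    samples = []
    for _ in range(iterations):
        samples.append((time.time(), count_fds(pid)))
        time.sleep(interval)
    return fd_deltas(samples)
```

Calling `watch(pid)` with the nova-compute PID returns a list of timestamped changes; reading `/proc/<pid>/fd` for another user's process generally requires root.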
Whilst we haven't previously monitored the connection pool behaviour, this leak only seems to have started after we upgraded from OpenStack Victoria to Wallaby (oslo.messaging 12.5.2 -> 12.7.1) and, as part of that upgrade, changed the ssl_version used for RabbitMQ from TLS v1.0 to TLS v1.2.
If there is anything you could suggest to diagnose this further I'd appreciate it.
Thanks
Hi Andrew,
We're facing a similar issue in our deployment. We also recently upgraded from Train to Wallaby and started using TLS v1.2 for RabbitMQ. We noticed that neutron-server (rpc_worker) is leaking file descriptors, and we have to restart the service occasionally to avoid reaching the FD limit on the controller node. We found that most of the leaked FDs are of type a_inode (eventpoll) and sock (TCP), but the root cause is still unknown. Have you made any progress on troubleshooting this, or found a workaround? Thanks
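A breakdown by descriptor type like the one above can also be taken without lsof by reading the symlink targets under `/proc/<pid>/fd`. This is a rough sketch, assuming Linux; the helper names are hypothetical, and the grouping is deliberately coarse (it does not distinguish TCP from other socket types).

```python
import os
from collections import Counter


def classify_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("anon_inode:"):
        # Keep the full name, e.g. "anon_inode:[eventpoll]".
        return target
    if target.startswith("pipe:"):
        return "pipe"
    return "file"


def summarize(targets):
    """Count descriptors per type for a list of symlink targets."""
    return Counter(classify_target(t) for t in targets)


def read_fd_targets(pid):
    """Return the symlink targets of all open descriptors of pid."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Comparing `summarize(read_fd_targets(pid))` taken before and after a connection is opened and closed should show which types are left behind.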