octavia/ovn: filedescriptor out of range in select()

Bug #2063043 reported by Kurt Garloff
This bug affects 2 people
Affects: neutron
Status: New
Importance: High
Assigned to: Fernando Royo

Bug Description

Running octavia-api in a container (kolla-ansible), where a health-monitoring process creates OVN provider load balancers with listeners, pools, members, and health monitors, tests them, and cleans them up again.
Upon creation of resources, the octavia-api process creates `[eventpoll]` file descriptors (via the `epoll_create1()` system call) which it never closes. Once the fd numbers hit 1024, it errors out with
```
 2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.: ValueError: filedescriptor out of range in select()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn Traceback (most recent call last):
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 65, in start_connection
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn self.ovsdb_connection.start()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 83, in start
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn idlutils.wait_for_change(self.idl, self.timeout)
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 252, in wait_for_change
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn ovs_poller.block()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/usr/lib/python3/dist-packages/ovs/poller.py", line 231, in block
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn events = self.poll.poll(self.timeout)
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/usr/lib/python3/dist-packages/ovs/poller.py", line 137, in poll
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn rlist, wlist, xlist = select.select(self.rlist,
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn ValueError: filedescriptor out of range in select()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn
2024-04-09 03:00:12.132 732 ERROR octavia.api.drivers.driver_factory [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Unable to load provider driver ovn due to: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.: ovn_octavia_provider.ovsdb.impl_idl_ovn.OvsdbConnectionUnavailable: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.
2024-04-09 03:00:12.134 732 ERROR wsme.api [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Server-side error: "Provider 'ovn' was not found.". Detail:
Traceback (most recent call last):
[...]
```
Subsequently, the ovn provider is no longer registered from an octavia-api perspective, and the container goes into an unhealthy state and needs to be restarted.
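For reference, the traceback above fails in `select.select()`, which rejects any fd *number* at or above FD_SETSIZE (1024 on Linux) even when only a handful of fds are actually open, whereas `poll()`/`epoll()` have no such limit. A minimal standalone reproduction (assumes the process is allowed to raise its soft fd limit to at least 2048):

```python
import os
import resource
import select

# select() rejects fd numbers >= FD_SETSIZE (1024 on Linux) regardless of
# how many fds are open; poll() handles the same fd without complaint.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 2048 if hard == resource.RLIM_INFINITY else min(hard, 2048)
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))

r, w = os.pipe()
high = os.dup2(r, 2000)              # move the pipe's read end to fd 2000

msg = None
try:
    select.select([high], [], [], 0)
except ValueError as err:
    msg = str(err)
print(msg)                           # filedescriptor out of range in select()

poller = select.poll()               # poll() copes with the same high fd
poller.register(high, select.POLLIN)
events = poller.poll(0)              # no error; the pipe is just not readable

for fd in (r, w, high):
    os.close(fd)
```

This is why the leak only manifests once the highest fd number crosses 1024: the leaked `[eventpoll]` and socket fds push ordinary connection fds past the `select()` limit.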

This was observed on octavia from OpenStack 2023.2 (Bobcat) installed via kolla-ansible (OSISM).

Original bug report at
https://github.com/osism/issues/issues/959

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Moving to neutron; the ovn-octavia-provider is a neutron project.

affects: octavia → neutron
tags: added: ovn-octavia-provider
Revision history for this message
Kurt Garloff (kgarloff) wrote :

Upon adding a new LB member, I get two new [eventpoll] FDs, which never get cleaned up. (Removing the members does not help.)
I'm not sure whether those should be explicitly closed in the code before new ones are opened, or whether some thread should just exit to get rid of them.
Anyhow, here is how lsof sees it after some resources have been created and deleted:
[...]
apache2 2625091 octavia 21u IPv4 880892680 0t0 TCP api.in-a-box.cloud:37214->api.in-a-box.cloud:5000 (CLOSE_WAIT)
apache2 2625091 octavia 22u IPv4 880928052 0t0 TCP api.in-a-box.cloud:53454->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 2625091 octavia 23r FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 24w FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 25u sock 0,8 0t0 880607182 protocol: TCP
apache2 2625091 octavia 26u IPv4 880893583 0t0 TCP api.in-a-box.cloud:39068->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 27u sock 0,8 0t0 880607189 protocol: TCP
apache2 2625091 octavia 28u sock 0,8 0t0 880732423 protocol: TCP
apache2 2625091 octavia 29u sock 0,8 0t0 880724669 protocol: TCP
apache2 2625091 octavia 30u sock 0,8 0t0 880716764 protocol: TCP
apache2 2625091 octavia 31u sock 0,8 0t0 880644259 protocol: TCP
apache2 2625091 octavia 32u IPv4 880939067 0t0 TCP api.in-a-box.cloud:50304->api.in-a-box.cloud:9696 (ESTABLISHED)
apache2 2625091 octavia 33u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 34u sock 0,8 0t0 880823397 protocol: TCP
apache2 2625091 octavia 35u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 36u IPv4 880893589 0t0 TCP api.in-a-box.cloud:39094->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 37u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 38u IPv4 880891827 0t0 TCP api.in-a-box.cloud:39140->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 39u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 40u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 41u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 42u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 43u IPv4 880886692 0t0 TCP api.in-a-box.cloud:54838->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 44u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 45u IPv4 880895843 0t0 TCP api.in-a-box.cloud:54860->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 46u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 48u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octa...
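The `[eventpoll]` entries in the lsof output are epoll instances created with `epoll_create1()`; in Python they stay open until `close()` is called (or the process exits). A small Linux-only sketch, counting fds via `/proc/self/fd` (an assumption about the platform):

```python
import os
import select

def open_fds():
    # Linux-specific: each entry under /proc/self/fd is one open descriptor.
    return len(os.listdir('/proc/self/fd'))

baseline = open_fds()
pollers = [select.epoll() for _ in range(5)]   # five new [eventpoll] fds
grew = open_fds() - baseline                   # grew == 5

for ep in pollers:
    ep.close()                                 # explicit close releases the fd
shrunk = open_fds()                            # back to the baseline
```

So if the provider (or a library below it) creates epoll objects per operation and drops the references without closing them, the fds linger exactly as seen above.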


Changed in neutron:
assignee: nobody → Fernando Royo (froyoredhat)
Changed in neutron:
importance: Undecided → High
Revision history for this message
Fernando Royo (froyoredhat) wrote :
Revision history for this message
Kurt Garloff (kgarloff) wrote :

I have tried to find where things go wrong, but thus far no success.
We don't seem to leak threads, at least; just the epoll file descriptors that we then wait on with select. But I could not yet find the place where we miss the cleanup. I have been looking through the ovn-octavia-provider, ovsdbapp, and ovs Python code (in Bobcat) without luck so far. I guess I'll have to add some trace logging statements.
Help is definitely welcome.

Revision history for this message
Kurt Garloff (kgarloff) wrote (last edit ):

Whenever the octavia OVN helper talks to neutron, it seems to create a new TCP connection to the neutron service endpoint along with an eventpoll watcher, without closing the file descriptor or the watcher of the old connection. The singleton construction in common/clients.py does not seem to prevent this.
(Is this an SDK bug that may have nothing to do with OVN octavia in the end?)
Correction: *some* of the TCP sockets do seem to get reused:
```
apache2 3420779 octavia 20u IPv4 515704480 0t0 TCP manager.systems.in-a-box.cloud:40470->manager.systems.in-a-box.cloud:11211 (ESTABLISHED)
apache2 3420779 octavia 21u IPv4 516310909 0t0 TCP api.in-a-box.cloud:50690->api.in-a-box.cloud:5000 (ESTABLISHED)
apache2 3420779 octavia 22u IPv4 516273896 0t0 TCP api.in-a-box.cloud:56094->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 3420779 octavia 23r FIFO 0,14 0t0 515818098 pipe
apache2 3420779 octavia 24w FIFO 0,14 0t0 515818098 pipe
apache2 3420779 octavia 25u sock 0,8 0t0 515818102 protocol: TCP
apache2 3420779 octavia 26u IPv4 516182592 0t0 TCP api.in-a-box.cloud:47694->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 27u sock 0,8 0t0 515818129 protocol: TCP
apache2 3420779 octavia 28u sock 0,8 0t0 515820532 protocol: TCP
apache2 3420779 octavia 29u sock 0,8 0t0 515836380 protocol: TCP
apache2 3420779 octavia 30u sock 0,8 0t0 515895964 protocol: TCP
apache2 3420779 octavia 31u sock 0,8 0t0 515833072 protocol: TCP
apache2 3420779 octavia 32u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 33u IPv4 516168640 0t0 TCP api.in-a-box.cloud:47720->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 34u sock 0,8 0t0 516016283 protocol: TCP
apache2 3420779 octavia 35u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 36u sock 0,8 0t0 516011446 protocol: TCP
apache2 3420779 octavia 37u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 38u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 39u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 40u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 41u IPv4 516172616 0t0 TCP api.in-a-box.cloud:47714->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 42u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 43u IPv4 516188638 0t0 TCP api.in-a-box.cloud:47768->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 44u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 46u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 47u a_inode...
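For illustration, the per-process caching one would expect from a singleton in common/clients.py looks roughly like this (names and structure are my sketch, not the actual module): build the client once under a lock and hand the same instance back, so no new connection plus eventpoll watcher is opened per request.

```python
import threading

_client = None
_client_lock = threading.Lock()

def get_client(factory):
    """Return a cached client, invoking factory() only on first use."""
    global _client
    with _client_lock:
        if _client is None:
            _client = factory()
        return _client
```

If the real code instead re-creates the client (or the session underneath it) on some code path, each call leaks one connection fd plus one epoll fd, matching the lsof growth above.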


Revision history for this message
Kurt Garloff (kgarloff) wrote :

This test script causes my octavia-api service (Bobcat via kolla-ansible) to leak file descriptors at a rate of up to ~30/min. Interestingly, the behavior is bursty. I see lots of pipe file descriptors in the lsof listings. Many (all?) of the pipes go away after a while, so cleanup does happen for them after a few minutes. Once we start creating many pipes again and get near the bottom of the fd list, the highest FD starts increasing again.

Revision history for this message
Kurt Garloff (kgarloff) wrote :

A graph showing how the script kills octavia/ovn with two octavia-api workers.
One can see that the pipe FDs are regularly cleaned up, while the FDs for long-gone connections (socket) and the FDs for eventpoll keep increasing linearly. (We see the sum from the two octavia-api processes.)
After a pipe cleanup, the kernel first reuses the freed FDs before the highest FD needs to grow further. Misc covers unused FDs and the handful of FDs (e.g. regular files) that don't fit the classification we have used here:
* PIPE (aka FIFO)
* Epoll (event_poll)
* Sock (FDs for sockets that have been closed)
* Conn (FDs for sockets that are active or in close_wait)
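The classification above can be reproduced with a rough lsof-line classifier (the matching rules are my assumptions based on the listings in this bug, not part of any tooling attached here):

```python
import re

def classify(lsof_line: str) -> str:
    """Bucket one lsof output line into the graph's FD categories."""
    if 'FIFO' in lsof_line:
        return 'PIPE'                 # pipe endpoints
    if '[eventpoll]' in lsof_line:
        return 'Epoll'                # epoll_create1() instances
    if re.search(r'\bsock\b', lsof_line):
        return 'Sock'                 # socket fd whose connection is gone
    if 'TCP' in lsof_line or 'UDP' in lsof_line:
        return 'Conn'                 # active or CLOSE_WAIT connection
    return 'Misc'                     # regular files, unused, etc.
```

Feeding `lsof -p <pid>` output through this per line and counting the buckets over time yields the same curves as the graph.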

Revision history for this message
Fernando Royo (froyoredhat) wrote :

Hi,

It looks like the latest patches on the master branch (and their backports to the stable branches) have drastically reduced the impact of this issue.

I had your script running for several hours and it was working fine; I also used two workers. Still, I eventually see some FDs reaching 1024, resulting in a 500 ERROR reply to the Octavia request to create a member. I will try to detect where this FD is left over. Meanwhile, Kurt, maybe you could re-run your script and produce the same graph to confirm my results. I will post an update as soon as I can identify why that polling is not reusing the FD and instead creates a new one in every worker from time to time.

Revision history for this message
Kurt Garloff (kgarloff) wrote :

If you have a pointer to the patches (on master or the backports), that would be really appreciated.
That way I can run tests with and without them and see whether they improve or possibly even fix things for my setup.

Revision history for this message
Fernando Royo (froyoredhat) wrote :

Hi Kurt,

Apologies, I managed to collect some strace logs but haven't had time to review them yet. The patch I mentioned is mainly this one: https://review.opendev.org/c/openstack/ovn-octavia-provider/+/919229.

I've also attached the logs from the strace output and octavia-api for the same short period, where we can see the file descriptors (FDs) increasing, which reproduces the behavior I described in comment #8.

I’ll update you once I have some time to analyze them.

Revision history for this message
Kurt Garloff (kgarloff) wrote :

Hi Fernando,

thanks for following up!

Here's what I observe with Caracal from the stable branch (or, to be more precise, the quay.io/osism/octavia-api:14.0.1.20240909 kolla-ansible container):
It now takes 3 h 20 min to exhaust the file descriptors, and the only FDs that leak are sockets.

And netstat shows an interesting picture of which sockets are leaked:
[...]
tcp 0 0 192.168.16.10:953 0.0.0.0:* LISTEN 0 17308399 1657097/named
tcp 0 0 192.168.16.10:953 0.0.0.0:* LISTEN 0 17308398 1657097/named
tcp 0 0 192.168.16.10:953 0.0.0.0:* LISTEN 0 17308397 1657097/named
[...]
tcp 0 0 192.168.16.10:53 0.0.0.0:* LISTEN 0 17293599 1657097/named
tcp 0 0 192.168.16.10:53 0.0.0.0:* LISTEN 0 17293601 1657097/named
tcp 0 0 192.168.16.10:53 0.0.0.0:* LISTEN 0 17293600 1657097/named
[...]
tcp6 0 0 fe80::f4f5:3aff:fe85:53 :::* LISTEN 0 17309322 1657097/named
tcp6 0 0 fe80::90ef:d6ff:fe61:53 :::* LISTEN 0 17295100 1657097/named
[...]
tcp6 0 0 fe80::ccea:9ff:fe88::53 :::* LISTEN 0 17307884 1657097/named
tcp6 0 0 fe80::3422:a3ff:fe03:53 :::* LISTEN 0 17285977 1657097/named
[...]
tcp6 0 0 fe80::5800:24ff:fe45:53 :::* LISTEN 0 17982335 1657097/named
tcp6 0 0 fe80::42:f4ff:fe4f:7:53 :::* LISTEN 0 17307789 1657097/named
[...]
udp 0 0 192.168.16.10:53 0.0.0.0:* 0 17293555 1657097/named
[...]
udp6 0 0 ::1:53 :::* 0 17294629 1657097/named
[...]
udp6 0 0 2a0a:a546:1a48:ff::6:53 :::* 0 17294706 1657097/named
udp6 0 0 2a0a:a546:1a48:ff:3e:53 :::* 0 17279750 1657097/named
[...]

dragon@manager:~$ sudo netstat -tulpen | grep 'named' | wc -l
4320

Revision history for this message
Kurt Garloff (kgarloff) wrote (last edit ):

And the named processes come from the octavia_backend_bind9 container.
So the octavia-designate integration creates lots of FDs in designate?
(But read on: this is not the leak we are looking for!)

Revision history for this message
Kurt Garloff (kgarloff) wrote :

The vast number of named sockets is staggering.
However, the apache2-wsgi/octavia-api process really does create new TCP connections to neutron all the time.
A snapshot looks like this:
apache2 3738266 octavia 38u sock 0,8 0t0 52635849 protocol: TCP
apache2 3738266 octavia 39u sock 0,8 0t0 52626327 protocol: TCP
apache2 3738266 octavia 40u IPv4 52669464 0t0 TCP api.in-a-box.cloud:38020->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 41u IPv4 52649651 0t0 TCP api.in-a-box.cloud:38036->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 42u IPv4 52682826 0t0 TCP api.in-a-box.cloud:4>api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 43u IPv4 52681321 0t0 TCP api.in-a-box.cloud:50974->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 44u IPv4 52694380 0t0 TCP api.in-a-box.cloud:40690->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 45u IPv4 52678498 0t0 TCP api.in-a-box.cloud:57752->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3738266 octavia 46u IPv4 52698415 0t0 TCP api.in-a-box.cloud:57782->api.in-a-box.cloud:9696 (ESTABLISHED)

One active connection (the last one), a few in CLOSE_WAIT (which is OK), but then a number of sockets whose connections are long gone, yet somehow the FDs never get closed. Perhaps a connection-pooling optimization that keeps them around for reuse but never garbage-collects them?
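As a side note on the CLOSE_WAIT entries: that state simply means the peer has closed its side and our side has not yet called close(), so a leaked fd pins the socket in CLOSE_WAIT forever. A minimal loopback demonstration:

```python
import socket
import time

# Build a real TCP connection over loopback, then close only the peer side.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

conn.close()              # peer closes; cli now sits in CLOSE_WAIT
time.sleep(0.1)
eof = cli.recv(1)         # b'' is the EOF signalling the peer's close

cli.close()               # without this, the fd (and the state) would leak
srv.close()
```

So CLOSE_WAIT entries by themselves are harmless; it is the fds that are never closed afterwards that accumulate.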

Revision history for this message
Kurt Garloff (kgarloff) wrote :

OK, taking an strace on octavia-api (and children, option -f) and filtering for socket(), connect() and close() calls, here's a pattern:

[pid 2482980] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 24
[pid 2482980] connect(24, {sa_family=AF_INET, sin_port=htons(9696), sin_addr=inet_addr("192.168.16.254")}, 16) = 0
[pid 2482980] close(104) = 0
[pid 2482980] close(104) = 0
[pid 2482980] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 104
[pid 2482980] connect(104, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("192.168.16.254")}, 16) = 0
[pid 2482980] close(105) = 0
[pid 2482980] close(105) = 0
[pid 2482980] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 105
[pid 2482980] connect(105, {sa_family=AF_INET, sin_port=htons(9696), sin_addr=inet_addr("192.168.16.254")}, 16) = 0
[pid 2482980] close(106) = 0
[pid 2482980] socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) = 106
[pid 2482980] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 107
[pid 2482980] close(107) = 0
[pid 2482980] socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 107
[pid 2482980] close(107) = 0
[pid 2482980] close(106) = 0
[pid 2482980] close(105) = 0
[pid 2482980] close(104) = 0
[pid 2482980] close(104) = 0

THIS ONE NEVER GETS CLOSED:
[pid 2482980] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 104
[pid 2482980] connect(104, {sa_family=AF_INET, sin_port=htons(9696), sin_addr=inet_addr("192.168.16.254")}, 16) = 0

[pid 2482980] close(105) = 0
[pid 2482980] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_IP) = 105
[pid 2482980] connect(105, {sa_family=AF_INET, sin_port=htons(6641), sin_addr=inet_addr("192.168.16.10")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 2482980] close(105) = 0
[pid 2482980] socket(AF_UNIX, SOCK_STREAM, 0) = 105
[pid 2482980] connect(105, {sa_family=AF_UNIX, sun_path="/run/uuidd/request"}, 110) = -1 ENOENT (No such file or directory)
[pid 2482980] close(105) = 0
[pid 2482980] socket(AF_UNIX, SOCK_STREAM, 0) = 105
[pid 2482980] connect(105, {sa_family=AF_UNIX, sun_path="/run/uuidd/request"}, 110) = -1 ENOENT (No such file or directory)
[pid 2482980] close(105) = 0
[pid 2482980] close(105) = 0
[pid 2482980] close(106) = 0
[pid 1399057] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 105
[pid 1399057] connect(105, {sa_family=AF_UNIX, sun_path="/var/run/octavia/get.sock"}, 28) = 0
[pid 1399057] close(105) = 0
[pid 1399057] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 105
[pid 1399057] connect(105, {sa_family=AF_UNIX, sun_path="/var/run/octavia/get.sock"}, 28) = 0
[pid 1399057] close(105) = 0
[pid 1399057] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 105
[pid 1399057] connect(105, {sa_family=AF_UNIX, sun_path="/var/run/octavia/get.sock"}, 28) = 0
[pid 1399057] close(105) = 0
[pid 1399057] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 105
[pid 1399057] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 105
[pid 1399...


Revision history for this message
Kurt Garloff (kgarloff) wrote :

See also https://bugs.launchpad.net/ovsdbapp/+bug/2093247 which seems to be a duplicate of this one.
