octavia/ovn: filedescriptor out of range in select()

Bug #2063043 reported by Kurt Garloff
Affects: neutron
Status: New
Importance: High
Assigned to: Fernando Royo

Bug Description

Running octavia-api in a container (kolla-ansible), where a health-monitoring process creates OVN provider load balancers with listeners, pools, members, and health monitors, tests them, and cleans them up again.
Upon creation of resources, the octavia-api process creates `[eventpoll]` file descriptors (via the `epoll_create1()` system call) which it never closes. Once the file descriptor numbers reach 1024, it errors out with
```
 2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.: ValueError: filedescriptor out of range in select()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn Traceback (most recent call last):
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 65, in start_connection
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn self.ovsdb_connection.start()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 83, in start
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn idlutils.wait_for_change(self.idl, self.timeout)
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 252, in wait_for_change
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn ovs_poller.block()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/usr/lib/python3/dist-packages/ovs/poller.py", line 231, in block
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn events = self.poll.poll(self.timeout)
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn File "/usr/lib/python3/dist-packages/ovs/poller.py", line 137, in poll
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn rlist, wlist, xlist = select.select(self.rlist,
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn ValueError: filedescriptor out of range in select()
2024-04-09 03:00:12.130 732 ERROR ovn_octavia_provider.ovsdb.impl_idl_ovn
2024-04-09 03:00:12.132 732 ERROR octavia.api.drivers.driver_factory [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Unable to load provider driver ovn due to: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.: ovn_octavia_provider.ovsdb.impl_idl_ovn.OvsdbConnectionUnavailable: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.
2024-04-09 03:00:12.134 732 ERROR wsme.api [None req-eefd1e9d-8bfe-473d-9ff1-5a8b0d4ab5d2 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Server-side error: "Provider 'ovn' was not found.". Detail:
Traceback (most recent call last):
[...]
```
Subsequently, the ovn provider is no longer registered from the octavia-api perspective, and the container goes into an unhealthy state and needs to be restarted.

This was observed on octavia from OpenStack 2023.2 (Bobcat) installed via kolla-ansible (OSISM).

Original bug report at
https://github.com/osism/issues/issues/959
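
For context on the error itself (not specific to Octavia): Python's select.select() rejects any file descriptor number at or above FD_SETSIZE (1024 on Linux), so once the process has leaked enough descriptors that a new socket lands above that limit, the exact ValueError from the traceback appears in the select()-based poller shown above (ovs/poller.py). A minimal, stand-alone demonstration, assuming the process fd limit (ulimit -n) has been raised above ~1100:
```
# Stand-alone reproduction of the ValueError; no Octavia/OVN code involved.
# Assumes `ulimit -n` is above ~1100 so the dummy descriptors can stay open.
import os
import select
import socket

# Hold >1024 dummy descriptors open; this stands in for the leaked
# [eventpoll] descriptors accumulated by the octavia-api worker.
held = [os.open("/dev/null", os.O_RDONLY) for _ in range(1100)]

sock = socket.socket()  # this socket now gets an fd number >= 1024
try:
    select.select([sock], [], [], 0)
except ValueError as exc:
    print(exc)  # -> filedescriptor out of range in select()

# poll()/epoll() have no FD_SETSIZE limit, so only select()-based code
# paths break once fd numbers exceed 1023.
poller = select.poll()
poller.register(sock, select.POLLIN)
poller.poll(0)

sock.close()
for fd in held:
    os.close(fd)
```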

Revision history for this message
Gregory Thiemonge (gthiemonge) wrote :

Moving to neutron; the ovn-octavia-provider is a neutron project.

affects: octavia → neutron
tags: added: ovn-octavia-provider
Revision history for this message
Kurt Garloff (kgarloff) wrote :

Upon creating a new LB member, I get two new [eventpoll] FDs, which never get cleaned up. (Removing the members does not help.)
Not sure whether those should be explicitly closed in the code before new ones are opened, or whether some thread should just exit to get rid of them.
Anyhow, here is what lsof shows after some resources have been created and deleted (a small counting sketch follows the listing):
```
[...]
apache2 2625091 octavia 21u IPv4 880892680 0t0 TCP api.in-a-box.cloud:37214->api.in-a-box.cloud:5000 (CLOSE_WAIT)
apache2 2625091 octavia 22u IPv4 880928052 0t0 TCP api.in-a-box.cloud:53454->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 2625091 octavia 23r FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 24w FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 25u sock 0,8 0t0 880607182 protocol: TCP
apache2 2625091 octavia 26u IPv4 880893583 0t0 TCP api.in-a-box.cloud:39068->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 27u sock 0,8 0t0 880607189 protocol: TCP
apache2 2625091 octavia 28u sock 0,8 0t0 880732423 protocol: TCP
apache2 2625091 octavia 29u sock 0,8 0t0 880724669 protocol: TCP
apache2 2625091 octavia 30u sock 0,8 0t0 880716764 protocol: TCP
apache2 2625091 octavia 31u sock 0,8 0t0 880644259 protocol: TCP
apache2 2625091 octavia 32u IPv4 880939067 0t0 TCP api.in-a-box.cloud:50304->api.in-a-box.cloud:9696 (ESTABLISHED)
apache2 2625091 octavia 33u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 34u sock 0,8 0t0 880823397 protocol: TCP
apache2 2625091 octavia 35u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 36u IPv4 880893589 0t0 TCP api.in-a-box.cloud:39094->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 37u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 38u IPv4 880891827 0t0 TCP api.in-a-box.cloud:39140->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 39u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 40u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 41u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 42u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 43u IPv4 880886692 0t0 TCP api.in-a-box.cloud:54838->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 44u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 45u IPv4 880895843 0t0 TCP api.in-a-box.cloud:54860->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 46u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 48u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octa...
```
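
To verify the "two new [eventpoll] FDs per member" observation without wading through full lsof output, the descriptors can also be counted straight from /proc. A minimal sketch (Linux only; assumes permission to read the workers' /proc/<pid>/fd directories; not part of the provider code):
```
# Count anonymous eventpoll descriptors of a process via /proc/<pid>/fd.
import os
import sys

def count_eventpoll_fds(pid: int) -> int:
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed while we were iterating
        if target == "anon_inode:[eventpoll]":
            count += 1
    return count

if __name__ == "__main__":
    # usage: python3 count_eventpoll.py <pid> [<pid> ...]
    for pid in sys.argv[1:]:
        print(pid, count_eventpoll_fds(int(pid)))
```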


Changed in neutron:
assignee: nobody → Fernando Royo (froyoredhat)
Changed in neutron:
importance: Undecided → High
Revision history for this message
Fernando Royo (froyoredhat) wrote :
Revision history for this message
Kurt Garloff (kgarloff) wrote :

I have tried to find where things go wrong ... but no success so far.
We don't seem to leak threads, at least; just the epoll file descriptors that we then wait on with select(). But I could not yet find the place where the cleanup is missed. I have been looking through the ovn-octavia-provider, ovsdbapp, and ovs Python code (in Bobcat) without luck so far. I guess I'll have to add some tracing/logging statements (a sketch of one way to do that follows below) ...
Help is definitely welcome.
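
One low-effort way to add such tracing, sketched below under the assumption that the eventpoll descriptors are created from Python via select.epoll (rather than by a C extension, which would need strace -f -e trace=epoll_create1 instead): wrap select.epoll early in the worker so every new epoll object logs the stack that created it.
```
# Hypothetical tracing aid, not shipped by any of the projects involved:
# wrap select.epoll so every new epoll object prints the creating stack.
# Modules that did `from select import epoll` before this runs keep the
# original reference and are not traced.
import select
import traceback

_orig_epoll = select.epoll

def _traced_epoll(*args, **kwargs):
    traceback.print_stack()  # or hand this to the service's logger
    return _orig_epoll(*args, **kwargs)

select.epoll = _traced_epoll
```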

Revision history for this message
Kurt Garloff (kgarloff) wrote (last edit ):

Whenever the octavia OVN helper talks to neutron, it seems to create a new TCP connection to the neutron service endpoint, along with an eventpoll watcher, without either the file descriptor or the watcher of the old connection being closed. The singleton construction in common/clients.py does not seem to prevent this.
(Is this an SDK bug that may have nothing to do with OVN octavia in the end?)
Correction: *some* of the TCP sockets do seem to get reused:
```
apache2 3420779 octavia 20u IPv4 515704480 0t0 TCP manager.systems.in-a-box.cloud:40470->manager.systems.in-a-box.cloud:11211 (ESTABLISHED)
apache2 3420779 octavia 21u IPv4 516310909 0t0 TCP api.in-a-box.cloud:50690->api.in-a-box.cloud:5000 (ESTABLISHED)
apache2 3420779 octavia 22u IPv4 516273896 0t0 TCP api.in-a-box.cloud:56094->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 3420779 octavia 23r FIFO 0,14 0t0 515818098 pipe
apache2 3420779 octavia 24w FIFO 0,14 0t0 515818098 pipe
apache2 3420779 octavia 25u sock 0,8 0t0 515818102 protocol: TCP
apache2 3420779 octavia 26u IPv4 516182592 0t0 TCP api.in-a-box.cloud:47694->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 27u sock 0,8 0t0 515818129 protocol: TCP
apache2 3420779 octavia 28u sock 0,8 0t0 515820532 protocol: TCP
apache2 3420779 octavia 29u sock 0,8 0t0 515836380 protocol: TCP
apache2 3420779 octavia 30u sock 0,8 0t0 515895964 protocol: TCP
apache2 3420779 octavia 31u sock 0,8 0t0 515833072 protocol: TCP
apache2 3420779 octavia 32u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 33u IPv4 516168640 0t0 TCP api.in-a-box.cloud:47720->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 34u sock 0,8 0t0 516016283 protocol: TCP
apache2 3420779 octavia 35u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 36u sock 0,8 0t0 516011446 protocol: TCP
apache2 3420779 octavia 37u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 38u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 39u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 40u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 41u IPv4 516172616 0t0 TCP api.in-a-box.cloud:47714->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 42u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 43u IPv4 516188638 0t0 TCP api.in-a-box.cloud:47768->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 3420779 octavia 44u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 46u a_inode 0,15 0 64 [eventpoll]
apache2 3420779 octavia 47u a_inode...
```


Revision history for this message
Kurt Garloff (kgarloff) wrote :

This test script causes my octavia-api service (Bobcat via kolla-ansible) to leak file descriptors at a rate of up to ~30/min. Interestingly, the behavior is bursty. I see lots of pipe file descriptors in the lsof listings. Many (all?) of the pipes go away after a while, so cleanup does happen for them after a few minutes. Once many pipes have been created again and the freed slots near the bottom of the FD list are used up, the highest FD starts increasing again.

Revision history for this message
Kurt Garloff (kgarloff) wrote :

A graph showing how the script kills octavia/ovn with two octavia-api workers.
One can see that the pipe FDs are regularly cleaned up, while the FDs for long-gone connections (socket) and the FDs for eventpoll keep increasing linearly. (We see the sum over the two octavia-api processes.)
After a pipe cleanup, the kernel first reuses the freed FDs before the highest FD needs to grow further. Misc covers unused FDs and the handful of FDs (e.g. regular files) that do not fit the classification used here (a rough /proc-based bucketing sketch follows the list):
* PIPE (aka FIFO)
* Epoll (eventpoll)
* Sock (FDs for sockets that have been closed)
* Conn (FDs for sockets that are active or in close_wait)
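
For reproducing such a graph, the buckets can be approximated from /proc as well, extending the counting sketch in an earlier comment. A rough sketch (Linux only; splitting Sock from Conn exactly would additionally require matching the socket inode numbers against /proc/net/tcp*, which is omitted here, so the two are lumped together):
```
# Bucket a process's fds roughly like the graph's classification, via /proc.
import os
import sys
from collections import Counter

def classify_fds(pid: int) -> Counter:
    buckets = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd disappeared while we were iterating
        if target.startswith("pipe:"):
            buckets["PIPE"] += 1
        elif target == "anon_inode:[eventpoll]":
            buckets["Epoll"] += 1
        elif target.startswith("socket:"):
            buckets["Sock/Conn"] += 1  # orphaned and live sockets together
        else:
            buckets["Misc"] += 1
    return buckets

if __name__ == "__main__":
    # usage: python3 classify_fds.py <pid> [<pid> ...]
    for pid in sys.argv[1:]:
        print(pid, dict(classify_fds(int(pid))))
```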
