Losing access to instances via floating IPs

Bug #1744412 reported by Kieran Forde
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Ihar Hrachyshka
Milestone: (none)

Bug Description

Description of problem:
Neutron Floating IPs stop working, instances become unreachable.

Version-Release number of selected component (if applicable):
Ocata.

Neutron-related RPMs:
puppet-neutron-10.3.2-0.20180103174737.2e7d298.el7.centos.noarch
openstack-neutron-common-10.0.5-0.20180105192920.295c700.el7.centos.noarch
openvswitch-ovn-common-2.6.1-10.1.git20161206.el7.x86_64
openstack-neutron-sriov-nic-agent-10.0.5-0.20180105192920.295c700.el7.centos.noarch
python-neutron-lib-1.1.0-1.el7.noarch
openvswitch-2.6.1-10.1.git20161206.el7.x86_64
python2-neutronclient-6.1.1-1.el7.noarch
python-openvswitch-2.6.1-10.1.git20161206.el7.noarch
openstack-neutron-lbaas-10.0.2-0.20180104200311.10771af.el7.centos.noarch
openvswitch-ovn-host-2.6.1-10.1.git20161206.el7.x86_64
python-neutron-10.0.5-0.20180105192920.295c700.el7.centos.noarch
openvswitch-ovn-central-2.6.1-10.1.git20161206.el7.x86_64
openstack-neutron-metering-agent-10.0.5-0.20180105192920.295c700.el7.centos.noarch
python-neutron-lbaas-10.0.2-0.20180104200311.10771af.el7.centos.noarch
openstack-neutron-openvswitch-10.0.5-0.20180105192920.295c700.el7.centos.noarch
openstack-neutron-ml2-10.0.5-0.20180105192920.295c700.el7.centos.noarch
openstack-neutron-10.0.5-0.20180105192920.295c700.el7.centos.noarch

How reproducible:
Not sure. Over the past day or so we have noticed several users complaining about unreachable instances. Not all VMs have this issue, and it is not clear how connectivity was lost in the first place.

Actual results:
In some cases the router is active on more than one controller; in others the router appears to be configured correctly, but the qg-xxxx interface isn't NAT-ing traffic to the qr-xxx interface. The iptables rules look correct.

Expected results:
VMs reachable via FIP.

Additional info:
Some ports appear to be stuck in 'BUILD' status, but it is not clear what is causing this.

See this error in the logs:

2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 256, in call
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 1116, in process
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info self.process_external()
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 910, in process_external
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info self.update_fip_statuses(fip_statuses)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 926, in update_fip_statuses
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info self.agent.context, self.router_id, fip_statuses)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 125, in update_floatingip_statuses
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info router_id=router_id, fip_statuses=fip_statuses)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 151, in call
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info return self._original_context.call(ctxt, method, **kwargs)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info retry=self.retry)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info timeout=timeout, retry=retry)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info retry=retry)
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 557, in _send
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info raise result
2018-01-19 18:36:03.930 74015 ERROR neutron.agent.l3.router_info RemoteError: Remote error: TimeoutError QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10

Tags: db
Revision history for this message
Paul Belanger (pabelanger) wrote :

Have you seen https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1384108? It seems to indicate setting the max connections according to the number of API workers, e.g. bumping the max connections to 4x the worker configuration just to be on the safe side.

However, it is possible connections are being leaked somewhere. It may be worth checking whether the open connections are still valid.

https://stackoverflow.com/questions/3360951/sql-alchemy-connection-time-out applies to SQLAlchemy, but the backtrace here seems to be AMQP.
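
As a rough sketch of that last suggestion (assuming a MySQL backend and the PyMySQL driver; the connection URL is a placeholder, not the real deployment's), the open server-side connections can be listed directly and inspected for stale entries:

from sqlalchemy import create_engine, text

# Placeholder credentials/host: substitute the real neutron database URL.
engine = create_engine("mysql+pymysql://neutron:secret@controller/neutron")

with engine.connect() as conn:
    # Each row is one open client connection; long-lived rows whose Command
    # column is "Sleep" with a large Time value are candidates for leaks.
    for row in conn.execute(text("SHOW PROCESSLIST")):
        print(row)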

Revision history for this message
Brian Haley (brian-haley) wrote :

The error indicates that the l3-agent wasn't able to send floating IP info to the neutron-server when adding/updating a router. I would think that increasing the relevant items in the [database] section of neutron.conf (max_pool_size and max_overflow, for example) could alleviate this.

Did the l3-agent retry things at the next loop or just continue on? It should have logged something like "Failed to process compatible router" and attempted a resync in a few seconds unless it completely went out to lunch.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Please share your [database] settings from neutron.conf, as well as the number of workers you use (both RPC and API). How is the cloud configured, TripleO?

In Pike+, we changed the default values used for database interaction to the oslo.db defaults: https://bugs.launchpad.net/neutron/+bug/1682307 (which is e.g. 50 for the overflow setting). We couldn't backport it to Ocata since it's a backwards-incompatible change. I am not sure if we followed up with any fixes for TripleO.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I think what makes sense is to try default values from oslo.db and see if they fix the issue. Specifically, set in neutron.conf:

[database]
pool_timeout=30
max_overflow=50
max_pool_size=5

If that doesn't help, we may need to look into further tweaks. Otherwise, we can assume the setup is satisfied by the neutron defaults that kick in from Pike onwards, and hence no change is needed.

I also checked whether we had any related patches in puppet / TripleO, and it doesn't seem like it.

Changed in neutron:
status: New → Incomplete
importance: Undecided → Medium
tags: added: db
Revision history for this message
Mike Bayer (zzzeek) wrote :

The connection pool tuning values apply to the configuration *per* worker, that is, assuming a "worker" here is a Python process (e.g. a fork, with its own PID). So the number of connections here is not related to the number of "workers" (again, assuming a "worker" is a Python process, not a thread or greenlet); it has to do with the concurrency inside a single worker.

What the number *is* related to is how many concurrent requests may be sent to a single worker at a time. I wrote a long email about this in [1], which refers to how OpenStack services at that time were all based on eventlet and were all set up to allow 1000 greenlets per process, meaning as many as 1000 requests could be sent to a single process at a time, which will definitely bring forth this error.

I'm not up to speed on which services are using eventlet vs. mod_wsgi with threads vs. something else; however, what's important here is the number of concurrent requests a single process can handle. If neutron is still on eventlet and using greenlets, the number of greenlets should be tuned way down and the number of possible connections tuned up to match the number of greenlets.

Background on what the connection pool error means is at [2], I recommend reading it completely to gain further understanding of this issue.

[1] http://markmail.org/message/r45i4ifnpnvgprpu

[2] http://docs.sqlalchemy.org/en/latest/errors.html#queuepool-limit-of-size-x-overflow-y-reached-connection-timed-out-timeout-z
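
To make the pool arithmetic behind the error concrete, here is a minimal self-contained sketch that reproduces the QueuePool timeout with SQLAlchemy; the SQLite file URL and the shortened timeout are stand-ins for neutron's real MySQL settings, and the pool sizes mirror the values in the error message above:

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# Pool sizing mirrors the error message above (size 10, overflow 20);
# pool_timeout is shortened so the demo fails in 2 seconds instead of 10.
# The SQLite URL is a placeholder for neutron's MySQL database.
engine = create_engine(
    "sqlite:///queuepool_demo.db",
    poolclass=QueuePool,
    pool_size=10,
    max_overflow=20,
    pool_timeout=2,
)

# Check out pool_size + max_overflow = 30 connections and never return them,
# which is effectively what leaked or long-running requests do.
held = [engine.connect() for _ in range(30)]

# The 31st checkout waits pool_timeout seconds, then raises
# sqlalchemy.exc.TimeoutError: "QueuePool limit of size 10 overflow 20
# reached, connection timed out, timeout 2", i.e. the same error that is
# relayed back over RPC in the traceback above.
engine.connect()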

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Neutron still uses eventlet, and it relies on the wsgi_default_pool_size option defined in oslo.service (default value = 100 since Mitaka). As for the number of database connections allowed, in Ocata it uses custom defaults:

max_pool_size=10
max_overflow=20
pool_timeout=10

From what I understand, Mike suggests that we raise pool_size / overflow to something close to the number of greenthreads per worker, which is 100, since neutron-server is db-intensive and most if not all API requests trigger at least one database transaction.

While it makes sense to poke at the current defaults (overflow=50) further if we keep experiencing issues, the easiest thing to do is to try the current Pike+ oslo.db defaults that I mentioned in comment 4 and see whether that is enough to alleviate the issue. If not, we can tweak the number of allowed database connections even higher / the number of greenthreads even lower.

Which makes me wonder: should we raise the default number of db connections allowed to at least wsgi_default_pool_size? Are there any drawbacks to doing so?
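
To make that concrete, a sketch with illustrative values only (not a tested recommendation) of what sizing the per-worker pool against the greenthread count could look like in neutron.conf, given the oslo.service default of 100 greenthreads per worker:

[DEFAULT]
# oslo.service default: up to 100 greenthreads (concurrent requests) per worker
wsgi_default_pool_size = 100

[database]
# sized so that pool + overflow together cover the greenthread count above
max_pool_size = 50
max_overflow = 50
pool_timeout = 30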

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Moving back to Confirmed, as I think the discussion in the comments about defaults is fair regardless of whether the configuration tweak from comment 4 helps the reporter.

Changed in neutron:
status: Incomplete → Confirmed
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
David Manchado (dmanchad) wrote :

Just to update that we had the same issue yesterday.

We changed from the defaults to the values suggested in comment 4, and after restarting neutron-l3-agent the error was gone.

Revision history for this message
Mike Bayer (zzzeek) wrote :

There is a theoretical reason for wsgi_default_pool_size to be (a little bit) bigger than the number of DB connections: the extra greenlets can be doing IO with the client or with some other service besides the database while the greenlets that hold a DB connection are waiting on database IO. I have a feeling this is exceedingly unlikely in neutron. It's important to remember that all of these greenlets pile up in exactly one thread (and even if it were multiple threads, the GIL serializes all CPU work anyway), so the greenlets have to really be doing IO-bound things and waiting on them for it to be at all worth having even 50 greenlets. All that having more greenlets buys you is that a process can accept a connection up front and have it wait to use the database; but as I pointed out in my email, it's better that requests pile up on the "accept" call so that they go to a worker that can definitely accommodate them, rather than landing in one worker at random and waiting for that worker to attend to all its other greenlets.

It is also the case that if there are too many greenlets, or the greenlets start to pile up with CPU-bound work, you can actually get dropped SQL connections, as MySQL will drop a new client connection that hasn't finished authenticating after just ten seconds. I recently found Cinder doing this: https://bugs.launchpad.net/cinder/+bug/1719580
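
As a side note, the ten-second window corresponds to MySQL's connect_timeout server variable (default 10 seconds). A quick sketch to check it, reusing the placeholder URL from the earlier snippet:

from sqlalchemy import create_engine, text

# Placeholder URL; substitute the real neutron database connection string.
engine = create_engine("mysql+pymysql://neutron:secret@controller/neutron")

with engine.connect() as conn:
    # Prints something like ('connect_timeout', '10') on a default install.
    print(conn.execute(text("SHOW VARIABLES LIKE 'connect_timeout'")).fetchone())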

tl;dr mike hates greenlets for database-bound business methods

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I don't believe we are going to change any defaults for the oslo.db / oslo.service options upstream, so I'm closing this as Won't Fix. Also, it's not clear whether the issue is present in fresh versions of neutron.

Changed in neutron:
status: Confirmed → Won't Fix