neutron api server high load after upgrade to ocata

Bug #1734623 reported by Mario Sommer
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

After the upgrade to Ocata, the n-api process is under very high load on a more or less "idle" cloud. No matter how many CPU cores or API workers are configured, there is always one neutron server process at 100% CPU usage. This process seems to issue constant MySQL queries.
In the debug log there are many "Checking segment" messages.
http://paste.openstack.org/show/627415/

The agents time out with
"Failed reporting state!" "MessagingTimeout: Timed out waiting for a reply to message"
http://paste.openstack.org/show/627414/

Revision history for this message
zhaobo (zhaobo6) wrote :

How long has the cloud been idle? I suspect there are tons of messages in the MQ service, and both the server and the agents have to process them one by one, so it is possible that this consumes a lot of resources such as CPU and memory.

Could you share the information mentioned above, including the RabbitMQ settings and the DB log?

tags: added: db needs-attention
Revision history for this message
Mario Sommer (mario-sommer) wrote :

Attached is the n-api server.log with connection_debug=50.
When I stopped all neutron agents, everything was quiet on the n-api server.
As soon as I started the L3, openvswitch, or DHCP agent, the massive queries started.
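
(For reference, connection_debug is the oslo.db option that controls SQL logging verbosity: 0 = none, 100 = everything; roughly, values of 50 and above log the SQL statements themselves. A minimal sketch of where it would be set, assuming the [database] section of /etc/neutron/neutron.conf:)

    # sketch only -- path and surrounding options may differ in your deployment
    [database]
    # log the SQL statements issued by neutron-server
    connection_debug = 50

The extra logging is expensive, so it should be set back to 0 once the trace has been captured.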

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This may be hard to reproduce without knowing the exact state of the neutron database prior to the upgrade. It's possible that the queries are triggered depending on how many networks you have in your environment. Can you help us isolate the environment conditions under which you see this load spike?

Changed in neutron:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Mario Sommer (mario-sommer) wrote :

It's not a spike, but a permanent high load.
Here is an overview of the current neutron database: http://paste.openstack.org/show/627538/
The only thing that stands out is the allowedaddresspairs table, which has nearly as many entries as the ports table. We do use allowed address pairs heavily; could this be the problem?
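
(For anyone gathering a similar overview: approximate per-table row counts can be pulled from information_schema. A sketch, assuming local MySQL access and a database named "neutron"; table_rows is only an estimate for InnoDB, but it is enough to spot a table that has grown to the same order of magnitude as ports.)

    mysql -e "SELECT table_name, table_rows FROM information_schema.tables
              WHERE table_schema = 'neutron' ORDER BY table_rows DESC;"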

Revision history for this message
Mario Sommer (mario-sommer) wrote :

On Nova compute nodes with many allowed address pairs, the openvswitch agent's "Agent rpc_loop" takes forever, and I think it never finishes, as the "added" count never decreases.

2017-11-28 09:02:15.725 53388 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c9e21302-5c5c-4f2c-b2ba-b8041d47c1e9 - - - - -] Agent rpc_loop - iteration:91 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 30, 'removed': 0}}. Elapsed:1205.793 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771
2017-11-28 09:13:07.161 53388 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c9e21302-5c5c-4f2c-b2ba-b8041d47c1e9 - - - - -] Agent rpc_loop - iteration:92 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 30, 'removed': 0}}. Elapsed:651.435 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771

Compare this to another compute node without allowed address pairs:
2017-11-28 11:01:24.938 32558 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-8075ac82-7601-4070-9bf7-a5de18982114 - - - - -] Agent rpc_loop - iteration:638 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.010 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Mario: that's a great piece of insight. Perhaps there's a query optimization that we could perform to speed things up. We should investigate how allowed address pairs are joined to the port object.

Revision history for this message
Mario Sommer (mario-sommer) wrote :

After we applied the patch from https://review.openstack.org/#/c/454203/ and restarted the n-api server, the load instantly dropped to zero.
If I understand this patch correctly, it simply disables DVR by not loading it.
Luckily we don't use DVR (yet), but the combination of DVR and lots of allowed address pairs seems to be a huge problem.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

You wouldn't need the patch; just set distributed=False in the neutron config file. Actually, I would be interested to know whether that has the same outcome.

That said, I am starting to wonder if we need something along the lines of [1], but targeting allowed address pairs. Enabling debug logging for SQLAlchemy may reveal a query explosion that would point us in the right direction.

[1] https://review.openstack.org/#/c/479091/
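
(For reference, the DVR toggle referred to here is presumably the router_distributed option; a minimal sketch, assuming it sits in the [DEFAULT] section of the server's neutron.conf:)

    # sketch only -- option placement may differ in your deployment
    [DEFAULT]
    # do not create new routers as distributed (DVR) by default
    router_distributed = false

The SQLAlchemy debug logging mentioned above would be the same connection_debug knob noted earlier in the thread.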

Revision history for this message
Mario Sommer (mario-sommer) wrote :

Tried with distributed=False in neutron.conf. This does not work; the n-api server load is still high.
I could reproduce the problem in a small test environment by creating a single instance with two ports and setting max_allowed_address_pair=100 in neutron.conf. Updating the first port with 100 allowed addresses (neutron port-update <port> --allowed-address-pairs type=dict list=true ip_address=10.0.0.100 ip_address=10.0.0.101 [...] ip_address=10.0.0.200) worked. Updating the second port with the same command failed because of the high neutron server load. A sketch of these steps follows below.
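
(A rough sketch of those reproduction steps, assuming the legacy neutron CLI of that era and a placeholder port UUID; max_allowed_address_pair is presumably raised in [DEFAULT] and neutron-server restarted before running this:)

    # on the server: raise the per-port allowed-address-pair limit in neutron.conf
    #   [DEFAULT]
    #   max_allowed_address_pair = 100
    # then restart neutron-server and update the port with 100 pairs:
    PORT_ID="REPLACE_WITH_PORT_UUID"   # hypothetical placeholder; substitute the real port UUID
    PAIRS=$(for i in $(seq 100 199); do printf 'ip_address=10.0.0.%s ' "$i"; done)
    neutron port-update "$PORT_ID" --allowed-address-pairs type=dict list=true $PAIRS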

Also attached the SQL debug log.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired