neutron api server high load after upgrade to ocata

Bug #1734623 reported by Mario Sommer
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

After the upgrade to Ocata, the n-api process is under very high load on a more or less "idle" cloud. No matter how many CPU cores or API workers are configured, there is always one neutron server process at 100% CPU usage. This process seems to issue constant MySQL queries.
In the debug log there are many "Checking segment" messages.
http://paste.openstack.org/show/627415/

The agents time out with
"Failed reporting state!" "MessagingTimeout: Timed out waiting for a reply to message"
http://paste.openstack.org/show/627414/

Revision history for this message
zhaobo (zhaobo6) wrote :

How long has the cloud been idle? I suspect there are tons of messages in the MQ service, and both the server and the agents have to process them one by one, so it is possible that this consumes a lot of resources such as CPU and memory.

Could you share the information mentioned above, including the RabbitMQ settings and the DB log?

tags: added: db needs-attention
Revision history for this message
Mario Sommer (mario-sommer) wrote :

Attached is the n-api server.log with connection_debug=50.
When I stopped all neutron agents, everything was quiet on the n-api server.
As soon as I started the L3, openvswitch, or DHCP agent, the massive queries started.
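
(For reference, connection_debug is the oslo.db option that controls SQL logging verbosity: 0 = none, 100 = everything; roughly, values of 50 and above log the SQL statements themselves. A minimal sketch of where it would be set, assuming the [database] section of /etc/neutron/neutron.conf:)

    # sketch only -- path and surrounding options may differ in your deployment
    [database]
    # log the SQL statements issued by neutron-server
    connection_debug = 50

The extra logging is expensive, so it should be set back to 0 once the trace has been captured.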

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This may be hard to reproduce without knowing the exact state of the neutron database prior to the upgrade. It's possible that the queries are triggered depending on how many networks you have in your environment. Can you help us isolate the environment conditions under which you see this load spike?

Changed in neutron:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Mario Sommer (mario-sommer) wrote :

It's not a spike, but a permanent high load.
Here is an overview of the current neutron database: http://paste.openstack.org/show/627538/
The only thing that stands out is the allowedaddresspairs table, which has nearly as many entries as the ports table. We do use allowed address pairs heavily; could this be the problem?
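
(For anyone gathering a similar overview: approximate per-table row counts can be pulled from information_schema. A sketch, assuming local MySQL access and a database named "neutron"; table_rows is only an estimate for InnoDB, but it is enough to spot a table that has grown to the same order of magnitude as ports.)

    mysql -e "SELECT table_name, table_rows FROM information_schema.tables
              WHERE table_schema = 'neutron' ORDER BY table_rows DESC;"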

Revision history for this message
Mario Sommer (mario-sommer) wrote :

On Nova compute nodes with many allowed address pairs, the openvswitch agent's "Agent rpc_loop" takes forever, and I think it never finishes, as the "added" count never decreases.

2017-11-28 09:02:15.725 53388 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c9e21302-5c5c-4f2c-b2ba-b8041d47c1e9 - - - - -] Agent rpc_loop - iteration:91 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 30, 'removed': 0}}. Elapsed:1205.793 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771
2017-11-28 09:13:07.161 53388 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c9e21302-5c5c-4f2c-b2ba-b8041d47c1e9 - - - - -] Agent rpc_loop - iteration:92 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 30, 'removed': 0}}. Elapsed:651.435 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771

Compare this to another compute node without allowed address pairs:
2017-11-28 11:01:24.938 32558 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-8075ac82-7601-4070-9bf7-a5de18982114 - - - - -] Agent rpc_loop - iteration:638 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}. Elapsed:0.010 loop_count_and_wait /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Mario: that's a great piece of insight. Perhaps there's a query optimization that we could perform to speed things up. We should investigate how allowed address pairs are joined to the port object.

Revision history for this message
Mario Sommer (mario-sommer) wrote :

After we applied the patch from https://review.openstack.org/#/c/454203/ and restarted the n-api server, the load instantly dropped to zero.
If I understand this patch correctly, it simply disables DVR by not loading it.
Luckily we don't use DVR (yet), but the combination of DVR and lots of allowed address pairs seems to be a huge problem.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

You wouldn't need the patch; just set distributed=False in the neutron config file. Actually, I would be interested to know whether that has the same outcome.

That said, I am starting to wonder if we need something along the lines of [1], but targeting allowed address pairs. Enabling debug logging for SQLAlchemy may reveal a query explosion that would point us in the right direction.

[1] https://review.openstack.org/#/c/479091/
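
(For reference, the DVR toggle referred to here is presumably the router_distributed option; a minimal sketch, assuming it sits in the [DEFAULT] section of the server's neutron.conf:)

    # sketch only -- option placement may differ in your deployment
    [DEFAULT]
    # do not create new routers as distributed (DVR) by default
    router_distributed = false

The SQLAlchemy debug logging mentioned above would be the same connection_debug knob noted earlier in the thread.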

Revision history for this message
Mario Sommer (mario-sommer) wrote :

Tried with distributed=False in neutron.conf. This does not work; the n-api server load is still high.
I could reproduce the problem in a small test environment by creating a single instance with two ports and setting max_allowed_address_pair=100 in neutron.conf. Updating the first port with 100 allowed addresses (neutron port-update <port> --allowed-address-pairs type=dict list=true ip_address=10.0.0.100 ip_address=10.0.0.101 [...] ip_address=10.0.0.200) worked. Updating the second port with the same command failed because of the high neutron server load. A sketch of these steps follows below.
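
(A rough sketch of those reproduction steps, assuming the legacy neutron CLI of that era and a placeholder port UUID; max_allowed_address_pair is presumably raised in [DEFAULT] and neutron-server restarted before running this:)

    # on the server: raise the per-port allowed-address-pair limit in neutron.conf
    #   [DEFAULT]
    #   max_allowed_address_pair = 100
    # then restart neutron-server and update the port with 100 pairs:
    PORT_ID="REPLACE_WITH_PORT_UUID"   # hypothetical placeholder; substitute the real port UUID
    PAIRS=$(for i in $(seq 100 199); do printf 'ip_address=10.0.0.%s ' "$i"; done)
    neutron port-update "$PORT_ID" --allowed-address-pairs type=dict list=true $PAIRS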

Also attached the SQL debug log.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired