Cannot adjust number of resources in one agent step during device handling.

Bug #1938202 reported by Mitya Eremeev
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Mitya Eremeev

Bug Description

Sometimes certain RPC calls time out during port updates on compute nodes.
It may happen when OVS finds a new port on br-int, i.e. during a very regular operation.
Timeout example from logs:

File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2384,

in rpc_loop\n port_info, provisioning_needed)\n', ' File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158,

in wrapper\n result = f(*args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1831,

in process_network_ports\n port_info.get(\'updated\', set()))\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/securitygroups_rpc.py", line 262,

in setup_port_filters\n self.prepare_devices_filter(new_devices)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/securitygroups_rpc.py", line 125,

in decorated_function\n *args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/securitygroups_rpc.py", line 137,

in prepare_devices_filter\n self._apply_port_filter(device_ids)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/securitygroups_rpc.py", line 147,

in _apply_port_filter\n self.context, list(device_ids)[i:i + step])\n', ' File "/usr/lib/python2.7/dist-packages/neutron/api/rpc/handlers/securitygroups_rpc.py", line 233,

in security_group_info_for_devices\n result = self.security_group_info_for_ports(context, ports)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/securitygroups_rpc_base.py", line 220,

in security_group_info_for_ports\n return self._get_security_group_member_ips(context, sg_info)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/db/securitygroups_rpc_base.py", line 224,

in _get_security_group_member_ips\n context, sg_info[\'sg_member_ips\'].keys())\n', ' File "/usr/lib/python2.7/dist-packages/neutron/api/rpc/handlers/securitygroups_rpc.py", line 328,

in _select_ips_for_remote_group\n for p in self.rcache.get_resources(\'Port\', filters):\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/resource_cache.py", line 120,

in get_resources\n self._flood_cache_for_query(rtype, **filters)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/agent/resource_cache.py", line 76,

in _flood_cache_for_query\n filter_kwargs=filter_kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_log/helpers.py", line 67,

in wrapper\n return method(*args, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/api/rpc/handlers/resources_rpc.py", line 112,

in bulk_pull\n version=resource_type_cls.VERSION, filter_kwargs=filter_kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 189,

in call\n time.sleep(wait)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220,

in __exit__\n self.force_reraise()\n', ' File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196,

in force_reraise\n six.reraise(self.type_, self.value, self.tb)\n', ' File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 165,

in call\n return self._original_context.call(ctxt, method, **kwargs)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 174,

in call\n retry=self.retry)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 131,

in _send\n timeout=timeout, retry=retry)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 625,

in send\n retry=retry)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 614,

in _send\n result = self._waiter.wait(msg_id, timeout)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 506,

in wait\n message = self.waiters.get(msg_id, timeout=timeout)\n', ' File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 394,

in get\n \'to message ID %s\' % msg_id)\n']: MessagingTimeout: Timed out waiting for a reply to message ID 1b9cbbfe84f84d7581d53640a9d75222

The reason for the RPC timeouts is scale.
We have a hundred compute nodes with hundreds of networks, and large stacks are created on them, which causes huge amounts of data to be sent between neutron-server and the OVS agents.
That often leads to RPC timeouts.

The increased load in this scenario is caused by the L2pop feature, which tells the OVS agents how to properly set up flooding flows. Each new VM spawned as part of the stack causes all computes that have ports on the same network to receive notifications. Upon notification, the OVS agent requests the l2pop data, which is a lot of data. It all goes through RPC and is heavy on the neutron-server side.

The workaround of increasing the RPC timeout doesn't actually help, because it is effectively already in place: agents apply exponential back-off to RPC calls, so they double the timeout after each failed attempt. The timeouts still happen because they depend on external factors such as how big the stacks being created are.
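
For reference, the back-off behaviour roughly works like the following; this is a minimal sketch assuming a generic rpc_call(timeout=...) callable, not neutron's actual back-off client:

    import time

    def call_with_backoff(rpc_call, initial_timeout=60, max_attempts=5):
        # Double the timeout after every failed attempt, as the agents do.
        timeout = initial_timeout
        for attempt in range(max_attempts):
            try:
                return rpc_call(timeout=timeout)
            except TimeoutError:  # stands in for oslo.messaging's MessagingTimeout
                timeout *= 2
                time.sleep(1)
        raise TimeoutError('RPC call kept timing out after %d attempts' % max_attempts)

Doubling the timeout helps when the server is briefly overloaded, but not when a single reply simply takes longer than any reasonable timeout to produce.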

The call that times out is “bulk_pull”, which sometimes returns about 2000 objects (ports) and takes > 120 seconds to complete.

What I noticed is that it's not quite linear: up to 100 objects are returned within 0.1 seconds, and then there is a big, disproportionate jump in response time. Perhaps it depends on what extra info those ports carry (I guess it's security-group-related info).

Looking at the agent code, there is only a single place where bulk_pull is used, which is RemoteResourceCache.

Inside of it, one method fetches resources by ID, so it looks like it can't produce that many resources in a query anyway.

The other one, which gets resources by an "arbitrary" query, can, and it is used only in security group handling.
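
For context, the agent-side call chain visible in the traceback (get_resources -> _flood_cache_for_query -> bulk_pull) looks roughly like this; it is a paraphrase for illustration, with the cache bookkeeping simplified, not the verbatim neutron code:

    class RemoteResourceCache(object):
        def __init__(self, puller, context):
            self._puller = puller          # RPC client exposing bulk_pull
            self._context = context
            self._resources = {}           # rtype -> list of cached resources

        def get_resources(self, rtype, filters):
            # An "arbitrary" filter query (e.g. ports for a set of security
            # group ids) may not be answerable from the local cache, so the
            # whole matching result set is pulled from neutron-server at once.
            self._flood_cache_for_query(rtype, **filters)
            return self._resources.get(rtype, [])

        def _flood_cache_for_query(self, rtype, **filter_kwargs):
            # bulk_pull is the RPC call that times out when the server has to
            # return thousands of ports in a single response.
            pulled = self._puller.bulk_pull(
                self._context, rtype, filter_kwargs=filter_kwargs)
            self._resources.setdefault(rtype, []).extend(pulled)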

The number of resources that neutron agent-side functions use to deal with large sets always equals 100, and the number of resources neutron uses to divide large RPC call data sets always equals 20. In "big" networks even these numbers can still be huge and cause service timeouts.
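
The splitting itself is just slicing into fixed-size chunks, as the _apply_port_filter frame in the traceback shows. A simplified sketch follows; the function and constant names are illustrative, and the real values live in the constants file linked at the end of this description:

    PROCESSING_STEP = 20  # illustrative value; see the linked _constants.py

    def apply_port_filter(plugin_rpc, context, device_ids, step=PROCESSING_STEP):
        # One RPC call per chunk of device ids; the step size bounds how much
        # data a single call can ask the server to produce.
        results = []
        device_ids = list(device_ids)
        for i in range(0, len(device_ids), step):
            chunk = device_ids[i:i + step]
            results.append(
                plugin_rpc.security_group_info_for_devices(context, chunk))
        return results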

When the OVS agent configures the OVS firewall, it issues an RPC query asking for ports, using security group IDs as filters.

The query is "give me all ports related to given security group ids".

In our environment, a few security group IDs corresponding to big networks result in about 3000 ports being fetched by this query.
When being loaded, each port is decorated with a few extra objects by SQLAlchemy, which issues subqueries for each port.
So as a result we end up with thousands of SQL queries for a single RPC request.
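
To illustrate the subquery fan-out, here is a toy SQLAlchemy example; these are not neutron's actual models, just a minimal demonstration of how lazily loaded relationships turn one bulk query into one extra query per port:

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base, relationship

    Base = declarative_base()

    class Port(Base):
        __tablename__ = 'ports'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        # Lazy by default: touching port.fixed_ips emits one SELECT per port.
        fixed_ips = relationship('FixedIP', lazy='select')

    class FixedIP(Base):
        __tablename__ = 'fixed_ips'
        id = Column(Integer, primary_key=True)
        port_id = Column(Integer, ForeignKey('ports.id'))
        address = Column(String)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        ports = session.query(Port).all()    # 1 query for all ports
        for port in ports:
            _ = port.fixed_ips                # +1 query for every port

With roughly 3000 ports and several decorated relationships per port, this pattern easily reaches thousands of statements for a single RPC request.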

My suggestion is to change the place where this RPC request is made so that, instead of sending multiple security group IDs at once, the agent sends them one by one.
While this will not improve overall performance, it will reduce the time taken by each RPC request, distribute the load between neutron-server instances better, and avoid RPC timeouts.
Also, this is an easy fix and does not require changes to the internal RPC APIs.
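
A minimal sketch of the suggestion, assuming a helper that wraps the existing "ports for these security group ids" query; get_ports_for_security_groups is an illustrative stand-in, not an actual neutron function:

    def collect_member_ips(get_ports_for_security_groups, remote_group_ids):
        # One security group per request keeps each RPC payload and each SQL
        # query small, at the cost of more round trips.
        ips_by_group = {sg_id: set() for sg_id in remote_group_ids}
        for sg_id in remote_group_ids:
            for port in get_ports_for_security_groups([sg_id]):
                for ip in port.get('fixed_ips', []):
                    ips_by_group[sg_id].add(ip)
        return ips_by_group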

https://opendev.org/openstack/neutron/src/branch/master/neutron/common/_constants.py#L58
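
For completeness, making the step size configurable would look something like the oslo.config sketch below; the option name, group and default used here are assumptions for illustration only, and the actual names are in the linked file and the proposed patch:

    from oslo_config import cfg

    # Hypothetical option name; check the linked _constants.py and the
    # review for the real constant/option names and defaults.
    opts = [
        cfg.IntOpt('rpc_resources_processing_step',
                   default=20, min=1,
                   help='Number of resources per chunk when agent and server '
                        'split large RPC data sets.'),
    ]

    cfg.CONF.register_opts(opts, group='agent')
    step = cfg.CONF.agent.rpc_resources_processing_step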

Tags: loadimpact
Mitya Eremeev (mitos)
Changed in neutron:
assignee: nobody → Mitya Eremeev (mitos)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/802596

Mitya Eremeev (mitos)
description: updated
tags: added: loadimpact
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806329

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "mitya-eremeev-2 <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806203
Reason: merged with 802596: Agent RPC step size in config. | https://review.opendev.org/c/openstack/neutron/+/802596

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/802596
Committed: https://opendev.org/openstack/neutron/commit/1221cf1b24484d61622f16937f1d3a08a5c2c27f
Submitter: "Zuul (22348)"
Branch: master

commit 1221cf1b24484d61622f16937f1d3a08a5c2c27f
Author: Mitya_Eremeev <email address hidden>
Date: Tue Jul 27 21:14:02 2021 +0300

    Agent RPC step size in config.

    Number of resources for neutron to divide
    the large RPC call data sets always equals 100.
    In "big" networks even these numbers can be
    still huge and cause service timeouts.
    Now we can decrease these numbers in config.
    Default value equals 20 now.

    Closes-Bug: 1938202
    Change-Id: Idf545ad31398ded460b6c2ae1675dd5e9ae71440

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.0.0.0rc1

This issue was fixed in the openstack/neutron 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "liuyulong <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806329
Reason: reopen if we want this
