Cannot adjust number of resources in one agent step during device handling.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Mitya Eremeev |
Bug Description
Sometimes certain RPC calls time out during port updates on compute nodes.
It may happen when OVS finds a new port on br-int, i.e. during a very regular operation.
Timeout example from logs (truncated; only the call chain is readable):

rpc_loop → process_… → setup_port_… → decorated_… → prepare_… → _apply_… → security_… → _get_security_… → _select_… → get_resources → _flood_… → bulk_pull → call → _send → send → _send → wait → get

MessagingTimeout: Timed out waiting for a reply to message ID 1b9cbbfe84f84d7
The reason for the RPC timeouts is scale.
We have about a hundred compute nodes with hundreds of networks, and we create large stacks, which causes huge amounts of data to be sent between neutron-server and the OVS agents.
That often leads to RPC timeouts.
The increased load in this scenario is caused by the L2pop feature, which tells OVS agents how to properly set up flooding flows. Each new VM spawned for the stack causes all computes that have ports on the same network to receive notifications. Upon notification the OVS agent requests l2pop data, which is a lot of data. It all goes through RPC and is heavy on the neutron-server side.
The workaround of increasing the RPC timeout does not actually help, because it is effectively in place already: agents have exponential back-off on RPC calls, so they double the timeout after each failed attempt. Timeouts still happen because they depend on external factors, such as how big the stacks being created are.
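For context, the back-off behaves roughly like the following sketch. This is illustrative only; the function name and parameters are assumptions, not neutron's actual RPC client code:

```python
import time


def call_with_backoff(rpc_call, base_timeout=60, max_timeout=600, attempts=5):
    """Retry rpc_call, doubling the timeout after each failure.

    Sketch only: the real back-off lives in neutron's agent-side RPC
    client and differs in detail.
    """
    timeout = base_timeout
    last_exc = None
    for _ in range(attempts):
        try:
            return rpc_call(timeout=timeout)
        except TimeoutError as exc:
            last_exc = exc
            timeout = min(timeout * 2, max_timeout)  # exponential back-off
            time.sleep(1)
    raise last_exc
```

Doubling the timeout only buys time; if the reply itself takes longer than any sane ceiling, every attempt still fails.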
The call that times out is "bulk_pull", which sometimes returns about 2000 objects (ports) and takes more than 120 seconds to complete.
What I noticed is that the response time is not quite linear in the number of objects. E.g. up to 100 objects are returned within 0.1 seconds, and then there is a big, disproportionate jump in response time. Perhaps it depends on what extra info those ports carry (I guess it is security-group-related info).
Looking at the agent code, there is only a single place where bulk_pull is used, which is RemoteResourceCache.
Inside of it, one method fetches resources by ID, so it looks like it cannot produce that many resources in a query anyway.
The other one, which fetches resources by an "arbitrary" query, can, and it is used only in security group handling…
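To make the two paths concrete, here is a simplified sketch of their shapes. The class and method names are illustrative assumptions; the real implementation lives in neutron's agent resource cache and is more involved:

```python
class ResourceCacheSketch:
    """Sketch of the two bulk_pull call sites described above."""

    def __init__(self, rpc):
        self.rpc = rpc
        self._cache = {}  # (resource_type, id) -> object

    def get_resource_by_id(self, resource_type, obj_id):
        # Bounded query: at most one object can come back.
        key = (resource_type, obj_id)
        if key not in self._cache:
            objs = self.rpc.bulk_pull(resource_type, filters={'id': [obj_id]})
            self._cache[key] = objs[0] if objs else None
        return self._cache[key]

    def get_resources(self, resource_type, filters):
        # Unbounded query, e.g. "all ports for these security group ids".
        # This is the path that can return ~2000 objects in one reply.
        objs = self.rpc.bulk_pull(resource_type, filters=filters)
        for obj in objs:
            self._cache[(resource_type, obj['id'])] = obj
        return objs
```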
The number of resources that neutron agent-side functions handle in one step when dealing with large sets is hard-coded to 100. The number of resources neutron uses when dividing large RPC call data sets is hard-coded to 20. In "big" networks even these numbers can still be huge and cause service timeouts.
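A minimal sketch of what those fixed chunk sizes mean in practice; the constant names and exact usage are assumptions (the point of this bug is precisely that they cannot be adjusted):

```python
# Assumed constants mirroring the hard-coded values described above.
AGENT_STEP_SIZE = 100   # resources an agent handles in one step
SERVER_CHUNK_SIZE = 20  # resources per chunk when dividing RPC data sets


def chunked(items, size):
    """Yield consecutive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# With ~3000 matching ports, the server still produces 150 chunks of 20
# and the agent still works through 30 steps of 100, so a single request
# can easily outlive its RPC timeout.
ports = list(range(3000))
print(sum(1 for _ in chunked(ports, SERVER_CHUNK_SIZE)))  # 150
print(sum(1 for _ in chunked(ports, AGENT_STEP_SIZE)))    # 30
```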
When the OVS agent configures the OVS firewall, it fires an RPC query asking for ports, with security group IDs as filters.
The query is "give me all ports related to the given security group IDs".
In our environment, a few security group IDs corresponding to big networks result in about 3000 ports being fetched by this query.
When loaded, each port is decorated with a few extra objects by SQLAlchemy, which issues subqueries for each port.
As a result, we end up with thousands of SQL queries for a single RPC request.
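The query pattern that produces this is the classic N+1 problem. A self-contained toy demonstration follows (SQLAlchemy 1.4+; the models are stand-ins, not neutron's schema):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Port(Base):
    __tablename__ = 'ports'
    id = Column(Integer, primary_key=True)
    security_group_id = Column(String)
    # Lazy by default: touching .fixed_ips fires one SELECT per port.
    fixed_ips = relationship('FixedIP')


class FixedIP(Base):
    __tablename__ = 'fixed_ips'
    id = Column(Integer, primary_key=True)
    port_id = Column(Integer, ForeignKey('ports.id'))
    address = Column(String)


engine = create_engine('sqlite://', echo=True)  # echo prints every SQL query
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Port(security_group_id='sg-1') for _ in range(3)])
    session.commit()

    ports = session.query(Port).filter(
        Port.security_group_id.in_(['sg-1'])).all()  # 1 query
    for port in ports:
        _ = port.fixed_ips  # +1 query per port: the N+1 pattern
```

With ~3000 ports and a few lazily loaded relationships each, one RPC request fans out into thousands of SELECTs.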
My suggestion is to change the place where this RPC request is built so that, instead of sending multiple security group IDs in one request, the agent sends them one by one (sketched below).
While this will not improve overall performance, it will reduce the time taken by each individual RPC request, distribute the load between neutron-server instances better, and avoid RPC timeouts.
Also, this is an easy fix and does not require changes to internal RPC APIs.
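In sketch form, against a hypothetical RPC client (the actual change is in the review linked below):

```python
def get_ports_for_security_groups(rpc, sg_ids):
    """Fetch ports per security group instead of all at once (sketch)."""
    # Before: one heavy request carrying every security group id, e.g.
    #   rpc.bulk_pull('Port', filters={'security_group_ids': sg_ids})
    #
    # After: one small request per id. Total work is unchanged, but each
    # RPC finishes well within its timeout, and the requests can spread
    # across neutron-server workers.
    ports = []
    for sg_id in sg_ids:
        ports.extend(
            rpc.bulk_pull('Port', filters={'security_group_ids': [sg_id]}))
    return ports
```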
Changed in neutron:
assignee: nobody → Mitya Eremeev (mitos)
status: New → In Progress
description: updated
tags: added: loadimpact

Changed in neutron:
importance: Undecided → Medium
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/802596