Timeout in RPC method get_devices_details_list_and_failed_devices cannot be fixed by increasing the timeout to infinity

Bug #1735427 reported by Saverio Proto
Affects: neutron
Status: Incomplete
Importance: High
Assigned to: Saverio Proto
Milestone: (none)

Bug Description

We are running a large production public cloud on OpenStack Newton. However, looking at the code, what I describe here should also affect master.

In neutron.conf we have:

rpc_response_timeout=240

we often hit a Timeout in the RPC method get_devices_details_list_and_failed_devices, no matter how large we set the RPC timeout.

The problem is that when calling this function:
https://github.com/openstack/neutron/blob/5fc8e47786c91f76d253010b194bd5637de895b8/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1514

the 'devices' argument can be arbitrarily large, generating huge RPC calls and huge database queries.

As an operator, I would propose paginating the list to break it into smaller RPC calls.

Please note that `update_device_list` is also unpaginated.
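The proposed pagination could look roughly like the sketch below. This is only an illustration of the idea, not the actual neutron code: the chunk size, the `plugin_rpc` client name, and the `get_details_paginated` wrapper are all assumptions made for the example.

```python
def chunked(devices, chunk_size=50):
    """Yield successive fixed-size slices of the device list."""
    for i in range(0, len(devices), chunk_size):
        yield devices[i:i + chunk_size]


def get_details_paginated(plugin_rpc, context, devices, agent_id, host):
    """Hypothetical wrapper: issue one small RPC per chunk instead of a
    single unbounded call, then merge the partial results."""
    details, failed = [], []
    for chunk in chunked(devices):
        # Each RPC now carries at most chunk_size devices, so each call
        # stays well under rpc_response_timeout.
        res = plugin_rpc.get_devices_details_list_and_failed_devices(
            context, devices=chunk, agent_id=agent_id, host=host)
        details.extend(res.get('devices', []))
        failed.extend(res.get('failed_devices', []))
    return {'devices': details, 'failed_devices': failed}
```

The same chunking approach would apply to `update_device_list`.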

Revision history for this message
Brian Haley (brian-haley) wrote :

This looks similar to fetch_and_sync_all_routers() in the l3-agent, which we "chunkified" to make smaller RPC queries. That code is a little different, but it can probably be used as a reference, since that change definitely helped.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/524222

Changed in neutron:
assignee: nobody → Saverio Proto (zioproto)
status: New → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Before making recommendations, I'd like to understand a bit more about the scale involved here. I recall that Rossella did a bunch of work to address scalability issues on the compute nodes. How many devices are on the node that pushes the system over the edge?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Looks similar to bug 1430999

Changed in neutron:
importance: Undecided → High
status: In Progress → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'd like to understand better the nature of the failure. It may be caused by different faults or errors, so lack of paging is only one possible cause.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

When I said 'compute', what I meant was the OVS agent, which can run on more than just the compute nodes.

Revision history for this message
Saverio Proto (zioproto) wrote :

Hello Armando,

the problem happens on the network nodes, where the number of interfaces is very high: about 1000.

Here is the detail about the kinds of interfaces I see in my deployment:

proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c qr-
321
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c qg-
245
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c tap
540
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | wc
   1109 1109 16621
proto@network-1:~$

We have more network nodes, and they all show similar numbers.

thank you

Revision history for this message
Saverio Proto (zioproto) wrote :

Hello Armando,

on a network node with 170 interfaces in my staging OpenStack deployment, I have done the following test.

I added LOG.warnings in the following points:

https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L149
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L153
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L163

It takes 20 seconds to process the function get_devices_details_list_and_failed_devices.

With 1000 interfaces I would then expect at least 100 seconds.

Consider that the staging system has no users and the load is practically non existent. In production the neutron server is much more busy.

I identified the slow function to be get_device_details:
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L61

When the function runs all the way to the return at the very end, on my staging system this single call always takes at least 300 ms, and up to 600 ms in the worst case.

Could this be the bottleneck?
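A back-of-the-envelope check makes the bug title plausible: at the measured per-call latencies, a production-sized device list would exceed the configured timeout no matter what. This sketch only combines the numbers quoted above (per-call latency from staging, interface count from the production network nodes, rpc_response_timeout from neutron.conf); it assumes the calls are processed serially.

```python
# Measured per-call latency of get_device_details on staging (seconds).
per_call_min, per_call_max = 0.3, 0.6

# Interface count on a production network node (from the ovs-vsctl output).
production_devices = 1000

# Configured RPC timeout from neutron.conf.
rpc_response_timeout = 240

# Best and worst case total time if get_device_details runs once per device.
best_case = production_devices * per_call_min   # 300.0 s
worst_case = production_devices * per_call_max  # 600.0 s

# Even the best case exceeds the 240 s timeout, so raising the timeout
# further cannot fix it; only shrinking the per-call workload can.
print(best_case > rpc_response_timeout)  # True
```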

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

What type of routers are in use here? A similar scalability issue with the RPC timeout was fixed for DVR routers in the last cycle.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Saverio and I looked into this and pinpointed a potential fix that could address his issue:

https://review.openstack.org/#/c/434682/

@Swami, the one you're referring to (https://review.openstack.org/#/c/464904/) is another one, but also still one that Saverio could benefit from :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Saverio Proto (<email address hidden>) on branch: master
Review: https://review.openstack.org/524222
Reason: Already fixed in Ocata, look at bug https://bugs.launchpad.net/neutron/+bug/1665215
