Timeout in RPC method get_devices_details_list_and_failed_devices cannot be fixed by increasing the timeout to infinity

Bug #1735427 reported by Saverio Proto
Affects: neutron
Status: Incomplete
Importance: High
Assigned to: Saverio Proto
Milestone: (none)

Bug Description

We are running a large production public cloud on OpenStack Newton. However, looking at the code, what I describe here should also affect master.

In neutron.conf we have:

rpc_response_timeout=240

we often hit a Timeout in the RPC method get_devices_details_list_and_failed_devices, no matter how large we set the RPC timeout.

The problem is that when calling this function:
https://github.com/openstack/neutron/blob/5fc8e47786c91f76d253010b194bd5637de895b8/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1514

the 'devices' argument can be arbitrarily large, generating huge RPC calls and huge database queries.

As an operator, I would propose paginating the list to break it into smaller RPC calls.

Please note that `update_device_list` is also unpaginated.
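The proposed pagination could look roughly like the sketch below. This is only an illustration of the idea, not the actual neutron code: the chunk size, the `plugin_rpc` client name, and the `get_details_paginated` wrapper are all assumptions made for the example.

```python
def chunked(devices, chunk_size=50):
    """Yield successive fixed-size slices of the device list."""
    for i in range(0, len(devices), chunk_size):
        yield devices[i:i + chunk_size]


def get_details_paginated(plugin_rpc, context, devices, agent_id, host):
    """Hypothetical wrapper: issue one small RPC per chunk instead of a
    single unbounded call, then merge the partial results."""
    details, failed = [], []
    for chunk in chunked(devices):
        # Each RPC now carries at most chunk_size devices, so each call
        # stays well under rpc_response_timeout.
        res = plugin_rpc.get_devices_details_list_and_failed_devices(
            context, devices=chunk, agent_id=agent_id, host=host)
        details.extend(res.get('devices', []))
        failed.extend(res.get('failed_devices', []))
    return {'devices': details, 'failed_devices': failed}
```

The same chunking approach would apply to `update_device_list`.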

Revision history for this message
Brian Haley (brian-haley) wrote :

This looks similar to fetch_and_sync_all_routers() in the l3-agent, which we "chunkified" to make smaller RPC queries. That code is a little different, but it can probably be used as a reference, since that change definitely helped.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/524222

Changed in neutron:
assignee: nobody → Saverio Proto (zioproto)
status: New → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Before making recommendations, I'd like to understand a bit more about the scale involved here. I recall that Rossella did a bunch of work to address scalability issues on the compute nodes. How many devices are on the node that pushes the system over the edge?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Looks similar to bug 1430999

Changed in neutron:
importance: Undecided → High
status: In Progress → Incomplete
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'd like to understand better the nature of the failure. It may be caused by different faults or errors, so lack of paging is only one possible cause.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

When I said 'compute', what I meant was the OVS agent, which can run on more than just the compute nodes.

Revision history for this message
Saverio Proto (zioproto) wrote :

Hello Armando,

the problem happens on the network nodes, where the number of interfaces is very high: about 1000.

Here is the detail about the kinds of interfaces I see in my deployment:

proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c qr-
321
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c qg-
245
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | grep -c tap
540
proto@network-1:~$ sudo ovs-vsctl list-ifaces br-int | wc
   1109 1109 16621
proto@network-1:~$

We have more network nodes, and they all show similar numbers.

thank you

Revision history for this message
Saverio Proto (zioproto) wrote :

Hello Armando,

on a network node with 170 interfaces in my staging OpenStack deployment, I have done the following test.

I added LOG.warnings in the following points:

https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L149
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L153
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L163

It takes 20 seconds to process the function get_devices_details_list_and_failed_devices.

With 1000 interfaces I would then expect at least 100 seconds.

Consider that the staging system has no users and the load is practically non existent. In production the neutron server is much more busy.

I identified the slow function to be get_device_details:
https://github.com/openstack/neutron/blob/newton-eol/neutron/plugins/ml2/rpc.py#L61

When the function runs all the way to the return at the very end, on my staging system this single call always takes at least 300 ms, and up to 600 ms in the worst case.

Could this be the bottleneck?
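A back-of-the-envelope check makes the bug title plausible: at the measured per-call latencies, a production-sized device list would exceed the configured timeout no matter what. This sketch only combines the numbers quoted above (per-call latency from staging, interface count from the production network nodes, rpc_response_timeout from neutron.conf); it assumes the calls are processed serially.

```python
# Measured per-call latency of get_device_details on staging (seconds).
per_call_min, per_call_max = 0.3, 0.6

# Interface count on a production network node (from the ovs-vsctl output).
production_devices = 1000

# Configured RPC timeout from neutron.conf.
rpc_response_timeout = 240

# Best and worst case total time if get_device_details runs once per device.
best_case = production_devices * per_call_min   # 300.0 s
worst_case = production_devices * per_call_max  # 600.0 s

# Even the best case exceeds the 240 s timeout, so raising the timeout
# further cannot fix it; only shrinking the per-call workload can.
print(best_case > rpc_response_timeout)  # True
```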

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

What type of routers are in use here? A similar scalability issue with the RPC timeout was fixed for DVR routers in the last cycle.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Saverio and I looked into this and pinpointed a potential fix that could address his issue:

https://review.openstack.org/#/c/434682/

@Swami, the one you're referring to (https://review.openstack.org/#/c/464904/) is another one, but also still one that Saverio could benefit from :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Saverio Proto (<email address hidden>) on branch: master
Review: https://review.openstack.org/524222
Reason: Already fixed in Ocata, look at bug https://bugs.launchpad.net/neutron/+bug/1665215
