Timeouts in update_device_list (too slow with large # of VIFs)

Bug #1528895 reported by Mohammed Naser
This bug affects 5 people
Affects: neutron
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

In our environment, we have some large compute nodes with a large number of VIFs. When the update_device_list call happens on agent startup:

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L842

This takes a very long time, as the server side seems to loop over each port, contact Nova, and more. The default RPC timeout of 60 seconds is not enough, and the call ends up failing on a server with around 120 VIFs. Raising the timeout to 120 seconds makes it work with no problems.
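
For reference, the workaround amounts to raising the standard rpc_response_timeout option in the agent's neutron.conf; 120 is simply the value that worked in this environment, not a recommended default:

[DEFAULT]
# Give the bulk update_device_list call more time before MessagingTimeout is raised.
# 120 seconds is what worked here; tune for your deployment.
rpc_response_timeout = 120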

2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-1e6cc46d-eb52-4d99-bd77-bf2e8424a1ea - - - - -] Error while processing VIF ports
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1752, in rpc_loop
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent ovs_restarted)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1507, in process_network_ports
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self._bind_devices(need_binding_devices)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 847, in _bind_devices
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.conf.host)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 179, in update_device_list
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent agent_id=agent_id, host=host)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent timeout=timeout, retry=retry)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent result = self._waiter.wait(msg_id, timeout)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = self.waiters.get(msg_id, timeout=timeout)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 'to message ID %s' % msg_id)
2015-12-23 15:27:27.373 38588 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID c42c1ffc801b41ca89aa4472696bbf1a

I don't think an RPC call should ever take that long. The neutron-server is not under load or anything, and adding more servers doesn't seem to resolve it, because a single RPC responder answers this call.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

What is on the server side?
Do you have logs from the server?

Revision history for this message
Akihiro Motoki (amotoki) wrote :

In addition to Eugene's question, how many ports exist in total in your neutron deployment?
As a first impression, I think 60 VIFs on one compute node is not that many, and the load on the neutron-server side affects this more.

tags: added: loadimpact
Revision history for this message
Mohammed Naser (mnaser) wrote :

Hi everyone,

Sorry, there were actually 121 VIFs on this server, not 60. In addition, the neutron-server logs contain a large amount of output, along the lines of the following, repeated many times (I think once for every port that goes up). I think it also notifies Nova, so 121 notifications have to be sent out to Nova:

2015-12-24 15:47:46.239 33607 DEBUG neutron.plugins.ml2.db [req-03503e77-55fd-4e20-b84d-c7c0c6e18a2d - - - - -] For port 66c98583-3c64-4029-b4bc-163f24cdf364, host compute-72-ca-ymq-1, got binding levels [<neutron.plugins.ml2.models.PortBindingLevel[object at 7f9f5f4c1e10] {port_id=u'66c98583-3c64-4029-b4bc-163f24cdf364', host=u'compute-72-ca-ymq-1', level=0, driver=u'openvswitch', segment_id=u'514310a8-1526-46f6-b1a1-25e4427abeba'}>] get_binding_levels /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/db.py:177
2015-12-24 15:47:46.257 33607 DEBUG neutron.callbacks.manager [req-03503e77-55fd-4e20-b84d-c7c0c6e18a2d - - - - -] Notify callbacks for port, after_update _notify_loop /usr/lib/python2.7/dist-packages/neutron/callbacks/manager.py:133
2015-12-24 15:47:46.258 33607 DEBUG neutron.callbacks.manager [req-03503e77-55fd-4e20-b84d-c7c0c6e18a2d - - - - -] Calling callback neutron.db.l3_dvrscheduler_db._notify_l3_agent_port_update _notify_loop /usr/lib/python2.7/dist-packages/neutron/callbacks/manager.py:140
2015-12-24 15:47:46.258 33607 DEBUG neutron.db.l3_dvrscheduler_db [req-03503e77-55fd-4e20-b84d-c7c0c6e18a2d - - - - -] Received port after_update _notify_l3_agent_new_port /usr/lib/python2.7/dist-packages/neutron/db/l3_dvrscheduler_db.py:486

Revision history for this message
Eugene (baribalbear) wrote :

Seeing the same on the latest Liberty RDO release, with a setup migrated from legacy to DVR. On a compute node with 99 VIFs I often see:

2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-07cf91da-9989-4c68-8ab0-9739b0e10ca3 - - - - -] Error while processing VIF ports
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1751, in rpc_loop
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent ovs_restarted)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1506, in process_network_ports
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self._bind_devices(need_binding_devices)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 847, in _bind_devices
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.conf.host)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 179, in update_device_list
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent agent_id=agent_id, host=host)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent timeout=timeout, retry=retry)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
2016-01-01 14:16:00.016 9236 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent result = self._waiter.wait(msg_id, timeout)
2016-01-01 14:16:00.016 9236 ERROR neutron...


Revision history for this message
Eugene (baribalbear) wrote :
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

In my test environment, an ovs agent restart with 250 VMs on a node can always reproduce this bug.

This bug should have been present since the introduction of update_device_list, i.e. commit ced2e63cb8.
Prior to that, neutron processed update_device_up one port at a time, each in its own RPC call.

I think this bug makes the ovs agent unusable in large deployments. We should do one of the following:

1. revert the patch
2. increase the RPC timeout a lot
3. issue update_device_list in smaller chunks, say 10 ports at a time (see the sketch below)
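
For illustration, option 3 could look roughly like the sketch below. The chunk size, the helper names, and the way partial results are merged are assumptions made for this sketch, not the actual patch proposed later:

CHUNK_SIZE = 10


def _chunks(devices, size=CHUNK_SIZE):
    # Yield successive slices of at most `size` devices.
    for i in range(0, len(devices), size):
        yield devices[i:i + size]


def update_device_list_chunked(plugin_rpc, context, devices_up, devices_down,
                               agent_id, host):
    # Replace one big update_device_list RPC (which can exceed
    # rpc_response_timeout) with several small ones, merging the partial
    # results into one dict of the same shape the bulk call is expected
    # to return.
    result = {'devices_up': [], 'failed_devices_up': [],
              'devices_down': [], 'failed_devices_down': []}
    for chunk in _chunks(list(devices_up)):
        res = plugin_rpc.update_device_list(context, chunk, [],
                                            agent_id, host)
        for key in result:
            result[key].extend(res.get(key, []))
    for chunk in _chunks(list(devices_down)):
        res = plugin_rpc.update_device_list(context, [], chunk,
                                            agent_id, host)
        for key in result:
            result[key].extend(res.get(key, []))
    return result

Each individual RPC then only has to process at most CHUNK_SIZE ports, so it stays well within the timeout even on hosts with hundreds of VIFs.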

tags: added: ovs
Revision history for this message
Assaf Muller (amuller) wrote :

This is a show stopper regression for existing setups.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/277279

Changed in neutron:
assignee: nobody → IWAMOTO Toshihiro (iwamoto)
status: Confirmed → In Progress
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

Some updates to my comment #6:

I should have said "high density deployments" rather than "large deployments".

Today, for reasons I don't want to dig into, I needed 300 VMs (ports) to trigger the timeout bug.

There's another bulk RPC call, introduced in commit 88638e6: get_devices_details_list_and_failed_devices seems to take about half as long as update_device_list, and might cause trouble some day.
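
If that call ever does become a problem, the same chunking idea could presumably be applied to it. A rough sketch, assuming the agent-side signature get_devices_details_list_and_failed_devices(context, devices, agent_id, host) and a result dict with 'devices' and 'failed_devices' keys (both assumptions here):

def get_details_chunked(plugin_rpc, context, devices, agent_id, host,
                        chunk_size=10):
    # Fetch device details in small batches rather than in one bulk RPC.
    details = {'devices': [], 'failed_devices': []}
    devices = list(devices)
    for i in range(0, len(devices), chunk_size):
        res = plugin_rpc.get_devices_details_list_and_failed_devices(
            context, devices[i:i + chunk_size], agent_id, host)
        details['devices'].extend(res.get('devices', []))
        details['failed_devices'].extend(res.get('failed_devices', []))
    return details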

Revision history for this message
Yuli Deng (yuli-deng) wrote :

Hi,
I'm facing the same bug with Liberty on Ubuntu 14.04 when restarting the ovs agent on a standalone neutron node with about 70 subnets.
As a result, all qdhcp devices are stuck in 'Build' status and no VM can get an IP.

I have increased the RPC timeout in the neutron conf file as a workaround for now.

Changed in neutron:
milestone: none → mitaka-rc1
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

How is this of High importance if it can be worked around by tuning the RPC timeout?

Revision history for this message
Mohammed Naser (mnaser) wrote :

Because the workaround can cause many other problems, including the fact that any failures in the system won't be detected for 10 minutes or so.

An RPC call should be quick and responsive, with a (somewhat) consistent response time. This call gets slower and slower as the number of ports on the server grows.

Changed in neutron:
milestone: mitaka-rc1 → newton-1
tags: added: mitaka-rc-potential
tags: added: mitaka-backport-potential
removed: mitaka-rc-potential
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

I commented on the patch set above a while ago. A 4% slowdown is better than the unintended consequences of a very long timeout, though I understand the sentiment against reverting the batched RPC optimization.

Patch Set 3:

I've checked how much this patch slows down agent restarts using https://review.openstack.org/267905 .

An agent restart with 80 VMs takes about 25 seconds in my environment, and this patch seems to slow a restart down by 0.5 to 1 second. To me, such a slowdown is a fair price to pay, as this patch fixes a bug.

Revision history for this message
Eugene (baribalbear) wrote :

Any news regarding this ticket? Are there any plans to backport it to Liberty? Thanks in advance.

tags: added: scale
Revision history for this message
Eugene (baribalbear) wrote :

For me, using a DVR setup with ML2 and the openvswitch agent, this issue seems to be solved by the Neutron 7.0.3 or 7.0.4 updates. Right now the time between "Preparing filters for devices set" and "Configuration for devices up and devices down [] completed" in openvswitch-agent.log is less than 60 seconds.

Changed in neutron:
milestone: newton-1 → newton-2
Changed in neutron:
milestone: newton-2 → newton-3
Changed in neutron:
milestone: newton-3 → newton-rc1
Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

As push-notifications didn't arrive to save the world,
do you want https://review.openstack.org/#/c/277279/ as an interim fix?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Probably not, we'll have to limp along another release.

Changed in neutron:
milestone: newton-rc1 → ocata-1
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

We won't be able to backport the push-notifications bits to Mitaka, hence I removed the mitaka-backport-potential tag.

tags: removed: mitaka-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/319989
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Similar to bug 1430999

Changed in neutron:
status: In Progress → Incomplete
assignee: IWAMOTO Toshihiro (iwamoto) → nobody
milestone: ocata-1 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/277279
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

For the record, this bug was fixed by adopting RPC callbacks / OVO / push-notifications in the OVS agent to fetch ports, starting from Pike. The bug is still present in Ocata, and we can't backport push-notifications there. It's a legitimate bug, though I am not sure we will be able to fix it in Ocata at this point. A workaround is probably to bump the RPC timeout and hope it helps.

Revision history for this message
Feodor Tersin (ftersin) wrote :

It takes about 5 minutes with Pike. That is faster than with Ocata (7 minutes), but still not good enough for production.
Setup: 3 DVR HA agents; 140 routers, 8 instances per router, 1500 ports in the DB.
