Fatal memory consumption by neutron-server with DVR at scale

Bug #1505575 reported by Oleg Bondarev
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Oleg Bondarev

Bug Description

Steps to reproduce:
0. The issue is noticeable at scale (100+ nodes); DVR must be turned on
1. Run rally scenario NeutronNetworks.create_and_list_routers

Initially, neutron-server processes consume 100-150M, but at some point their size rapidly grows several times over. (At 200 nodes the rise was from 150M to 2G, and up to 14G in the end.)

The issue may lead to an OOM situation, causing the kernel to kill the process with the highest memory consumption. The usual candidates are rabbit or mysql. This makes the cluster completely inoperable.

Oleg Bondarev (obondarev) wrote :

The issue occurs when one (or all) of the L3 agents on the controllers starts to resync. That may happen due to an agent restart or an exception (such as a messaging exception). On resync, the agent requests full info for all routers scheduled to it, which can be a very large amount of data when there are many routers (as in the create_and_list_routers rally scenario). This leads to 'Serious memory consumption by neutron-server'.

Usually the server fails to complete the request within 60 seconds, which leads to a timeout on the agent side, and the agent sends yet another sync_routers() request. This loop continues until the server consumes all available memory and the cluster fails.

The idea of the fix is to request router info in chunks of a configured size.
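
A minimal sketch of that chunking idea, not the change that was actually proposed or merged. All names here (iter_chunks, sync_routers_in_chunks, fetch_routers, process_router, chunk_size) are hypothetical and are not neutron's RPC API; the sketch also assumes the agent already knows the IDs of the routers scheduled to it, so that each request can name a bounded subset.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Router = Dict[str, object]


def iter_chunks(ids: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `ids`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def sync_routers_in_chunks(
    router_ids: Sequence[str],
    fetch_routers: Callable[[Sequence[str]], List[Router]],
    process_router: Callable[[Router], None],
    chunk_size: int = 256,  # hypothetical default, stands in for a config option
) -> int:
    """Fetch and process routers one chunk at a time.

    `fetch_routers` stands in for the agent-to-server call. Because it only
    ever receives `chunk_size` IDs, the server builds a bounded response per
    request instead of one response covering every router.
    """
    processed = 0
    for chunk in iter_chunks(router_ids, chunk_size):
        for router in fetch_routers(chunk):
            process_router(router)
            processed += 1
    return processed
```

With this shape, a single slow or failed request only has to be retried for one chunk rather than for the full router set, which is what keeps a timeout from triggering another full-size request.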

summary: - Serious memory consumption by neutron-server with DVR at scale
+ Fatal memory consumption by neutron-server with DVR at scale
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/234067

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → High
tags: added: kilo-backport-potential l3-dvr-backlog liberty-rc-potential loadimpact
removed: scale
Akihiro Motoki (amotoki)
tags: added: liberty-backport-potential
removed: liberty-rc-potential
Miguel Angel Ajo (mangelajo) wrote :

Similar to this one, but I think they aren't duplicates: https://bugs.launchpad.net/neutron/+bug/1516260

Oleg Bondarev (obondarev) wrote :

Memory consumption was fixed by https://review.openstack.org/#/c/214974/, hence closing the bug.

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released
tags: removed: kilo-backport-potential liberty-backport-potential