Serious memory consumption by neutron-server with DVR at scale

Bug #1497219 reported by Ilya Shakhat
This bug affects 3 people
Affects             Status        Importance  Assigned to    Milestone
Mirantis OpenStack  Fix Released  High        Oleg Bondarev
7.0.x               Fix Released  Critical    Oleg Bondarev
8.0.x               Fix Released  High        Oleg Bondarev

Bug Description

Upstream bug: https://bugs.launchpad.net/neutron/+bug/1505575
Related upstream bug: https://bugs.launchpad.net/neutron/+bug/1489671

Steps to reproduce:
0. The issue is noticeable at scale (100+ nodes); DVR must be turned on
1. Run rally scenario NeutronNetworks.create_and_list_routers

Initially, neutron-server processes consume 100-150M, but at some point their size rapidly grows several-fold. (At 200 nodes it rose from 150M to 2G, and up to 14G in the end.)

The issue may lead to an OOM situation in which the kernel kills the process with the highest memory consumption. The usual candidates are rabbit or mysql.

Ilya Shakhat (shakhat)
tags: added: neutron scale
Changed in mos:
assignee: nobody → Oleg Bondarev (obondarev)
milestone: none → 7.0-updates
Revision history for this message
Ilya Shakhat (shakhat) wrote :

strace analysis shows that the issue is caused by a single SQL query:

SELECT routerports.router_id AS routerports_router_id, routerports.port_id AS routerports_port_id, routerports.port_type AS routerports_port_type, ipallocations_1.port_id AS ipallocations_1_port_id, ipallocations_1.ip_address AS ipallocations_1_ip_address, ipallocations_1.subnet_id AS ipallocations_1_subnet_id, ipallocations_1.network_id AS ipallocations_1_network_id, ports_1.tenant_id AS por....

The query has a parameter list of ~1000 items.

The response size is about 2G. After reading the response, the process starts calling mmap() to obtain more memory.
Overall, the processing takes several minutes.

The Neutron function that matches the query pattern is l3_db.get_sync_interfaces().

tags: added: dvr
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Most probably the issue happens when one (or all) of the l3 agents on the controllers starts to resync. That may happen due to an agent restart or some exception (such as a messaging exception). On resync, the agent requests full info for all routers scheduled to it, and this can be a very large amount of data if there are a lot of routers (as in the create_and_list_routers rally scenario). This leads to 'Serious memory consumption by neutron-server'.

The idea of the fix is to request router info in chunks (for example, no more than 200 routers at a time).
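
A minimal sketch of the chunking idea, not the actual Neutron patch. The names `iter_chunks`, `fetch_routers_chunked`, `SYNC_CHUNK_SIZE` and the `fetch` callback are hypothetical; `fetch` stands in for whatever RPC call returns router info for a list of router ids. The chunk size of 200 is the example figure from the comment above.

```python
from typing import Callable, Dict, Iterator, List, Sequence

# Hypothetical constant; 200 is only the example figure from the comment.
SYNC_CHUNK_SIZE = 200


def iter_chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_routers_chunked(
    router_ids: Sequence[str],
    fetch: Callable[[List[str]], List[Dict]],
    chunk_size: int = SYNC_CHUNK_SIZE,
) -> List[Dict]:
    """Request router info chunk by chunk instead of in one large call.

    `fetch` is an assumed stand-in for the server call; each invocation
    receives at most `chunk_size` router ids, which bounds the size of
    any single query and response.
    """
    routers: List[Dict] = []
    for chunk in iter_chunks(router_ids, chunk_size):
        routers.extend(fetch(list(chunk)))
    return routers
```

With 450 router ids and the default chunk size, `fetch` would be called three times, with 200, 200 and 50 ids. The total amount of data is unchanged, but no single request has to build the full response in memory.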

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

I would then question 'critical' status of this "bug".

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/12608

description: updated
description: updated
Revision history for this message
Alexander Ignatov (aignatov) wrote :

From one of Kevin's emails:

This is a simple fix that significantly improves the performance of the l3 get sync interfaces RPC call.

https://github.com/openstack/neutron/commit/7dbaa12bd8948653ac0ed90c0132e931aac8b42b

The previous join criteria made it fetch every router interface in the DB.
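
The effect described above can be illustrated with a small, self-contained example. This is a hypothetical sketch using sqlite3 and a simplified two-table schema, not the real Neutron models or the actual query from the commit. It only shows the general class of mistake: when the filter is applied to a column of another table that is not tied to `routerports` by a join condition, every router port is returned, whereas filtering on `routerports.router_id` returns only the requested ones. The function names are invented for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory DB with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE routers (id TEXT PRIMARY KEY);
        CREATE TABLE routerports (router_id TEXT, port_id TEXT);
        INSERT INTO routers VALUES ('r1'), ('r2'), ('r3');
        INSERT INTO routerports VALUES ('r1', 'p1'), ('r2', 'p2'), ('r3', 'p3');
        """
    )
    return conn


def ports_filtered_on_other_table(conn, router_ids):
    """Filter on routers.id with no join condition: matches every port."""
    marks = ",".join("?" * len(router_ids))
    sql = (
        "SELECT DISTINCT rp.router_id, rp.port_id "
        "FROM routerports rp, routers r "
        f"WHERE r.id IN ({marks})"
    )
    return sorted(conn.execute(sql, list(router_ids)).fetchall())


def ports_filtered_on_router_id(conn, router_ids):
    """Filter on routerports.router_id: only the requested routers' ports."""
    marks = ",".join("?" * len(router_ids))
    sql = (
        "SELECT rp.router_id, rp.port_id FROM routerports rp "
        f"WHERE rp.router_id IN ({marks})"
    )
    return sorted(conn.execute(sql, list(router_ids)).fetchall())
```

Asking for `['r1']` with the first function returns the ports of all three routers, while the second returns only `('r1', 'p1')`. At scale, the first pattern makes the response size depend on the total number of router ports in the database, not on the number of routers requested.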

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/12772

tags: added: 70mu1-confirmed
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/12772
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 217562433172ade7e98ca2acd41449af7a546520
Author: Oleg Bondarev <email address hidden>
Date: Tue Oct 13 10:10:48 2015

Changed filter field to router_id

The get_sync_interfaces query will always return all router ports
from database even it is supposed to query specific ones that
belong to a certain router. In large L3 scale environment with
number of route ports in place, this would lag the response time
for adding router interface and router L3 agent binding.

Upstream review: https://review.openstack.org/220787

Related-Bug: #1497219
Change-Id: Ib78ca766f91783ad2ecca5b728c31602b4ed15d8

Revision history for this message
Alexander Ignatov (aignatov) wrote :

The related fix has merged and, with high probability, fixes this issue. Keeping this bug open until it is decided which change closes it.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Setting to Fix Committed per feedback from Oleg Bondarev

tags: removed: 70mu1-confirmed
Revision history for this message
Alexander Ignatov (aignatov) wrote :

Fixed by https://review.openstack.org/#/c/214974/ in stable/liberty

Revision history for this message
Michael Semenov (msemenov) wrote :

Verified on 7.0-MU1.
For 8.0 this bug needs to be verified with Neutron team.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Change abandoned by Oleg Bondarev <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/12608

tags: added: area-neutron
removed: neutron
Revision history for this message
Mikhail Chernik (mchernik) wrote :

Verified on MOS 8.0, ISO 482: fixed. During the NeutronNetworks.create_and_list_routers scenario run, RSIZE never exceeded 160M and VSIZE never exceeded 450M.
