Idle rpc traffic with a large number of instances causes failures

Bug #1384660 reported by James Page

Affects                 Status         Importance   Assigned to   Milestone
Ubuntu Cloud Archive    Fix Released   Undecided    Unassigned
neutron                 Fix Released   Undecided    Unassigned

Bug Description

OpenStack Juno (Neutron ML2+OVS/l2pop/neutron security groups), Ubuntu 14.04

500 compute node cloud, running 4.5k active instances (can't get it any further right now).

As the number of instances in the cloud increases, the idle load on the neutron-server hosts (four of them, each with 4 cores/8 threads and a suitable *_workers configuration) climbs from near zero to around 30. At that point the DB call get_port_and_sgs is being serviced around 10 times per second on each server. Other things are also happening; I've attached the last 1000 lines of the server log with debug enabled.
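
For anyone wanting to gauge the call rate from a debug log, a quick way is to bucket matching lines by second (a minimal sketch, assuming the default oslo.log "YYYY-MM-DD HH:MM:SS.mmm ..." timestamp prefix; the log path is passed as the first argument and is not assumed to be any particular file):

import sys
from collections import Counter

# Count occurrences of a marker string (default: get_port_and_sgs) per second
# in a neutron-server debug log, assuming each line starts with the default
# oslo.log timestamp prefix.
def call_rate(path, needle="get_port_and_sgs"):
    per_second = Counter()
    with open(path) as fh:
        for line in fh:
            if needle in line:
                per_second[line[:19]] += 1  # bucket by the seconds part of the timestamp
    return per_second

if __name__ == "__main__":
    rate = call_rate(sys.argv[1])
    if rate:
        print("peak calls/sec: %d" % max(rate.values()))
        print("mean calls/sec: %.1f" % (sum(rate.values()) / float(len(rate))))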

The result is that it's no longer possible to create new instances: the RPC and API threads just don't get onto the CPU, which leads to VIF plugging timeouts on the compute nodes and ERROR'ed instances.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: neutron-common 1:2014.2-0ubuntu1~cloud0 [origin: Canonical]
ProcVersionSignature: User Name 3.13.0-35.62-generic 3.13.11.6
Uname: Linux 3.13.0-35-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.5
Architecture: amd64
CrashDB:
 {
                "impl": "launchpad",
                "project": "cloud-archive",
                "bug_pattern_url": "http://people.canonical.com/~ubuntu-archive/bugpatterns/bugpatterns.xml",
             }
Date: Thu Oct 23 10:22:14 2014
PackageArchitecture: all
SourcePackage: neutron
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.neutron.api.paste.ini: [deleted]
modified.conffile..etc.neutron.fwaas.driver.ini: [deleted]
modified.conffile..etc.neutron.l3.agent.ini: [deleted]
modified.conffile..etc.neutron.neutron.conf: [deleted]
modified.conffile..etc.neutron.policy.json: [deleted]
modified.conffile..etc.neutron.rootwrap.conf: [deleted]
modified.conffile..etc.neutron.rootwrap.d.debug.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.ipset.firewall.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.iptables.firewall.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.l3.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.vpnaas.filters: [deleted]
modified.conffile..etc.neutron.vpn.agent.ini: [deleted]
modified.conffile..etc.sudoers.d.neutron.sudoers: [deleted]

Revision history for this message
James Page (james-page) wrote :

Some other details: l2population driver and neutron security groups are enabled.
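
For reference, the relevant driver settings are along these lines (a sketch of the standard Juno ML2/OVS configuration, not a verbatim copy of the deployed files):

/etc/neutron/plugins/ml2/ml2_conf.ini on the neutron-server hosts:

[ml2]
mechanism_drivers = openvswitch,l2population

[securitygroup]
enable_security_group = True
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

and on the compute-node OVS agents:

[agent]
l2_population = True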

description: updated
tags: added: scale-testing
removed: scale-test
summary: - Idle rpc traffic with a large number of instances causes failures
+ Idle rpc traffic (get_port_and_sgs) with a large number of instances
+ causes failures
Revision history for this message
James Page (james-page) wrote :
summary: - Idle rpc traffic (get_port_and_sgs) with a large number of instances
- causes failures
+ Idle rpc traffic with a large number of instances causes failures
description: updated
Revision history for this message
James Page (james-page) wrote :

I see:

2014-10-23 10:51:28.325 15781 TRACE neutron.notifiers.nova ConnectionError: HTTPConnectionPool(host='athh4.maas', port=8774): Max retries exceeded with url: /v2/3ff63e9bdfcf4c93a4c2804feb4406e3/os-server-external-events (Caused by <class 'socket.gaierror'>: (-3, 'Lookup timed out'))

as symptomatic of this problem.
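
The gaierror suggests the name lookup itself is being starved rather than nova being unreachable. A self-contained way to check resolution latency from a neutron-server host (a rough sketch; athh4.maas is just the host from the traceback above, substitute the local nova endpoint):

import socket
import time

# Time a name resolution for the nova API endpoint.  Very slow lookups or a
# socket.gaierror here would match the "Lookup timed out" failure seen in the
# nova notifier traceback.
def time_lookup(host, port=8774):
    start = time.time()
    try:
        socket.getaddrinfo(host, port)
        return time.time() - start, None
    except socket.gaierror as exc:
        return time.time() - start, exc

if __name__ == "__main__":
    elapsed, err = time_lookup("athh4.maas")
    print("lookup took %.2fs, error: %s" % (elapsed, err))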

Revision history for this message
James Page (james-page) wrote :

I also see a fairly continuous stream of add_fdb_entries messages on the hypervisor agents.

Revision history for this message
James Page (james-page) wrote :

I can't rule out bug 1384109 as a contributing cause of this problem; the failing postcommit could well have negative effects in this area.

Revision history for this message
James Page (james-page) wrote :

Disabling the l2population driver improves the situation, although, as detailed in comment #6, I'm not convinced this is 100% down to normal l2pop operation.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Some significant optimizations have been made to the security groups RPC retrieval code that should reduce the load. They are just now being added to master and will be back-ported to Juno and hopefully Icehouse.[1][2]

1. https://review.openstack.org/#/c/123997/
2. https://review.openstack.org/#/c/124478/

Changed in neutron:
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

I believe the optimizations detailed in these changes have now landed in Juno and later; marking as Fix Released.

Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug has had no activity for more than 365 days. We are unsetting the assignee and milestone and setting the status to Incomplete in order to allow it to expire in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Confirmed → Incomplete
Changed in neutron:
status: Incomplete → Fix Released