Idle rpc traffic with a large number of instances causes failures

Bug #1384660 reported by James Page

Affects                 Status         Importance   Assigned to   Milestone
Ubuntu Cloud Archive    Fix Released   Undecided    Unassigned
neutron                 Fix Released   Undecided    Unassigned

Bug Description

OpenStack Juno (Neutron ML2+OVS/l2pop/neutron security groups), Ubuntu 14.04

500 compute node cloud, running 4.5k active instances (can't get it any further right now).

As the number of instances in the cloud increases, the idle load on the neutron-server hosts (four of them, each with 4 cores/8 threads and a suitable *_workers configuration) climbs from near zero to around 30. At that point the DB call get_port_and_sgs is being serviced around 10 times per second on each server. Other things are also happening; I've attached the last 1000 lines of the server log with debug enabled.
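
For anyone wanting to gauge the call rate from a debug log, a quick way is to bucket matching lines by second (a minimal sketch, assuming the default oslo.log "YYYY-MM-DD HH:MM:SS.mmm ..." timestamp prefix; the log path is passed as the first argument and is not assumed to be any particular file):

import sys
from collections import Counter

# Count occurrences of a marker string (default: get_port_and_sgs) per second
# in a neutron-server debug log, assuming each line starts with the default
# oslo.log timestamp prefix.
def call_rate(path, needle="get_port_and_sgs"):
    per_second = Counter()
    with open(path) as fh:
        for line in fh:
            if needle in line:
                per_second[line[:19]] += 1  # bucket by the seconds part of the timestamp
    return per_second

if __name__ == "__main__":
    rate = call_rate(sys.argv[1])
    if rate:
        print("peak calls/sec: %d" % max(rate.values()))
        print("mean calls/sec: %.1f" % (sum(rate.values()) / float(len(rate))))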

The result is that it's no longer possible to create new instances: the RPC and API threads just don't get onto the CPU, which leads to VIF plugging timeouts on the compute nodes and ERROR'ed instances.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: neutron-common 1:2014.2-0ubuntu1~cloud0 [origin: Canonical]
ProcVersionSignature: User Name 3.13.0-35.62-generic 3.13.11.6
Uname: Linux 3.13.0-35-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.5
Architecture: amd64
CrashDB:
 {
                "impl": "launchpad",
                "project": "cloud-archive",
                "bug_pattern_url": "http://people.canonical.com/~ubuntu-archive/bugpatterns/bugpatterns.xml",
             }
Date: Thu Oct 23 10:22:14 2014
PackageArchitecture: all
SourcePackage: neutron
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.neutron.api.paste.ini: [deleted]
modified.conffile..etc.neutron.fwaas.driver.ini: [deleted]
modified.conffile..etc.neutron.l3.agent.ini: [deleted]
modified.conffile..etc.neutron.neutron.conf: [deleted]
modified.conffile..etc.neutron.policy.json: [deleted]
modified.conffile..etc.neutron.rootwrap.conf: [deleted]
modified.conffile..etc.neutron.rootwrap.d.debug.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.ipset.firewall.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.iptables.firewall.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.l3.filters: [deleted]
modified.conffile..etc.neutron.rootwrap.d.vpnaas.filters: [deleted]
modified.conffile..etc.neutron.vpn.agent.ini: [deleted]
modified.conffile..etc.sudoers.d.neutron.sudoers: [deleted]

Revision history for this message
James Page (james-page) wrote :

Some other details: l2population driver and neutron security groups are enabled.
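
For reference, the relevant driver settings are along these lines (a sketch of the standard Juno ML2/OVS configuration, not a verbatim copy of the deployed files):

/etc/neutron/plugins/ml2/ml2_conf.ini on the neutron-server hosts:

[ml2]
mechanism_drivers = openvswitch,l2population

[securitygroup]
enable_security_group = True
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

and on the compute-node OVS agents:

[agent]
l2_population = True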

description: updated
tags: added: scale-testing
removed: scale-test
summary: - Idle rpc traffic with a large number of instances causes failures
+ Idle rpc traffic (get_port_and_sgs) with a large number of instances
+ causes failures
Revision history for this message
James Page (james-page) wrote :
summary: - Idle rpc traffic (get_port_and_sgs) with a large number of instances
- causes failures
+ Idle rpc traffic with a large number of instances causes failures
description: updated
Revision history for this message
James Page (james-page) wrote :

I see:

2014-10-23 10:51:28.325 15781 TRACE neutron.notifiers.nova ConnectionError: HTTPConnectionPool(host='athh4.maas', port=8774): Max retries exceeded with url: /v2/3ff63e9bdfcf4c93a4c2804feb4406e3/os-server-external-events (Caused by <class 'socket.gaierror'>: (-3, 'Lookup timed out'))

as symptomatic of this problem.
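
The gaierror suggests the name lookup itself is being starved rather than nova being unreachable. A self-contained way to check resolution latency from a neutron-server host (a rough sketch; athh4.maas is just the host from the traceback above, substitute the local nova endpoint):

import socket
import time

# Time a name resolution for the nova API endpoint.  Very slow lookups or a
# socket.gaierror here would match the "Lookup timed out" failure seen in the
# nova notifier traceback.
def time_lookup(host, port=8774):
    start = time.time()
    try:
        socket.getaddrinfo(host, port)
        return time.time() - start, None
    except socket.gaierror as exc:
        return time.time() - start, exc

if __name__ == "__main__":
    elapsed, err = time_lookup("athh4.maas")
    print("lookup took %.2fs, error: %s" % (elapsed, err))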

Revision history for this message
James Page (james-page) wrote :

I also see a fairly continuous stream of add_fdb_entries messages on the hypervisor agents.

Revision history for this message
James Page (james-page) wrote :

I can't rule out bug 1384109 as a contributing cause of this problem; the failing postcommit could well have negative effects in this area.

Revision history for this message
James Page (james-page) wrote :

Disabling the l2population driver improves the situation, although, as detailed in comment #6, I'm not convinced this is 100% down to normal l2pop operation.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Some significant optimizations have been made to the security groups RPC retrieval code that should reduce the load. They are just now being added to master and will be back-ported to Juno and hopefully Icehouse.[1][2]

1. https://review.openstack.org/#/c/123997/
2. https://review.openstack.org/#/c/124478/

Changed in neutron:
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

I believe the optimizations detailed in these changes have now landed in Juno and later; marking as Fix Released.

Changed in cloud-archive:
status: New → Fix Released
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug has had no activity for more than 365 days. We are unsetting the assignee and milestone and setting the status to Incomplete in order to allow it to expire in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Confirmed → Incomplete
Changed in neutron:
status: Incomplete → Fix Released