metadata service performance regression ~8x

Bug #1361357 reported by Scott Moser
40
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Medium
Unassigned
Icehouse
Medium
Unassigned
Juno
Medium
Unassigned
neutron (Ubuntu)
Medium
Unassigned
Trusty
Medium
Unassigned
Utopic
Medium
Unassigned

Bug Description

A change was made to the neutron portion of the nova metadata service to disable caching. This ends up causing a all hits to the nova metadata service to generate messages to retrieve data from neutron. Doing so makes a "crawl" of the metadata service take at least 8x longer than it did. cloud-init crawls the metadata service during boot. The end result is that instances boot significantly slower than they did previously.

The commits are marked as having fixed bug 1276440 [1].
The commits are:
 icehouse: d568fee34be36ca17a9124fe6539f62d702d6359 [2]
 trunk: 3faea81c6029033c85cefd6e98d7a3e719e858f5 [3]

[1] http://pad.lv/1276440
[2] https://github.com/openstack/neutron/commit/d568fee34be36ca17a9124fe6539f62d702d6359
[3] https://github.com/openstack/neutron/commit/423ca756af10e10398636d6d34a7594a4fd4bc87

Revision history for this message
Scott Moser (smoser) wrote :

Attaching a python program showing the performance regression. Example output:
$ python bug-1361357.py
old comparison code: 0.0159089565277
new comparision code: 1.65406799316

Scott Moser (smoser)
Changed in nova (Ubuntu Trusty):
status: New → Confirmed
importance: Undecided → Medium
Changed in nova (Ubuntu Utopic):
importance: Undecided → Medium
status: New → Confirmed
Scott Moser (smoser)
description: updated
Revision history for this message
Seth Arnold (seth-arnold) wrote :

I can't comment on the need for a timing-independent fix in any given location -- my instinct suggests HMAC with "strong enough" hashes doesn't need timing-resistant equality checks because brute forcing HMAC is bloody difficult -- but with the assumption that the analysis was done correctly, there's no easy way to speed up equality tests that don't leak timing information.

There are two possible solutions to this:
- Fix the comparison to run more quickly
- Change algorithms in whatever calls comparisons so often that comparisons are a major bottleneck

Both options together may in fact lead to better performance than before, so it's worth investigating both.

My suggestions for things to try to speed up the comparison:

- Implement a similar comparison algorithm in C and call it via FFI methods
- Implement a sip-hash for the strings and bail early if the hashes don't match.

The choice of hashing function is important -- it can't be something that allows easily constructing strings that collide. One might almost suggest hmac-sha256, except that's apparently the thing that is already too slow. The siphash value could be computed ahead of time, perhaps for one of the inputs anyway, which may improve the user-visible latency.

On the other hand, with the timing that you've reported, this looks like it'd be a real problem only if there's supposed to be ~10K comparisons per second; how likely is that? If it happens often, it sounds a lot like an O(N) algorithm is being used when an O(Lg(N)) algorithm should be used instead. Consider, if the code is searching linearly through a list to find a match it could probably be converted to a sorted array or a tree or a hash table to allow bisection searches or searching significantly fewer entries in a hash table. Maintaining a binary tree or sorted array could be cheaper than repeated linear searches, especially if enough lookups are performed between insertions / deletions.

Algorithmic improvements are harder to discuss in the abstract -- and may require changing multiple interfaces -- but stand a better chance of addressing this regression head-on than trying to tweak the comparison code.

Revision history for this message
Scott Moser (smoser) wrote :

As to whether or not this is a problem, it is a real problem seen on our internal openstack cloud. Instance boot times added 20+ seconds (and they are typically ~10 seconds).

Revision history for this message
Scott Moser (smoser) wrote :

Hm.. It seems like this should not have caused the performance regression I'm seeing, unless something else is wrong and that code is being executed way more than it needs to be.

Revision history for this message
Scott Moser (smoser) wrote :

We'll have to dig more as to what is causing this. I was not able to reproduce this performance issue on devstack.

Revision history for this message
Scott Moser (smoser) wrote :
Tracy Jones (tjones-i)
tags: added: network
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This appears to be a regression between 2014.1.1 and 2014.1.2. I'm attaching output from timed ec2metadata runs on both.

Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :

It looks like the performance hit is caused by this commit: https://github.com/openstack/neutron/commit/d568fee34be36ca17a9124fe6539f62d702d6359

An ec2metadata run on 2014.1.2 minus this commit got rid of the performance hit.

Scott Moser (smoser)
no longer affects: nova (Ubuntu)
no longer affects: nova
no longer affects: nova (Ubuntu Trusty)
no longer affects: nova (Ubuntu Utopic)
Changed in neutron:
status: New → Confirmed
Changed in neutron (Ubuntu Trusty):
status: New → Confirmed
Changed in neutron (Ubuntu Utopic):
status: New → Confirmed
Scott Moser (smoser)
summary: - metadata service performance regression ~100x
+ metadata service performance regression ~8x
Revision history for this message
Scott Moser (smoser) wrote :

I updated the description with a better diagnosis of the problem, as the first was way off, and only would cause confusion.

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121649

Changed in neutron:
assignee: nobody → Corey Bryant (corey.bryant)
status: Confirmed → In Progress
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I just posted a patch for this. I'm attaching results for non-cached metadata performance before and after the fix, and I'm also attaching results for cached metadata performance before and after the fix (using cache_url = memory://?default_ttl=5).

Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/121782

James Page (james-page)
Changed in neutron (Ubuntu Trusty):
importance: Undecided → Medium
Changed in neutron (Ubuntu Utopic):
importance: Undecided → Medium
Maru Newby (maru)
Changed in neutron:
importance: Undecided → High
assignee: Corey Bryant (corey.bryant) → Oleg Bondarev (obondarev)
Revision history for this message
Maru Newby (maru) wrote :

I'm changing assignment to Oleg since he has submitted a patch to switch to eliminate the need for client initialization by using RPC.

Changed in neutron:
importance: High → Medium
milestone: none → kilo-1
tags: added: juno-backport-potential
Maru Newby (maru)
Changed in neutron:
milestone: kilo-1 → none
Maru Newby (maru)
tags: added: icehouse-backport-potential
Maru Newby (maru)
Changed in neutron:
milestone: none → kilo-1
Maru Newby (maru)
tags: removed: icehouse-backport-potential juno-backport-potential
Maru Newby (maru)
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → nobody
milestone: kilo-1 → none
Revision history for this message
Maru Newby (maru) wrote :

As per a thread on the mailing list [1], this issue was already fixed [2] in Neutron in Juno and backported to Icehouse, so I'm going to remove Neutron as an affected project.

1: http://lists.openstack.org/pipermail/openstack-dev/2014-October/048916.html
2: https://bugs.launchpad.net/neutron/+bug/1276440

no longer affects: neutron
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've confirmed that non-caching performance on icehouse 2014.1.3 has been restored back to the 2014.1.1 range.

James Page (james-page)
Changed in neutron (Ubuntu Utopic):
status: Confirmed → Fix Released
Changed in neutron (Ubuntu):
status: Confirmed → Fix Released
Changed in neutron (Ubuntu Trusty):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers