scheduler hang (DOS) possible with DifferentHostFilter/SameHostFilter

Bug #1017795 reported by Dan Prince
264
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Dan Prince
Essex
Fix Released
High
Dan Prince
nova (Ubuntu)
Fix Released
Undecided
Unassigned
Precise
Fix Released
Undecided
Unassigned

Bug Description

When using Nova scheduler with either the DifferentHostFilter and/or SameHostFilter we expose a way to make repeated DB calls via compute_api.get(). A user could submit requests to create servers with scheduler_hints such that thousands of DB calls to lookup an instance ID are made... which can cause the Nova scheduler to hang until the all the lookups are finished.

To reproduce:

1) Create a valid instance.

2) Obtain the 'ID' of the instance. You can do this by looking at the instance_name (an extended server attribute)

3) Create specially formated server create request which repeates the valid instance ID in the os:scheduler_hints part of the request. See the following example request:

{
"server" : {
"name" : "sched_tester",
"imageRef" : "7d3b6df9-0cb0-45ce-b9d8-621829d85554",
"flavorRef" : "1"
},
"os:scheduler_hints" : {
  "same_host": ["1234", "1234", "1234", "1234", "1234", "1234", "1234", "1234", "1234"...]
}
}

If you script it you can repeat the ID 10,000 times or so before you hit the request size limit for nova... plenty enough to cause Nova scheduler hangs for 5-10 minutes per server request (actual times would depend on DB performance, size, etc.)

Making 5-10 of these requests could take out all the Nova schedulers for a given cloud thus causing a server outage.

-----

As a temporary workaround users can simply disable the DifferentHostFilter and/or SameHostFilter options in nova.conf. (these are not on by default so users would have had to enable them to begin with)

-----

A good solution to this might be to avoid repeated calls to compute_api.get and replace them with a single call to get all the instance host IDs for a particular user/tenant. This would however limit the selection capabilities to instances within the current user/tenants account. I think this is acceptable. We could also put a flat limit on the number of allowed instance_id/uuids used for the scheduler_hint requests.

-----

It looks like this affects both Folsom and Essex.

Related branches

CVE References

Dan Prince (dan-prince)
Changed in nova:
assignee: nobody → Dan Prince (dan-prince)
status: New → In Progress
Revision history for this message
Thierry Carrez (ttx) wrote :

@Dan: good find -- if you work on a patch, please post it here (private) rather than on Gerrit (public).

Revision history for this message
Dan Prince (dan-prince) wrote :

Attaching patch/fix for Folsom.

Revision history for this message
Dan Prince (dan-prince) wrote :

Attaching patch fix for Essex.

Revision history for this message
Dan Prince (dan-prince) wrote :

ttx: Both of the attached patches replace the compute_api.get calls with a single compute_api.get_all call which should be limited to the context and only contain servers for each user/tenant.

Like I mentioned above an alternate solution might be leave the compute_api.get call as is and simply limit the number of UUIDs we allow to be specified in the scheduler_hints.

We'll see what other think....

Revision history for this message
Thierry Carrez (ttx) wrote :

Addig nova-core to review the patches and give opinion on last comment.
Please do not share this publicly yet, and +1 or -1 in the comments here.

Revision history for this message
Thierry Carrez (ttx) wrote :

Proposed impact description, please confirm:

Title: Scheduler denial of service through scheduler_hints
Impact: High
Reporter: Dan Prince (Red Hat)
Products: Nova
Affects: Essex, Folsom series

Description:
Dan Prince from Red Hat reported a vulnerability in Nova scheduler nodes. By creating servers with malicious scheduler_hints, an authenticated user may generate a huge amount of database calls, potentially resulting in a Denial of Service attack against Nova scheduler nodes. Only setups exposing the OpenStack API and enabling DifferentHostFilter and/or SameHostFilter are affected.

Revision history for this message
Dan Prince (dan-prince) wrote :

ttx: I think that impact description looks fine.

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Seems reasonable to me ... are there related reviews for these patches?

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

The two patches have slightly different diffs, but it seems they could be the same. Not sure if this was just the way patch worked but for later comparisons it would be nice to have them the same.

Revision history for this message
Dan Prince (dan-prince) wrote :

Hi Sandy:

There aren't currently reviews for these patches since this ticket is private (security related). Once we get approval I think ttx sets the date for the public disclosure and we go from there. The idea is to pre-approve these patches here privately and then have it go right in once the public disclosure is made.

The reason the patches have slightly different diffs is because the Essex version of the affinity scheduler is slightly different than Folsmom's.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

looks reasonable. +1

Revision history for this message
Thierry Carrez (ttx) wrote :

Pushed downstream, proposed CRD is Wednesday, July 4th 1500 UTC

Revision history for this message
Thierry Carrez (ttx) wrote :

Hmm, due to independence day, I proposed Tuesday, July 3rd 1500 UTC instead (which is very short) with a fallback to Wednesday July 11th, 1500 UTC.

Revision history for this message
Mark McLoughlin (markmc) wrote :

Patches look good to me too.

Revision history for this message
Dan Prince (dan-prince) wrote :

July 3rd sounds fine to me. Let me know if we need to use the fallback instead.

Revision history for this message
Thierry Carrez (ttx) wrote :

Date was pushed back to Wednesday July 11th, 1500 UTC, by popular demand.

Revision history for this message
Thierry Carrez (ttx) wrote :

Adding Dave Walker for stable/*

Revision history for this message
Dan Prince (dan-prince) wrote :

Attaching a rebase of the Folsom diff. No major changes....

[dprince@dovetail ~]$ diff folsom.diff folsom2.diff
2c2
< index b4e1a30..b0eb3eb 100644
---
> index e2a38b1..db8da1a 100644
5c5
< @@ -26,8 +26,11 @@ class AffinityFilter(filters.BaseHostFilter):
---
> @@ -25,8 +25,11 @@ class AffinityFilter(filters.BaseHostFilter):
19c19
< @@ -38,10 +41,11 @@ class DifferentHostFilter(AffinityFilter):
---
> @@ -37,12 +40,13 @@ class DifferentHostFilter(AffinityFilter):
24a25,26
> if isinstance(affinity_uuids, basestring):
> affinity_uuids = [affinity_uuids]
32c34
< @@ -56,11 +60,12 @@ class SameHostFilter(AffinityFilter):
---
> @@ -57,13 +61,14 @@ class SameHostFilter(AffinityFilter):
37a40,41
> if isinstance(affinity_uuids, basestring):
> affinity_uuids = [affinity_uuids]

Revision history for this message
Thierry Carrez (ttx) wrote :

Adding Canonical security so that they can coordinate on the same Launchpad bug

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/9637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/9638

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/9639

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/9637
Committed: http://github.com/openstack/nova/commit/034762e8060dcf0a11cb039b9d426b0d0bb1801d
Submitter: Jenkins
Branch: master

commit 034762e8060dcf0a11cb039b9d426b0d0bb1801d
Author: Dan Prince <email address hidden>
Date: Tue Jun 26 12:44:35 2012 -0400

    Use compute_api.get_all in affinity filters.

    Updates the affinity filters so they make a single compute API
    call to lookup instance host information rather than single
    lookups for each UUID.

    This resolves a potential performance issue which can cause a
    scheduler to hang while processing requests which contain large numbers
    of UUID's in the scheduler_hints.

    Fixes LP Bug #1017795.

    Change-Id: I30f434faf109058573ee41c4a6abce2e48939e8d

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote :

Patches out, bug opened

visibility: private → public
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/9639
Committed: http://github.com/openstack/nova/commit/25f5bd31805bd21d7b7e3583c775252aa8f737e9
Submitter: Jenkins
Branch: stable/essex

commit 25f5bd31805bd21d7b7e3583c775252aa8f737e9
Author: Dan Prince <email address hidden>
Date: Wed Jul 11 10:19:05 2012 -0400

    Use compute_api.get_all in affinity filters.

    Updates the affinity filters so they make a single compute API
    call to lookup instance host information rather than single
    lookups for each UUID.

    This resolves a potential performance issue which can cause a
    scheduler to hang while processing requests which contain large numbers
    of UUID's in the scheduler_hints.

    Fixes LP Bug #1017795.

    Change-Id: Ie33378a992e53002c16b2d99135699b576f3d7e3

Revision history for this message
Thierry Carrez (ttx) wrote :

OSSA 2012-009

Revision history for this message
Steve Beattie (sbeattie) wrote :

In Ubuntu, this was address for Ubuntu 12.04 LTS (aka precise) in nova 2012.1+stable~20120612-3ee026e-0ubuntu1.2 as described in USN 1501-1: http://www.ubuntu.com/usn/usn-1501-1 . Thanks!

Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-3
status: Fix Committed → Fix Released
Dave Walker (davewalker)
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed
Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Ubuntu 12.04 was fixed in http://www.ubuntu.com/usn/usn-1501-1.

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released
Revision history for this message
Adam Gandelman (gandelman-a) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/9637
Stable review: https://review.openstack.org/9639

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-3 → 2012.2
To post a comment you must log in.
This report contains Public Security information  
Everyone can see this security related information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.