[OSSA-2017-006] Potential DoS by rebuilding the same instance with a new image multiple times (CVE-2017-17051)

Bug #1732976 reported by Matt Riedemann on 2017-11-17
Affects                       Importance   Assigned to
OpenStack Compute (nova)      High         Dan Smith
  Pike                        High         Matt Riedemann
OpenStack Security Advisory   High         Jeremy Stanley

Bug Description

As of the fix for bug 1664931 (OSSA-2017-005, CVE-2017-16239), a regression was introduced which allows a potential denial of service.

Once all computes are upgraded to >=Pike and using the (default) FilterScheduler, a rebuild with a new image will go through the scheduler. The FilterScheduler doesn't know that this is a rebuild on the same host and creates VCPU/MEMORY_MB/DISK_GB allocations in Placement against the compute node that the instance is running on. The ResourceTracker in the nova-compute service will not adjust the allocations after the rebuild, so after multiple rebuilds of the same instance with a new image, the Placement service will report the compute node as having no capacity left and will take it out of scheduling consideration.

Eventually the rebuild would fail once the compute node is at capacity, but an attacker could then simply create a new instance (on a new host) and start the process all over again.
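
The accumulation described above can be illustrated with a small toy model. This is only a sketch, not nova or Placement code: the ComputeNode class, its claim() method and the 16-VCPU/2-VCPU numbers are hypothetical, and it assumes each rebuild adds one more flavor's worth of VCPUs to the instance's existing allocation.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Toy stand-in for one compute node's VCPU inventory in Placement."""
    total_vcpus: int
    # consumer (instance name) -> VCPUs allocated against this node
    allocations: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(self.allocations.values())

    def claim(self, instance: str, vcpus: int) -> bool:
        """Mimic the buggy claim: add to the existing allocation instead of replacing it."""
        if self.used() + vcpus > self.total_vcpus:
            return False  # node looks full, so it drops out of scheduling
        self.allocations[instance] = self.allocations.get(instance, 0) + vcpus
        return True


def rebuild_until_full(node: ComputeNode, instance: str, vcpus: int) -> int:
    """Count how many rebuilds succeed before the node appears to be at capacity."""
    rebuilds = 0
    while node.claim(instance, vcpus):
        rebuilds += 1
    return rebuilds


node = ComputeNode(total_vcpus=16)
node.claim("instance-1", 2)  # initial boot of a 2-VCPU instance
count = rebuild_until_full(node, "instance-1", 2)
print(f"rebuilds before the node looks full: {count}, VCPUs reported used: {node.used()}")
```

In this model a single small instance ends up holding the whole node's reported capacity, even though its real usage never changed.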

I have a recreate of the bug here: https://review.openstack.org/#/c/521153/

This would not be a problem for anyone using another scheduler driver since only FilterScheduler uses Placement, and it wouldn't be a problem for any deployment that still has at least one compute service running Ocata code, because the ResourceTracker in the nova-compute service will adjust the allocations every 60 seconds.
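
As a rough illustration of why the periodic adjustment hides the problem, the sketch below overwrites whatever Placement holds for an instance with the resources of its flavor. The function name and data shapes are hypothetical; this is an assumption about the general idea, not the ResourceTracker implementation.

```python
def heal_allocations(placement_allocations, instances_on_host):
    """Return a copy of the allocations with each instance reset to its flavor.

    placement_allocations: instance name -> {resource class: amount}
    instances_on_host: instance name -> flavor resources for that instance
    """
    healed = dict(placement_allocations)
    for name, flavor in instances_on_host.items():
        # Overwrite rather than add, so any doubled amounts are corrected.
        healed[name] = dict(flavor)
    return healed


flavor = {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 20}
doubled = {"instance-1": {"VCPU": 4, "MEMORY_MB": 8192, "DISK_GB": 40}}
healed = heal_allocations(doubled, {"instance-1": flavor})
print(healed["instance-1"])
```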

Beyond this issue, however, there are other problems with the fix for bug 1664931:

1. Even if you're not using the FilterScheduler, e.g. using CachingScheduler, with the RamFilter or DiskFilter or CoreFilter enabled, if the compute node that the instance is running on is at capacity, a rebuild with a new image may still fail whereas before it wouldn't. This is a regression in behavior and the user would have to delete and recreate the instance with the new image.

2. Before the fix for bug 1664931, one could rebuild an instance on a disabled compute service, but now they cannot if the ComputeFilter is enabled (which it is by default and presumably enabled in all deployments).

3. Because of the way instance.image_ref is used with volume-backed instances, we are now *always* going through the scheduler during rebuild of a volume-backed instance, regardless of whether or not the image ref provided to the rebuild API is the same as the original in the root disk. I've already reported bug 1732947 for this.
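
Item 3 can be sketched as a simple comparison. The function below is hypothetical, and it assumes that a volume-backed instance stores an empty image_ref, which the report above does not state explicitly; under that assumption any image passed to the rebuild API looks like a new image.

```python
def rebuild_goes_through_scheduler(instance_image_ref: str, requested_image_ref: str) -> bool:
    """Treat any difference from instance.image_ref as a rebuild with a new image."""
    return requested_image_ref != instance_image_ref


# Image-backed instance rebuilt with the same image: no scheduler pass.
image_backed = rebuild_goes_through_scheduler("image-a", "image-a")

# Volume-backed instance (assumed empty image_ref) rebuilt with the image
# that is already on its root volume: still treated as a new image.
volume_backed = rebuild_goes_through_scheduler("", "image-a")

print(image_backed, volume_backed)
```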

--

The nova team has looked at some potential solutions, but at this point none of them are straightforward, and some involve using scheduler hints which are tied to filters that are not enabled by default (e.g. using the same_host scheduler hint which requires that the SameHostFilter is enabled). Hacking a fix in would likely result in more bugs in subtle or unforeseen ways not caught during testing.

Long-term we think a better way to fix the rebuild + new image validation is to categorize each scheduler filter as being a 'resource' or 'policy' filter, and with a rebuild + new image, we only run filters that are for policy constraints (like ImagePropertiesFilter) and not run RamFilter/DiskFilter/CoreFilter (or Placement for that matter). This would likely require an internal RPC API version change on the nova-scheduler interface, which is something we wouldn't want to backport to stable branches because of upgrade implications with the RPC API version bump.
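
A minimal sketch of that idea follows, assuming each filter carries a 'resource' or 'policy' tag. The Filter, HostState and Request classes and the host_passes() helper are hypothetical and only borrow the filter names mentioned above; they are not the nova scheduler interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class HostState:
    free_ram_mb: int
    supported_arches: Tuple[str, ...]


@dataclass(frozen=True)
class Request:
    ram_mb: int
    image_arch: str
    is_rebuild: bool = False


@dataclass(frozen=True)
class Filter:
    name: str
    kind: str  # "resource" or "policy" (hypothetical classification)
    passes: Callable[[HostState, Request], bool]


FILTERS = [
    Filter("RamFilter", "resource", lambda h, r: h.free_ram_mb >= r.ram_mb),
    Filter("ImagePropertiesFilter", "policy", lambda h, r: r.image_arch in h.supported_arches),
]


def host_passes(host: HostState, request: Request, filters=FILTERS) -> bool:
    """Run all filters, but skip resource filters when validating a rebuild."""
    for f in filters:
        if request.is_rebuild and f.kind == "resource":
            continue  # the instance already holds its resources on this host
        if not f.passes(host, request):
            return False
    return True


full_host = HostState(free_ram_mb=0, supported_arches=("x86_64",))
print(host_passes(full_host, Request(2048, "x86_64", is_rebuild=True)))
print(host_passes(full_host, Request(2048, "x86_64")))
print(host_passes(full_host, Request(2048, "aarch64", is_rebuild=True)))
```

With this split, a rebuild on a host at capacity still passes as long as the new image satisfies the policy filters, while a new boot request on the same host is rejected.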

At this point it might be best to just revert the fix for bug 1664931. We can still revert that through all of the upstream branches that the fix was applied to (newton is not EOL yet). This is obviously a pain for downstream consumers that have picked up and put out fixes for the CVE already. It would also mean publishing an errata for CVE-2017-16239 (we have to do that anyway probably) and saying it's now no longer fixed but is a publicly known issue.

Another possible alternative is shipping a new policy rule in nova that allows operators to disable rebuilding an instance with a new image, so they could decide based on the types of images and scheduler configuration they have if rebuilding with a new image is safe. Public and private cloud providers might see that rule useful in different ways, e.g. disable rebuild with a new image if you allow tenants to upload their own images to your cloud.
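
What such a rule might look like at the API layer is sketched below. The rule name "rebuild_with_new_image", the POLICY dict and the Forbidden exception are all hypothetical placeholders; no such rule is defined in this report, and a real implementation would go through nova's policy machinery rather than a plain dict.

```python
class Forbidden(Exception):
    """Raised when operator policy rejects the request."""


# Hypothetical operator setting: False disables rebuild with a new image.
POLICY = {"rebuild_with_new_image": False}


def check_rebuild(current_image: str, new_image: str, policy=POLICY) -> None:
    """Reject a rebuild only when the image changes and the rule is disabled."""
    changing_image = new_image != current_image
    if changing_image and not policy.get("rebuild_with_new_image", True):
        raise Forbidden("rebuild with a new image is disabled by policy")


check_rebuild("image-a", "image-a")  # same image: always allowed

try:
    check_rebuild("image-a", "image-b")
    rejected = False
except Forbidden:
    rejected = True
print(rejected)
```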

Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete security advisory task has been added while the core security reviewers for the affected project or projects confirm the bug and discuss the scope of any vulnerability along with potential solutions.

description: updated
Changed in ossa:
status: New → Incomplete
Matt Riedemann (mriedem) wrote :

Dan Smith is working on a possible alternative fix for bug 1664931 here:

https://review.openstack.org/#/c/521186/

Matt Riedemann (mriedem) wrote :

Note that at this point, https://review.openstack.org/#/c/521186/1 does not resolve the DoS issue described in this separate bug.

Matt Riedemann (mriedem) wrote :

I've started an etherpad to try and clarify the various issues, status on potential fixes, and options:

https://etherpad.openstack.org/p/nova-rebuild-issues

no longer affects: nova/newton
no longer affects: nova/ocata

This bug description sounds a lot like a class B2 according to VMT taxonomy: https://security.openstack.org/vmt-process.html#incident-report-taxonomy .

Since we are considering reverting the CVE-2017-16239 fix before newton-eol, should we request an embargo exception (e.g. make this report public) to facilitate coordination? ( https://security.openstack.org/vmt-process.html#embargo-exceptions )

Jeremy Stanley (fungi) wrote :

I think this bug probably becomes class A if we fix it by reverting the patches for CVE-2017-16239 (and then that one in turn becomes class B2).

Jeremy Stanley (fungi) wrote :

I've subscribed the direct subscribers of the earlier bug 1664931, as well as the security notes reviewers since we're discussing the possible need to document an unfixable vulnerability in stable branches.

Matt Riedemann (mriedem) wrote :

> since we're discussing the possible need to document an unfixable vulnerability

I wouldn't say this is unfixable. As laid out in https://etherpad.openstack.org/p/nova-rebuild-issues, Dan Smith has an alternative fix which maintains a fix for CVE-2017-16239 while solving part of the regression introduced by the original change (we need to bypass some filters). I think we can build on that change to fix the potential DoS described in *this* bug, which is the issue where the FilterScheduler will double allocations in Placement (and that fix only needs to go back to stable/pike, it's not an issue in newton or ocata).

Jeremy Stanley (fungi) wrote :

Agreed, at least one of the possibilities outlined would keep bug 1664931 fixed as well as mostly addressing the regression covered by this new bug report, which is why I noted it as only a _possible_ need to document an unfixable vulnerability (i.e., if the decision ends up being to revert the CVE-2017-16239 patches instead).

Matt Riedemann (mriedem) wrote :

Between Dan and myself we have fixes for the issues pointed out in this bug and the etherpad:

https://etherpad.openstack.org/p/nova-rebuild-issues

1. https://review.openstack.org/#/c/521186/ - maintains the fix for the original CVE-2017-16239 and also fixes a regression introduced in the original fix where rebuilds can fail based on the scheduler filters that are run, e.g. the ComputeFilter will fail a rebuild if the instance is running on a disabled compute, or the CoreFilter can fail if the rebuild is on a host that is at capacity for vcpu usage. This fix will need to be backported through to stable/newton upstream and it supersedes the original fix for CVE-2017-16239.

2. https://review.openstack.org/#/c/521662/ - fixes the doubling allocations issue in Placement which is the potential DoS pointed out in *this* bug. I haven't linked the bug or added a release note to it, but this is potentially a new CVE, or an errata on the original (I'm not sure about the process here). This fix gets backported through to stable/pike upstream.

3. https://review.openstack.org/#/c/521391/ - fixes a regression introduced with the original fix for CVE-2017-16239 where all volume-backed instances are run through the scheduler during a rebuild, regardless of the image changing. This will need to be backported through to stable/newton upstream. This is more or less a companion to the fix in #1.

--

At this point, what do we do to move forward? Do we need to create a new CVE for #2? Or do these all just get lumped in as errata on the original CVE?

Well, if this bug affects something that has been released (iiuc it will be ==14.0.10, ==15.0.8 and ==16.0.3), then we need another CVE.
If this new bug is caused by the previous OSSA-2017-005, then I think we should proceed with an errata and add the new CVE to the previous OSSA.

Matt Riedemann (mriedem) wrote :

From the issues outlined in comment 10:

1. Fixes a regression introduced by the fix for CVE-2017-16239 which was released. So I guess that's an errata on CVE-2017-16239.

2. Fixes *this* bug 1732976 which is a new CVE introduced by the fix for CVE-2017-16239 which was released, so yes I guess it's a new CVE.

3. Is the same as #1 (regression introduced by the fix for CVE-2017-16239). That's tracked under bug 1732947.

--

Given this, can the nova team move forward with reviewing and backporting the fixes for #1 and #3 while a new OSSA/CVE is created for #2?

Matt,

What would you advise downstream distros carrying Newton to do *now*? For the moment, I have opened the Debian bug for the 1st CVE, but didn't do anything else (well, I prepared the fixed package, but didn't upload it yet). In other words: which is worse? The original CVE, or this bug?

Thanks for working on this,
Cheers,

Thomas Goirand (zigo)

Matt Riedemann (mriedem) wrote :

@zigo: I'd say *this* bug is worse than the original CVE.

Having said that, the fix for this bug builds on top of the fix for the original CVE, so you'd still need to backport the fix for the original CVE to backport the fix for this bug. But if you haven't yet shipped the fix for the original CVE in your distro, I think you'd want to hold off until we clear up this one and get the backports rolling upstream.

Dan Smith (danms) wrote :

Yep, agreed, I'd say that holding off on applying the original "fix" until you can include all of the pending ones as well is the best course of action. The original issue only applies to a small number of scenarios and is somewhat theoretical in nature. The ancillary issue it brought with it is very concrete and applies to almost everyone.

Jeremy Stanley (fungi) wrote :

I'll start piecing together the impact description for this DoS, assuming you feel fix 521662 is safely backportable to all branches where the problem was introduced (I assume all the way back to stable/newton?).

Changed in ossa:
status: Incomplete → Confirmed
Matt Riedemann (mriedem) wrote :

@fungi, https://review.openstack.org/#/c/521662/ only goes back to stable/pike.

This other change, https://review.openstack.org/#/c/521186/, is what's going to go as far back as the fix for bug 1664931 (the original CVE that introduced the regressions laid out in this bug).

I think we're going to go forward with https://review.openstack.org/#/c/521186/ so we can get the backports started since we have to get those to stable/newton while it's still around upstream. What needs to happen for the errata on CVE-2017-16239?

So to recap:

1. https://review.openstack.org/#/c/521186/ and https://review.openstack.org/#/c/521391/ are fixes for regressions introduced by the fix for CVE-2017-16239 and are errata for that CVE, and need to get backported to stable/newton upstream.

2. https://review.openstack.org/#/c/521662/ is the fix for this new CVE and only goes back to stable/pike.

Jeremy Stanley (fungi) wrote :

Matt: Please proceed with 521186 and its associated backports; we'll send announcements with the OSSA-2017-005/CVE-2017-16239 errata once those merge. As for 521662 I'd like to include (a link to) the stable/pike backport in a pre-OSSA once it's ready. Assuming we have a viable backport for this bug within the next couple days, I'd like to propose 15:00 UTC on Tuesday, December 5 as the disclosure date/time.

Here's my proposed impact description for this bug (which I'll use to request a new CVE for the denial of service vulnerability if accurate):

Title: Nova ResourceTracker misses rebuilt resources with new images
Reporter: Matt Riedemann (Huawei)
Products: Nova
Affects: 16.0.3

Description:
Matt Riedemann from Huawei reported a vulnerability in OpenStack Nova's default FilterScheduler. By repeatedly rebuilding an instance with new images, an authenticated user may consume untracked resources on a hypervisor host leading to a denial of service. This regression was introduced with the fix for OSSA-2017-005 (CVE-2017-16239), so only Nova stable/pike or later deployments with that fix applied and relying on the default FilterScheduler are affected.

Matt Riedemann (mriedem) wrote :

Here is the stable/pike fix for the new DoS issue:

https://review.openstack.org/#/c/523214/

As for the description, I'd make the following changes:

1. Change the title to: "Nova FilterScheduler doubles resource allocations during rebuild with new image"

2. The last sentence is a bit confusing:

> This regression was introduced with the fix for OSSA-2017-005 (CVE-2017-16239), so only Nova stable/pike or later deployments with that fix applied and relying on the default FilterScheduler are affected.

The fix for CVE-2017-16239 went further than stable/pike. I think the thing to point out is that this new CVE only affects deployments running stable/pike or later, including on all of their nova-compute services and the scheduler. Before Pike the FilterScheduler in the nova-scheduler service won't create allocations in Placement, and before Pike the ResourceTracker in the nova-compute service will automatically adjust allocations in Placement in a periodic task.

So maybe we should say, "This regression was introduced with the fix for OSSA-2017-005 (CVE-2017-16239), however, only Nova stable/pike or later deployments with that fix applied and relying on the default FilterScheduler are affected."

Jeremy Stanley (fungi) wrote :

Thanks for the corrections! Updated impact description follows...

Title: Nova FilterScheduler doubles resource allocations during rebuild with new image
Reporter: Matt Riedemann (Huawei)
Products: Nova
Affects: 16.0.3

Description:
Matt Riedemann from Huawei reported a vulnerability in OpenStack Nova's default FilterScheduler. By repeatedly rebuilding an instance with new images, an authenticated user may consume untracked resources on a hypervisor host leading to a denial of service. This regression was introduced with the fix for OSSA-2017-005 (CVE-2017-16239), however, only Nova stable/pike or later deployments with that fix applied and relying on the default FilterScheduler are affected.

Matt Riedemann (mriedem) wrote :

The description in comment 20 looks good to me. Thanks Jeremy!

The above impact description and the proposed disclosure date 2017-12-05 15:00 UTC look good to me too.

And regarding the ERRATA, here is a proposed update to OSSA-2017-005:

errata: >
  The former fix introduced regressions in the rebuild functionality.
  Rebuild may failed when the compute host is running at capacity or when
  the host is disabled.
  This update provides an additional set of fix for these regressions.

Also, it seems like we are missing a few backports, here is the current list:
  queens:
    - https://review.openstack.org/521186 (errata)
    - https://review.openstack.org/521391 (errata)
  pike:
    - https://review.openstack.org/523212 (errata)

Matt Riedemann (mriedem) wrote :

@Tristan,

One typo in your errata description: s/may failed/may fail/

There are other ways the rebuild could incorrectly fail based on other scheduler filters, not just a disabled compute or hosts being at capacity. I'm not sure how to make that sentence more generic. Maybe something like, "Rebuild may fail depending on configured scheduler filters and environment, for example, when the compute host is running at capacity or when the host is disabled."

I can start on the backports for https://review.openstack.org/#/c/521391/, I was hoping to get some core review before I started doing backports. I also need to work on ocata and newton backports for those patches as well.

Thanks for the review, Matt. We usually describe a regression/bug in an errata update in the past tense so that it implies the bug is now fixed by the errata patch.

Anyway, your proposed version looks good to me. We could send it as soon as all the reviews are created and they have received +2(s).

Jeremy Stanley (fungi) wrote :

Further deliberation on the OSSA-2017-005/CVE-2017-16239 errata should probably move to bug 1664931 since it's not security-sensitive and also not dependent on the present embargo for this report, so can go forward independently.

fwiw, the bug 1664931 ERRATA is proposed here: https://review.openstack.org/523649

Jeremy Stanley (fungi) on 2017-11-29
summary: Potential DoS by rebuilding the same instance with a new image multiple
- times
+ times (CVE-2017-17051)

A pre-OSSA with copies of the current patches from (and links to them in) code review has been sent to our downstream stakeholders now. Disclosure is still on track for Tuesday at 15:00 UTC.

Changed in ossa:
status: Confirmed → Fix Committed
importance: Undecided → High
assignee: nobody → Jeremy Stanley (fungi)
Jeremy Stanley (fungi) on 2017-12-05
description: updated
information type: Private Security → Public Security
Matt Riedemann (mriedem) on 2017-12-05
Changed in nova:
assignee: nobody → Dan Smith (danms)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/525643
Committed: https://git.openstack.org/cgit/openstack/ossa/commit/?id=e2283a6b9e16cf055d73115f8a8349168d8cb732
Submitter: Zuul
Branch: master

commit e2283a6b9e16cf055d73115f8a8349168d8cb732
Author: Jeremy Stanley <email address hidden>
Date: Tue Dec 5 14:55:50 2017 +0000

    Adds OSSA-2017-006 (CVE-2017-17051)

    Change-Id: I6110a60e10afb6cad11ec19156a27362c0c1ec2f
    Related-Bug: #1732976

Changed in nova:
assignee: Dan Smith (danms) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2017-12-05
Changed in nova:
assignee: Matt Riedemann (mriedem) → Dan Smith (danms)
Jeremy Stanley (fungi) on 2017-12-05
summary: - Potential DoS by rebuilding the same instance with a new image multiple
- times (CVE-2017-17051)
+ [OSSA-2017-006] Potential DoS by rebuilding the same instance with a new
+ image multiple times (CVE-2017-17051)
Changed in ossa:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/521662
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=25a1d78e83065c5bea5d8e0a017fd9d0914d41d9
Submitter: Zuul
Branch: master

commit 25a1d78e83065c5bea5d8e0a017fd9d0914d41d9
Author: Dan Smith <email address hidden>
Date: Mon Nov 20 13:24:24 2017 -0800

    Fix doubling allocations on rebuild

    Commit 984dd8ad6add4523d93c7ce5a666a32233e02e34 makes a rebuild
    with a new image go through the scheduler again to validate the
    image against the instance.host (we rebuild to the same host that
    the instance already lives on). This fixes the subsequent doubling
    of allocations that will occur by skipping the claim process if
    a policy-only scheduler check is being performed.

    Closes-Bug: #1732976

    Related-CVE: CVE-2017-17051
    Related-OSSA: OSSA-2017-006

    Change-Id: I8a9157bc76ba1068ab966c4abdbb147c500604a8
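
    A rough sketch of the behaviour this commit message describes, not the
    actual patch: the function name, the policy_only_check flag and the
    dict used as a stand-in for Placement are hypothetical.

```python
def select_destination(placement, host, instance, flavor_vcpus, policy_only_check=False):
    """Return the chosen host, claiming resources unless this is a policy-only check.

    placement: instance name -> VCPUs allocated on the host (toy model)
    """
    if not policy_only_check:
        # Normal scheduling: record the instance's resources against the host.
        placement[instance] = placement.get(instance, 0) + flavor_vcpus
    # Policy-only check (rebuild on the same host): leave allocations untouched.
    return host


placement = {}
select_destination(placement, "host-1", "instance-1", 2)  # initial boot
for _ in range(5):  # five rebuilds with a new image
    select_destination(placement, "host-1", "instance-1", 2, policy_only_check=True)
print(placement)
```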

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 17.0.0.0b2 development milestone.

Reviewed: https://review.openstack.org/523214
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fed660c1189fdf4159d97badfdc8c5b35ad14f23
Submitter: Zuul
Branch: stable/pike

commit fed660c1189fdf4159d97badfdc8c5b35ad14f23
Author: Dan Smith <email address hidden>
Date: Mon Nov 20 13:24:24 2017 -0800

    Fix doubling allocations on rebuild

    Commit 984dd8ad6add4523d93c7ce5a666a32233e02e34 makes a rebuild
    with a new image go through the scheduler again to validate the
    image against the instance.host (we rebuild to the same host that
    the instance already lives on). This fixes the subsequent doubling
    of allocations that will occur by skipping the claim process if
    a policy-only scheduler check is being performed.

    Closes-Bug: #1732976

    Related-CVE: CVE-2017-17051
    Related-OSSA: OSSA-2017-006

    NOTE(mriedem): This change removes the Pike-only workaround
    added in 234ade29a39cf2d51e08157e149e0cbd0c5047be.

    Change-Id: I8a9157bc76ba1068ab966c4abdbb147c500604a8
    (cherry picked from commit 25a1d78e83065c5bea5d8e0a017fd9d0914d41d9)

This issue was fixed in the openstack/nova 16.0.4 release.
