scheduler: build failure high negative weighting

Bug #1818239 reported by James Page on 2019-03-01
This bug affects 2 people
Affects                                Importance   Assigned to
OpenStack Compute (nova)               Undecided    Unassigned
OpenStack nova-cloud-controller charm  High         James Page
nova (Ubuntu)                          High         Unassigned

Bug Description

Whilst debugging a Queens cloud which seemed to be landing all new instances on 3 out of 9 hypervisors (which resulted in three very heavily overloaded servers), I noticed that the weighting of the build failure weigher is -1000000.0 * the number of failures:

  https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

This means that a server which has any sort of build failure instantly drops to the bottom of the weighed list of hypervisors for scheduling of instances.
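
For illustration, a minimal sketch of the arithmetic described above. This is not nova's actual weigher code: it follows the "multiplier * failure count" description in this report, ignores the normalisation nova applies across weighers, and uses illustrative names only.

    # Simplified, hypothetical sketch of the build-failure weighting
    # described in this report; not nova's real BuildFailureWeigher.
    BUILD_FAILURE_MULTIPLIER = -1000000.0

    def build_failure_weight(failed_builds: int) -> float:
        """Contribution of recorded build failures to a host's weight."""
        return BUILD_FAILURE_MULTIPLIER * failed_builds

    # A single recorded failure pushes a host far below hosts with none,
    # regardless of what the RAM/disk/CPU weighers credit it with.
    hosts = {"node1": 0, "node2": 1, "node3": 3}
    for name in sorted(hosts, key=lambda h: build_failure_weight(hosts[h]),
                       reverse=True):
        print(name, build_failure_weight(hosts[name]))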

Why might an instance fail to build? It could be a timeout due to load; it might also be due to a bad image (one that won't actually boot under qemu). The second cause could be triggered by an end user of the cloud, inadvertently causing all new instances to be pushed onto a small subset of hypervisors (which is what I think happened in our case).

This feels like quite a dangerous default to have, given the potential to DoS hypervisors intentionally or otherwise.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: nova-scheduler 2:17.0.7-0ubuntu1
ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
Uname: Linux 4.15.0-43-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.5
Architecture: amd64
Date: Fri Mar 1 13:57:39 2019
NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: nova
UpgradeStatus: No upgrade log present (probably fresh install)

James Page (james-page) wrote :

As a side note, it's really hard to see the calculated weights for each host in the scheduler, as the weighting is stripped before the debug log message is made here:

  https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L460

I figured out what was happening by logging the list of WeighedHosts rather than the encapsulated objs.
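
A hedged sketch of that debugging approach, using a toy stand-in for the wrapper objects the scheduler produces (the class and field names below are simplified for the demo, not the real scheduler code):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    LOG = logging.getLogger(__name__)

    class WeighedHost:
        # Toy stand-in: the scheduler's wrappers pair the calculated weight
        # with the wrapped host object; names here are illustrative.
        def __init__(self, obj, weight):
            self.obj = obj
            self.weight = weight

    weighed_hosts = [WeighedHost("node1", 0.0), WeighedHost("node2", -1000000.0)]

    # Roughly what the existing debug line shows: only the stripped host
    # objects, so the calculated weights never make it into the log.
    LOG.debug("Weighed %s", [wh.obj for wh in weighed_hosts])

    # Logging the wrappers (or a (host, weight) summary) before they are
    # stripped makes the large negative scores visible.
    LOG.debug("Weighed hosts with scores: %s",
              [(wh.obj, wh.weight) for wh in weighed_hosts])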

James Page (james-page) wrote :

For further context, the only way I could find to reset the build_failures value for a hypervisor was to restart its nova-compute daemon.

James Page (james-page) wrote :

related bug 1742102

Corey Bryant (corey.bryant) wrote :

Based on the fix for bug 1742102, a successful build on the host should reset the failure count to zero, but it seems impractical for a host to get a successful build while it has such a low weighting. So the code does still seem to leave the ability for a user to intentionally increase the failure count and lower the weight for a compute node or nodes, limiting the number of nodes an instance can be scheduled to.

James Page (james-page) on 2019-03-04
Changed in nova (Ubuntu):
status: New → Won't Fix
information type: Private Security → Public Security
Changed in charm-nova-cloud-controller:
status: New → In Progress
importance: Undecided → High
assignee: nobody → James Page (james-page)
tags: added: sts

Reviewed: https://review.openstack.org/640698
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=c5029e9831ab5063485877213987d6827c4d86f1
Submitter: Zuul
Branch: master

commit c5029e9831ab5063485877213987d6827c4d86f1
Author: James Page <email address hidden>
Date: Mon Mar 4 09:25:46 2019 +0000

    Disable BuildFailureWeigher

    Disable the BuildFailureWeigher used when weighting hosts during
    instance scheduling. A single build failure will result in a
    -1000000 weighting which effectively excludes the hypervisor
    from the scheduling decision.

    A bad image can result in build failures, leading to heavy load
    on the hypervisors which have not had a build failure while those
    that have are effectively ignored; the build failure count will be
    reset on a successful build, but due to the high weighting this
    won't happen until all resources on the known-good hypervisors
    have been completely consumed.

    Change-Id: I4d4367ef20e2a20aee1e26d4a0ec69cad2ac69d6
    Closes-Bug: 1818239
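
For deployments not managed by the charm, a hedged example of the equivalent nova.conf change on the scheduler hosts (the option is the multiplier linked in the bug description; 0.0 disables the weigher, while a small positive value keeps a milder de-prioritisation):

    [filter_scheduler]
    # 0.0 disables the build failure weigher; a small positive value keeps
    # a milder de-prioritisation of hosts with recent build failures.
    build_failure_weight_multiplier = 0.0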

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Corey Bryant (corey.bryant) wrote :

Opening this back up against the package and adding upstream as well. I may be missing something, but I think this is still an issue upstream.

Changed in nova (Ubuntu):
status: Won't Fix → Triaged
importance: Undecided → High

Reviewed: https://review.openstack.org/640961
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=56de84a81be00bf9ddffca4426d28c17d5d9798e
Submitter: Zuul
Branch: stable/18.11

commit 56de84a81be00bf9ddffca4426d28c17d5d9798e
Author: James Page <email address hidden>
Date: Mon Mar 4 09:25:46 2019 +0000

    Disable BuildFailureWeigher

    Disable the BuildFailureWeigher used when weighting hosts during
    instance scheduling. A single build failure will result in a
    -1000000 weighting which effectively excludes the hypervisor
    from the scheduling decision.

    A bad image can result in build failures, leading to heavy load
    on the hypervisors which have not had a build failure while those
    that have are effectively ignored; the build failure count will be
    reset on a successful build, but due to the high weighting this
    won't happen until all resources on the known-good hypervisors
    have been completely consumed.

    Change-Id: I4d4367ef20e2a20aee1e26d4a0ec69cad2ac69d6
    Closes-Bug: 1818239
    (cherry picked from commit c5029e9831ab5063485877213987d6827c4d86f1)

Jeremy Stanley (fungi) wrote :

Is the denial of service concern that an authenticated user could engineer a build failure (perhaps by attempting to boot an intentionally corrupt image they uploaded) and perform that action repeatedly to cause the environment to no longer be able to schedule instances to any of the hypervisor hosts through which their build failures rotated?

Corey Bryant (corey.bryant) wrote :

@Jeremy, I think it's more of a limited denial of service (if we can call it that) where a certain number of computes could get a negative weight and not be considered for scheduling. I don't think it's a complete denial of service. For example, in the case you've mentioned the failure weight would become somewhat equivalent for all nodes if they all have a negative weight.

Jeremy Stanley (fungi) wrote :

Thanks! I'm mostly looking for an exploit scenario whereby a malicious actor can intentionally cause harm/deny access to the operating environment for other users. Absent this, we'd probably not bother to issue a security advisory about it.

On Tue, 2019-03-05 at 18:30 +0000, Corey Bryant wrote:
> @Jeremy, I think it's more of a limited denial of service (if we can
> call it that) where a certain number of computes could get a negative
> weight and not be considered for scheduling. I don't think it's a
> complete denial of service.

I believe the term you are looking for is "degradation of service" -
https://en.wikipedia.org/wiki/Denial-of-service_attack#Degradation-of-service_attacks

Matt Riedemann (mriedem) wrote :

@James: per comment 2, see bug 1816360 :) Easy fix for that.

Matt Riedemann (mriedem) wrote :

I've marked this as Incomplete for nova since I'm not aware of any changes being requested here. The build failure weigher was added because of bug 1742102 and in response to operator feedback from the Boston summit to auto-disable computes if they experienced a build failure. The auto-disable behaviour went in, I think, in Pike, and then bug 1742102 pointed out that it was too heavyweight because there were easy ways to auto-disable a lot of computes in a deployment (e.g. volume over-quota from multiple concurrent boot-from-volume server create requests). So Dan added the weigher, which can be configured to weigh hosts with build failures lower so they don't become a black hole, and once a host has a successful build the stats tracking for that compute node is reset.

Anyway, it's up to the charms if they want to disable it by default, but note it was added for a reason and defaults to being "on" for a reason as well (per operator feedback).

Changed in nova:
status: New → Incomplete

@fungi - we accidentally took down 9/12 of the hypervisors in our QA cloud with this; 75% isn't quite a complete denial of service but definitely degraded the capacity significantly

James Page (james-page) wrote :

@mriedem - yeah that was my hack but I see you beat me to raising a review...

Corey Bryant (corey.bryant) wrote :

Matt, what is your opinion on nova disabling the build failure weigher by default? It would then be secure by default, without any exposure to degradation-of-service attacks, and folks can opt in to it if they want. Btw, did you mean to triage as Won't Fix or Incomplete? I think we have enough details to describe the issue, but if anything else is needed please let us know.

Jeremy Stanley (fungi) wrote :

Chris: I don't doubt that this could be a crippling incident, but you say you took down your own cloud and did so accidentally... can you provide a similar scenario where a non-admin user is able to intentionally bring about the same result? That's mostly what I'm looking for to be able to formulate an appropriate impact description for a possible security vulnerability advisory publication.

Dan Smith (danms) wrote :

With the weigher, you shouldn't be able to "take down" anything. You may stack a lot more instances on the non-error-reporting hosts, but once those are full, the scheduler will try one of the hosts reporting errors, and as soon as one succeeds there, the score resets to zero. So can you clarify "took down" in this context?

Also, the weight given to this weigher, like all others, is configurable. If you have no desire to deprioritize failing hosts, you can set it to zero, and if you want this to have a smaller impact then you can change the weight to something smaller. The default weight was carefully chosen to cause a failing host to have a lower weight than others, all things equivalent. Since the disk weigher scales by free bytes (or whatever), if you're a new compute node that has no instances (and thus a lot of free space) and a bad config that will cause you to fail every boot, the fail weigher has to have an impactful score, else it really will have no effect.

I've nearly lost the will to even argue about this issue, so I'm not sure what my opinion is on setting the default to zero, other than to say that the converse argument is also true... If you have one compute node with a broken config (or even just something preventing it from talking to neutron), it will attract all builds in the scheduler, fail them, and the cloud is effectively down until a human is paged to remedy the situation. That was the case this was originally trying to mitigate in its original and evolved forms.

Changed in charm-nova-cloud-controller:
milestone: none → 19.04
status: Fix Committed → Confirmed
status: Confirmed → Fix Released