Too many errors can trigger compute failed_builds to get incremented

Bug #1774527 reported by Joshua Harlow
This bug report is a duplicate of: Bug #1742102: Simple user can disable compute.
Affects                       Status   Importance   Assigned to   Milestone
OpenStack Compute (nova)      New      Undecided    Unassigned
OpenStack Security Advisory   New      Undecided    Unassigned

Bug Description

So let's analyze what can cause a compute manager's failed_builds counter to get incremented, and point out that some of those causes should not be incrementing it (which can then have the 'nice' effect of auto-disabling a nova-compute service).

The flow is: self._do_build_and_run_instance returns a result code; the catch-all exception handler also forces the result code to failed; and whenever the result is failed, the failed_builds counter gets incremented.
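
A minimal sketch of that control flow (the class, the helper names, and the threshold of 10 are stand-ins under my reading of nova/compute/manager.py, not the actual nova source):

    FAILED, ACTIVE = "failed", "active"  # stand-ins for build_results constants

    class ComputeManagerSketch:
        def __init__(self, disable_threshold=10):
            self._failed_builds = 0
            self._disable_threshold = disable_threshold
            self.service_disabled = False

        def _do_build_and_run_instance(self, boot_ok):
            # In nova this is where every build-time exception originates,
            # whatever its cause; here a single flag stands in for all of them.
            if not boot_ok:
                raise RuntimeError("any exception at all lands in the handler")
            return ACTIVE

        def build_and_run_instance(self, boot_ok=True):
            result = None
            try:
                result = self._do_build_and_run_instance(boot_ok)
            except Exception:
                result = FAILED  # catch-all: the *cause* is never inspected
            finally:
                if result == FAILED:
                    self._failed_builds += 1
                    if self._failed_builds >= self._disable_threshold:
                        self.service_disabled = True  # nova-compute disabled
                else:
                    self._failed_builds = 0  # a success resets the streak

    mgr = ComputeManagerSketch()
    for _ in range(10):
        mgr.build_and_run_instance(boot_ok=False)  # e.g. a user's broken image
    assert mgr.service_disabled  # the host is now out of the scheduling pool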

Some exceptions unrelated to the health of nova-compute itself that, from reading the code, can trigger this:

- Unable to base64 decode injected files.
- Failure of notify_about_instance_create to actually send (some inner exception perhaps?)
- exception.NoMoreNetworks, exception.NoMoreFixedIps
- exception.FlavorDiskTooSmall, exception.FlavorMemoryTooSmall,
  exception.ImageNotActive, exception.ImageUnacceptable,
  exception.InvalidDiskInfo, exception.InvalidDiskFormat,
  cursive_exception.SignatureVerificationError,
  exception.VolumeEncryptionNotSupported, exception.InvalidInput,
  exception.RequestedVRamTooHigh --- these bubble up as BuildAbortException
- exception.InstanceNotFound, exception.UnexpectedDeletingTaskStateError
- Anything that pops out of _build_resources
   - Failed to allocate network

And many more?
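
As a concrete illustration of the first item above, here is a hypothetical sketch (decode_injected_file is an invented name, and BuildAbortException is a bare stand-in for nova.exception's class) of how purely user-controlled input becomes a counted build failure:

    import base64
    import binascii

    class BuildAbortException(Exception):
        """Stand-in for nova.exception.BuildAbortException."""

    def decode_injected_file(contents):
        # Invented helper name; nova decodes injected file contents
        # somewhere along the build path with the same net effect.
        try:
            return base64.b64decode(contents, validate=True)
        except binascii.Error as exc:
            raise BuildAbortException("unable to base64 decode file") from exc

    try:
        decode_injected_file("not valid base64!!")  # purely bad user input
    except BuildAbortException:
        pass  # ...yet this is exactly the path that bumps failed_builds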

summary: - To many errors can trigger compute failed_builds to get incremented
+ Too many errors can trigger compute failed_builds to get incremented
Joshua Harlow (harlowja)
description: updated
description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I just want to add my concern over the impact of this bug. I work with Josh and observed how this was discovered. We had a misconfigured image service for a little while, which left behind an image that could not be booted. As a result, all of our hypervisors in our staging environment except one were disabled. That's not what was intended by this setting. The only failures that should increment this counter are those that are *specifically* faults of the compute node itself. If that can't be determined 100% of the time, the counter must not be incremented, as it would allow users to DoS the entire cloud simply by uploading an image whose root disk is too big for a flavor.

Joshua Harlow (harlowja)
description: updated
description: updated
information type: Public → Private Security
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I've marked this as a private security vulnerability; unfortunately, the notifications already went out explaining it, so that may be moot.

This seems like a DoS bug that needs to be addressed.

Revision history for this message
Joshua Harlow (harlowja) wrote :

Thanks Clint; I am not too sure that I am reading all of this correctly, but please correct me if I am wrong: even things like https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1557 throwing over quota can cause the counter to eventually be incremented. Trace what calls it: it is called from _build_resources, which re-raises the error as exception.BuildAbortException, which then gets caught in _build_and_run_instance, which re-raises it, which causes result = build_results.FAILED, and then a finally block causes the counter to get incremented... and so on and so forth.
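
A condensed sketch of that chain (simplified stand-in classes and module-level functions, not the real call path):

    class OverQuota(Exception):
        """Stand-in for the over-quota error raised while building resources."""

    class BuildAbortException(Exception):
        """Stand-in for nova.exception.BuildAbortException."""

    FAILED = "failed"
    failed_builds = 0

    def _build_resources():
        raise OverQuota("volume quota exceeded")  # user's quota, not a host fault

    def _build_and_run_instance():
        try:
            _build_resources()
        except OverQuota as exc:
            # The re-raise erases the "this was the user's fault" distinction.
            raise BuildAbortException(str(exc)) from exc

    def build_and_run_instance():
        global failed_builds
        result = None
        try:
            _build_and_run_instance()
            result = "active"
        except BuildAbortException:
            result = FAILED
        finally:
            if result == FAILED:
                failed_builds += 1  # the quota error counts against the host

    build_and_run_instance()
    assert failed_builds == 1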

information type: Private Security → Public
Revision history for this message
Joshua Harlow (harlowja) wrote :

I think something like the following can show this to folks:

http://paste.openstack.org/show/722481/

Let that run for a while, and after enough time nova-compute services will start dropping offline.

Revision history for this message
Joshua Harlow (harlowja) wrote :

So I do not think this is a duplicate; the scope of bug 1742102 is narrower than this one's.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is definitely a duplicate of bug 1742102. That was reported for a specific scenario that can cause this, which has been useful for framing discussions around the types of failures that should not count against the auto-disable counter, but agree that it's much more than just the volume over quota error.
