Too many errors can trigger compute failed_builds to get incremented

Bug #1774527 reported by Joshua Harlow
This bug report is a duplicate of: Bug #1742102: Simple user can disable compute.
Affects                       Status   Importance   Assigned to   Milestone
OpenStack Compute (nova)      New      Undecided    Unassigned
OpenStack Security Advisory   New      Undecided    Unassigned

Bug Description

So let's analyze what can cause a compute manager's failed_builds counter to get incremented, and point out that some of those causes should not be incrementing it (which can then have the 'nice' effect of auto-disabling a nova-compute service).

The flow is: self._do_build_and_run_instance returns a result code; the catch-all exception handler also forces the result code to failed; and whenever the result is failed, the failed_builds counter gets incremented.
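
A minimal sketch of that control flow (the class, the helper names, and the threshold of 10 are stand-ins under my reading of nova/compute/manager.py, not the actual nova source):

    FAILED, ACTIVE = "failed", "active"  # stand-ins for build_results constants

    class ComputeManagerSketch:
        def __init__(self, disable_threshold=10):
            self._failed_builds = 0
            self._disable_threshold = disable_threshold
            self.service_disabled = False

        def _do_build_and_run_instance(self, boot_ok):
            # In nova this is where every build-time exception originates,
            # whatever its cause; here a single flag stands in for all of them.
            if not boot_ok:
                raise RuntimeError("any exception at all lands in the handler")
            return ACTIVE

        def build_and_run_instance(self, boot_ok=True):
            result = None
            try:
                result = self._do_build_and_run_instance(boot_ok)
            except Exception:
                result = FAILED  # catch-all: the *cause* is never inspected
            finally:
                if result == FAILED:
                    self._failed_builds += 1
                    if self._failed_builds >= self._disable_threshold:
                        self.service_disabled = True  # nova-compute disabled
                else:
                    self._failed_builds = 0  # a success resets the streak

    mgr = ComputeManagerSketch()
    for _ in range(10):
        mgr.build_and_run_instance(boot_ok=False)  # e.g. a user's broken image
    assert mgr.service_disabled  # the host is now out of the scheduling pool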

Some exceptions unrelated to the health of nova-compute itself that, from reading the code, can trigger this:

- Unable to base64 decode injected files.
- Failure of notify_about_instance_create to actually send (some inner exception perhaps?)
- exception.NoMoreNetworks, exception.NoMoreFixedIps
- exception.FlavorDiskTooSmall, exception.FlavorMemoryTooSmall,
  exception.ImageNotActive, exception.ImageUnacceptable,
  exception.InvalidDiskInfo, exception.InvalidDiskFormat,
  cursive_exception.SignatureVerificationError,
  exception.VolumeEncryptionNotSupported, exception.InvalidInput,
  exception.RequestedVRamTooHigh --- these bubble up as BuildAbortException
- exception.InstanceNotFound, exception.UnexpectedDeletingTaskStateError
- Anything that pops out of _build_resources
   - Failed to allocate network

And many more?
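
As a concrete illustration of the first item above, here is a hypothetical sketch (decode_injected_file is an invented name, and BuildAbortException is a bare stand-in for nova.exception's class) of how purely user-controlled input becomes a counted build failure:

    import base64
    import binascii

    class BuildAbortException(Exception):
        """Stand-in for nova.exception.BuildAbortException."""

    def decode_injected_file(contents):
        # Invented helper name; nova decodes injected file contents
        # somewhere along the build path with the same net effect.
        try:
            return base64.b64decode(contents, validate=True)
        except binascii.Error as exc:
            raise BuildAbortException("unable to base64 decode file") from exc

    try:
        decode_injected_file("not valid base64!!")  # purely bad user input
    except BuildAbortException:
        pass  # ...yet this is exactly the path that bumps failed_builds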

summary: - To many errors can trigger compute failed_builds to get incremented
+ Too many errors can trigger compute failed_builds to get incremented
Joshua Harlow (harlowja)
description: updated
description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I just want to add my concern over the impact of this bug. I work with Josh and observed how this was discovered. We had a misconfigured image service for a little while, which left behind an image that could not be booted. As a result, all of our hypervisors in our staging environment except one were disabled. That's not what was intended by this setting. The only failures that should increment this counter are those that are *specifically* faults of the compute node itself. If that can't be determined 100% of the time, the counter must not be incremented, as it would allow users to DoS the entire cloud simply by uploading an image whose root disk is too big for a flavor.

Joshua Harlow (harlowja)
description: updated
description: updated
information type: Public → Private Security
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I've marked this as a private security vulnerability; unfortunately, the notifications already went out explaining it, so that may be moot.

This seems like a DoS bug that needs to be addressed.

Revision history for this message
Joshua Harlow (harlowja) wrote :

Thanks Clint; I am not too sure that I am reading all of this correctly, but please correct me if I am wrong: even things like https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1557 throwing over quota can cause the counter to eventually be incremented. Trace what calls it: it is called from _build_resources, which re-raises the error as exception.BuildAbortException, which then gets caught in _build_and_run_instance, which re-raises it, which causes result = build_results.FAILED, and then a finally block causes the counter to get incremented... and so on and so forth.
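
A condensed sketch of that chain (simplified stand-in classes and module-level functions, not the real call path):

    class OverQuota(Exception):
        """Stand-in for the over-quota error raised while building resources."""

    class BuildAbortException(Exception):
        """Stand-in for nova.exception.BuildAbortException."""

    FAILED = "failed"
    failed_builds = 0

    def _build_resources():
        raise OverQuota("volume quota exceeded")  # user's quota, not a host fault

    def _build_and_run_instance():
        try:
            _build_resources()
        except OverQuota as exc:
            # The re-raise erases the "this was the user's fault" distinction.
            raise BuildAbortException(str(exc)) from exc

    def build_and_run_instance():
        global failed_builds
        result = None
        try:
            _build_and_run_instance()
            result = "active"
        except BuildAbortException:
            result = FAILED
        finally:
            if result == FAILED:
                failed_builds += 1  # the quota error counts against the host

    build_and_run_instance()
    assert failed_builds == 1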

information type: Private Security → Public
Revision history for this message
Joshua Harlow (harlowja) wrote :

I think something like the following can show this to folks:

http://paste.openstack.org/show/722481/

Let that run for a while, and after enough time nova-compute services will start dropping offline.

Revision history for this message
Joshua Harlow (harlowja) wrote :

So I do not think this is a duplicate; the scope of bug 1742102 is narrower than this one's.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is definitely a duplicate of bug 1742102. That was reported for a specific scenario that can cause this, which has been useful for framing discussions around the types of failures that should not count against the auto-disable counter, but agree that it's much more than just the volume over quota error.
