Bug #1581977 “Invalid input for dns_name when spawning instance ...” : Bugs : OpenStack Compute (nova)

Andrea Rosa (andrea-rosa-m) on 2016-05-16

tags:

added: compute

Andrea Rosa (andrea-rosa-m) wrote on 2016-05-16:

#1

I am not sure this is a bug. We could replace the dot(s) in the host name with a different character, for example "_", but then the user would lose the option to define a hostname as an FQDN.
And according to RFC 952:
' A "name" (Net, Host, Gateway, or Domain name) is a text string up
   to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
   sign (-), and period (.). Note that periods are only allowed when
   they serve to delimit components of "domain style names"'
So a user who uses a period should know that it is allowed only to delimit the components of a domain name, and that the name has to be a valid one.
I am going to mark it as invalid, please let me know if you are not happy about this decision.

Changed in nova:
status:	New → Invalid

Dinesh Bhor (dinesh-bhor) wrote on 2016-05-16:

#2

The dns_name should conform to RFC 1123 (section 2.1) and RFC 952. Validation for this is done in the _validate_dns_format() method [1], so IMO this is not a bug.

[1] https://github.com/openstack/neutron/blob/master/neutron/extensions/dns.py#L107
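For readers who want to see why a name such as "instance.1" is rejected, here is a minimal sketch of the kind of check RFC 1123 (section 2.1) and RFC 952 imply. It is not Neutron's code: the function name check_dns_name, the constants, and the error messages are all hypothetical, and the real _validate_dns_format() may differ in detail.

```python
import re
from typing import Optional

# Hypothetical constants for this sketch; not copied from Neutron.
LABEL_RE = re.compile(r"[a-z0-9-]{1,63}", re.IGNORECASE)
MAX_FQDN_LEN = 255


def check_dns_name(name: str) -> Optional[str]:
    """Return an error message, or None if the name looks valid."""
    # Assumption: a single trailing period (fully qualified form) is allowed.
    trimmed = name[:-1] if name.endswith(".") else name
    if not trimmed or len(trimmed) > MAX_FQDN_LEN:
        return "name is empty or too long"
    labels = trimmed.split(".")
    for label in labels:
        # Each label: 1-63 characters drawn from letters, digits and hyphen.
        if not LABEL_RE.fullmatch(label):
            return f"label {label!r} is empty, too long, or has invalid characters"
        if label.startswith("-") or label.endswith("-"):
            return f"label {label!r} starts or ends with a hyphen"
    # With more than one label, the last one acts as a TLD and
    # must not be all numeric.
    if len(labels) > 1 and labels[-1].isdigit():
        return f"last label {labels[-1]!r} must not be all numeric"
    return None
```

Under these assumptions, check_dns_name("web.example.com") returns None, while check_dns_name("instance.1") returns a message about the all-numeric last label, which is the case this report describes.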

Igor D.C. (igordcard) wrote on 2016-05-16:

#3

I see, although I get a bit uncomfortable with the coupling between the nova instance name and the hostname to be set. Perhaps something interesting in the future would be to show the description of the instance, instead of the name, when listing instances in Horizon, thus having the ability to use special characters to give more "user-friendly" names. In terms of this specific bug report, a new one should be submitted to Horizon, and possibly the client (haven't tested), to validate the names before attempting to create the instance, since it deterministically fails after that.

Maurice Escher (maurice-escher) wrote on 2017-05-18:

#4

What about only replacing the dots in the name if the result would violate the RFC?

To get the best of both scenarios:
- allow users that want to define a hostname as FQDN to do so
- allow users that don't know/care to specify a hostname ending in dot+number without getting random errors from a DNS feature they don't use (remember the feature is toggled per Neutron installation, not per user)

Christian Berendt (berendt) wrote on 2019-09-13:

#5

Through the use of invalid hostnames it is currently possible to deactivate individual nova-compute services.

We run the Rocky release in an environment. If we start an instance with an invalid hostname, the nova-compute service detects this and throws an exception.

The build failure weigher (enabled by default) then blocks the nova-compute service from receiving further instances. You have to restart the nova-compute service or explicitly start an instance on this node for the service to work again.

In other words, it is possible that an unprivileged user blocks an internal component due to an incorrect input. In principle you can deactivate whole environments with it.

We have now temporarily solved this problem by setting build_failure_weight_multiplier to 0.

However, we think that invalid names should already be identified by the API when creating the instance and should not lead to unwanted behavior within the environment. Therefore we are reopening this report.

Changed in nova:
status:	Invalid → New
information type:	Public → Private Security

Christian Berendt (berendt) wrote on 2019-09-13:

#6

I put this on private security because I think it is security relevant if you can disturb parts of an environment with an unprivileged user input.

melanie witt (melwitt) wrote on 2019-09-13:

#7

Hi Christian, thanks for reporting the security concern related to the BuildFailureWeigher. It is actually a known issue described in the following bug:

https://bugs.launchpad.net/nova/+bug/1818239

and if you check out the discussion there, the conclusion (thus far) [1] has been that while it is possible to de-prioritize a compute host by providing certain invalid inputs, it will not result in deactivating environments, because the build failure weigher only manipulates the scheduling weights of the compute hosts (decreasing their ranking) and does not disable or remove them from scheduling. That is, they are still available for scheduling, just in a de-prioritized state.

Now, this can still be undesirable in a deployment because it will affect how instances are spread amongst compute hosts.

Copied from a RHBZ where I have explained this before [2]:

"... This is why the BuildFailureWeigher can be problematic, because it does not differentiate between user-caused build failures vs compute node-related build failures. Any situation where a request goes to a compute node and fails to build the instance (even a reschedule) will cause a failed_build to be tracked by the BuildFailureWeigher. The failed_build counter is reset (cleared out) for a compute node when any successful build occurs on that compute node. So, it does do some self-healing, but will still result in inconsistent instance placement if any build failures occur. If the customer environment requires a consistent placement of instances on compute nodes, it is best to disable the BuildFailureWeigher by setting [filter_scheduler]build_failure_weight_multiplier = 0."

For background, the build failure behavior was introduced to address an operator pain point where if a compute host experienced a hardware failure, for example, and was consistently selected as the first host for scheduling, the cloud could effectively become non-operational with no user able to boot an instance because no instance could get past the compute host with failed hardware and manual intervention from an admin was needed to take the broken compute host out of rotation.

So, initially a mechanism was added to completely disable compute services if they experienced a certain number of failed builds in a row without any successful builds, but this became an actual denial-of-service vector [3] and was changed into the BuildFailureWeigher as a result.

Finally, there was an attempt to "whitelist" certain types of failures to pick and choose which events result in an increment of the failed_build counter [4], but it stalled out and was abandoned because of the complexity and maintainability concerns around having a whitelist. Instead, it is recommended to set [filter_scheduler]build_failure_weight_multiplier = 0 if the BuildFailureWeigher is causing more problems than it is helping in a particular deployment.

[1] https://bugs.launchpad.net/nova/+bug/1818239/comments/21
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1701334#c17
[3] https://bugs.launchpad.net/nova/+bug/1742102
[4] https://review.opendev.org/568953
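The counter behavior described above can be illustrated with a small model. This is a simplified sketch, not nova's BuildFailureWeigher: the Host class, record_build, and rank_hosts are hypothetical names, and the real scheduler combines and normalizes several weighers. It only shows that failures lower a host's rank, that one success resets the counter, and that a multiplier of 0 removes the effect.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    failed_builds: int = 0

    def record_build(self, success: bool) -> None:
        # Any successful build clears the counter; any failure increments it,
        # regardless of whether the user or the host caused the failure.
        if success:
            self.failed_builds = 0
        else:
            self.failed_builds += 1


def rank_hosts(hosts, multiplier):
    """Order hosts best-first; more failed builds means a lower weight."""
    # sorted() is stable, so hosts with equal weight keep their input order.
    return sorted(hosts, key=lambda h: -multiplier * h.failed_builds, reverse=True)


a, b = Host("a"), Host("b")
a.record_build(success=False)  # e.g. a build rejected for an invalid dns_name
print([h.name for h in rank_hosts([a, b], multiplier=1.0)])  # host a is de-prioritized
print([h.name for h in rank_hosts([a, b], multiplier=0)])    # multiplier 0: no effect
```

In this model a host with failures is never removed from the list, only moved down, which matches the distinction drawn above between de-prioritizing and disabling.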

Jeremy Stanley (fungi) wrote on 2019-09-13:

#8

Based on Melanie's feedback, I'm switching this bug back to public and marking it as a duplicate.

Jeremy Stanley (fungi) wrote on 2019-09-13:

#9

Er, correction, not marking as a duplicate since the original report isn't about the weigher aspect, just switching back to public since the related security concern already has a public bug report.

information type:	Private Security → Public

Balazs Gibizer (balazs-gibizer) wrote on 2019-09-26:

#10

I can reproduce the problem in devstack with nova from master:

http://paste.openstack.org/show/779434/

Balazs Gibizer (balazs-gibizer) wrote on 2019-09-26:

#11

Based on the comments above, Neutron does proper validation of DNS names. Nova uses instance.hostname [1] as dns_name, and instance.hostname is set based on instance.display_name [2]. Nova already sanitizes the hostname [3], but only considers the host limitation. So we could enhance sanitize_hostname() [3] to replace a '\.([\d]+)$' postfix with '_$group1'.

[1] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/network/neutronv2/api.py#L1497
[2] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/compute/api.py#L1663
[3] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/utils.py#L363
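As an illustration of the substitution suggested in this comment, here is a minimal sketch using Python's re module. It is not a patch to sanitize_hostname(): the helper name replace_numeric_suffix is hypothetical, and '$group1' is written as \1, the Python backreference syntax.

```python
import re

# The pattern from the comment: a dot followed only by digits at the end.
NUMERIC_SUFFIX = re.compile(r"\.([\d]+)$")


def replace_numeric_suffix(hostname: str) -> str:
    """Replace a trailing '.<digits>' with '_<digits>'; leave other names alone."""
    return NUMERIC_SUFFIX.sub(r"_\1", hostname)


for name in ("instance.1", "web.example.com", "v1.2.3", "node.1a"):
    print(name, "->", replace_numeric_suffix(name))
```

Only the final all-numeric component is touched, so a name that is already usable as an FQDN, such as "web.example.com", passes through unchanged.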

Changed in nova:
status:	New → Triaged
importance:	Undecided → Low

Joshua Huber (uberjay) wrote on 2019-10-09:

#12

I ran into a funny variation on this -- because the dns_domain defaults to a *truncated* instance name, the following instance fails to build:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy

But one additional or fewer "x" will succeed:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy
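One way truncation could produce this pattern is sketched below. Everything specific here is an assumption for illustration: the 63-character limit, the counts of leading "x" characters (60, 61, 62), the stripping of a trailing dot, and the helper name fails_numeric_check. None of these are stated in the comment.

```python
MAX_HOSTNAME_LEN = 63  # assumed truncation length for this sketch


def fails_numeric_check(display_name: str) -> bool:
    """True if the truncated name ends in an all-numeric label after a dot."""
    # Assumption: the name is cut to the limit, then a trailing dot is dropped.
    truncated = display_name[:MAX_HOSTNAME_LEN].rstrip(".")
    labels = truncated.split(".")
    return len(labels) > 1 and labels[-1].isdigit()


for count in (60, 61, 62):
    name = "x" * count + ".1" + "y" * 12
    print(count, fails_numeric_check(name))
```

With 61 leading characters the cut lands right after ".1", leaving an all-numeric last label. One character fewer keeps "1y" as the last label, and one more cuts right after the dot, so neither ends in a purely numeric component.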

sean mooney (sean-k-mooney) wrote on 2020-11-27:

#13

Personally I think we should close this as invalid.

This is either a feature request to allow setting a hostname different from the display name as part of nova boot, or a request to expand the allowed set of VM names to allow '.', which is currently not allowed, and transform it to some other value to generate a valid hostname.

This has never been supported, and it is a well-known requirement of the nova API that the VM name has to be a valid hostname, meaning it may not contain a '.'.

So I don't think this is a valid bug.

We could improve documentation around this or make the API stricter to reject the request earlier, but anything beyond that would require a spec and an API microversion bump, as it would be a new feature.

Given the age of this bug, I'm going to update the triage status.

Changed in nova:
importance:	Low → Wishlist
status:	Triaged → Opinion

Stephen Finucane (stephenfinucane) wrote on 2020-11-27:

#14

I disagree. We already sanitize the hostname and fall back to a hostname 'Server-{instance.uuid}' if that returns an empty string. I think we should also fall back if the hostname is not a valid FQDN. Personally, I'd rather we provided a mechanism to set hostnames that was entirely decoupled from the instance name, like below, but that's a lot of work and I don't want to do it :)

openstack server create --hostname foo.bar ...

Until someone puts in the effort to do that, extending what we have will do just fine.
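The extended fallback could look roughly like the sketch below. The function names pick_hostname and last_label_not_numeric are hypothetical, the validity check covers only the trailing-number case from this report, and nothing here is taken from nova's code. Only the 'Server-{instance.uuid}' default comes from the comment.

```python
from typing import Callable


def last_label_not_numeric(name: str) -> bool:
    """Stand-in validity check: reject names whose last dotted label is all digits."""
    labels = name.split(".")
    return not (len(labels) > 1 and labels[-1].isdigit())


def pick_hostname(sanitized: str, instance_uuid: str,
                  is_valid: Callable[[str], bool]) -> str:
    # Keep the sanitized name only if it is non-empty and passes validation;
    # otherwise fall back to the UUID-based default.
    if sanitized and is_valid(sanitized):
        return sanitized
    return f"Server-{instance_uuid}"


print(pick_hostname("web.example", "1234", last_label_not_numeric))
print(pick_hostname("web.1", "1234", last_label_not_numeric))
print(pick_hostname("", "1234", last_label_not_numeric))
```

Passing the validity check in as a callable keeps the fallback logic separate from whatever definition of a valid name ends up being used.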

Changed in nova:
status:	Opinion → Triaged
importance:	Wishlist → Low

OpenStack Infra (hudson-openstack) wrote on 2021-03-26: Fix included in openstack/nova 23.0.0.0rc1

#15

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

David Hill (david-hill-ubisoft) wrote on 2023-08-29:

#16

This breaks existing VMs that have "." in their names, as the dots are now replaced by "-". I'm not sure if it's a new VM and cloud-init changed the hostname or if it's a new deployment, but users explicitly setting hostnames (like OCP on OSP) will break.

Invalid input for dns_name when spawning instance with .number at the end