Invalid input for dns_name when spawning instance with .number at the end

Bug #1581977 reported by Igor D.C.
This bug affects 5 people
Affects                   | Status  | Importance | Assigned to | Milestone
OpenStack Compute (nova)  | Triaged | Low        | Unassigned  |

Bug Description

When attempting to deploy an instance whose name ends in a dot followed by a number (e.g. .123, as in an all-numeric TLD), or whose name ends in .<number> after conversion to a dns_name, nova conductor fails with the following error:

2016-05-15 13:15:04.824 ERROR nova.scheduler.utils [req-4ce865cd-e75b-4de8-889a-ed7fc7fece18 admin demo] [instance: c4333432-f0f8-4413-82e8-7f12cdf3b5c8] Error from last host: silpixa00394065 (node silpixa00394065): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\n filter_properties)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2116, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance c4333432-f0f8-4413-82e8-7f12cdf3b5c8 was re-scheduled: Invalid input for dns_name. Reason: 'networking-ovn-ubuntu-16.04' not a valid PQDN or FQDN. Reason: TLD '04' must not be all numeric.\nNeutron server returns request_ids: ['req-7317c3e3-2875-4073-8076-40e944845b69']\n"]

In Horizon, this surfaces as the infamous message: "Error: No valid host was found. There are not enough hosts available."

This issue was observed using stable/mitaka via DevStack (nova commit fb3f1706c68ea5b58f05ea810c6339f2449959de).

In the above example, the instance name is "networking-ovn (Ubuntu 16.04)", which resulted in an attempted dns_name="networking-ovn-ubuntu-16.04", where "04" was interpreted as a TLD and rejected as invalid because it is all numeric.
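To illustrate how the display name turns into the rejected dns_name, here is a minimal sketch. The helper `approx_sanitize_hostname` is hypothetical: it assumes the sanitization roughly turns spaces and underscores into hyphens, drops other characters except letters, digits, '.' and '-', lowercases the result and strips leading/trailing '.' and '-'. It is not nova's actual code.

```python
import re


def approx_sanitize_hostname(display_name: str) -> str:
    """Hypothetical approximation of the display-name -> hostname mapping."""
    # Assumption: spaces and underscores become hyphens.
    name = re.sub(r"[ _]", "-", display_name)
    # Assumption: anything other than word characters, '.' and '-' is dropped.
    name = re.sub(r"[^\w.-]+", "", name)
    # Lowercase, then strip leading/trailing dots and hyphens.
    return name.lower().strip(".-")


def last_label_is_all_numeric(hostname: str) -> bool:
    """True if the name has several dot-separated labels and the last is all digits."""
    labels = hostname.split(".")
    return len(labels) > 1 and labels[-1].isdigit()


hostname = approx_sanitize_hostname("networking-ovn (Ubuntu 16.04)")
print(hostname)                             # networking-ovn-ubuntu-16.04
print(last_label_is_all_numeric(hostname))  # True
```

The parentheses disappear, the space becomes a hyphen, and the dot in "16.04" survives, so the last dot-separated label is "04".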

Tags: compute
tags: added: compute
Andrea Rosa (andrea-rosa-m) wrote :

I am not sure this is a bug. We could replace the dot(s) in the hostname with a different character, for example "_", but then users would lose the option to define the hostname as an FQDN.
And according to RFC 952:
' A "name" (Net, Host, Gateway, or Domain name) is a text string up
   to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
   sign (-), and period (.). Note that periods are only allowed when
   they serve to delimit components of "domain style names"'
So a user who includes a period should know that it is only allowed to delimit the components of a domain name, and that the resulting domain name has to be valid.
I am going to mark it as invalid, please let me know if you are not happy about this decision.

Changed in nova:
status: New → Invalid
Dinesh Bhor (dinesh-bhor) wrote :

The dns_name should conform to RFC 1123 (section 2.1) and RFC 952; this is validated [1] in the _validate_dns_format() method, so IMO this is not a bug.

[1] https://github.com/openstack/neutron/blob/master/neutron/extensions/dns.py#L107
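A minimal sketch of this kind of validation, modeled on the rules named in the error message above (label syntax, the 63-character label and 255-character name limits from RFC 1123, and the all-numeric TLD rule). The function names and messages are illustrative; this is not a copy of neutron's _validate_dns_format().

```python
import re
from typing import Optional

# A label is 1-63 ASCII letters, digits or hyphens.
LABEL_RE = re.compile(r"[A-Za-z0-9-]{1,63}")


def _find_problem(name: str) -> Optional[str]:
    """Return the first rule `name` breaks, or None if it passes."""
    if len(name) > 255:
        return f"'{name}' exceeds the 255 character FQDN limit"
    labels = name.split(".")
    for label in labels:
        if not label:
            return "Encountered an empty component"
        if label.startswith("-") or label.endswith("-"):
            return f"Name '{label}' must not start or end with a hyphen"
        if not LABEL_RE.fullmatch(label):
            return (f"Name '{label}' must be 1-63 characters long, each of "
                    "which can only be alphanumeric or a hyphen")
    # RFC 1123 hints that a TLD cannot be all numeric.
    if len(labels) > 1 and labels[-1].isdigit():
        return f"TLD '{labels[-1]}' must not be all numeric"
    return None


def validate_dns_name(data: str) -> Optional[str]:
    """Return an error message for an invalid PQDN/FQDN, else None."""
    if not data:
        return None
    # A single trailing '.' marks a fully qualified name and is allowed.
    trimmed = data[:-1] if data.endswith(".") else data
    reason = _find_problem(trimmed)
    if reason is None:
        return None
    return f"'{data}' not a valid PQDN or FQDN. Reason: {reason}."


print(validate_dns_name("networking-ovn-ubuntu-16.04"))
# 'networking-ovn-ubuntu-16.04' not a valid PQDN or FQDN. Reason: TLD '04' must not be all numeric.
print(validate_dns_name("networking-ovn-ubuntu-16-04"))  # None
```

Note that the name itself is syntactically fine label by label; only the last rule rejects it.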

Igor D.C. (igordcard) wrote :

I see, although I am a bit uncomfortable with the coupling between the nova instance name and the hostname that gets set. Perhaps in the future Horizon could show the instance description instead of the name when listing instances, which would allow special characters in more "user-friendly" names. As for this specific bug report, a new one should be filed against Horizon, and possibly the client (which I haven't tested), to validate names before attempting to create the instance, since creation deterministically fails afterwards.

Maurice Escher (maurice-escher) wrote :

What about only replacing the dots in the name if the result would violate the RFC?

To get the best of both scenarios:
- allow users that want to define a hostname as FQDN to do so
- allow users who don't know or care to specify a hostname ending in dot+number without getting seemingly random errors from a DNS feature they don't use (remember the feature is toggled on a per-Neutron-installation basis, not per user)
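That suggestion could be sketched roughly as follows, assuming the input has already been lowercased and stripped of disallowed characters. `is_valid_dns_name` and `hostname_for` are hypothetical helpers, the validity check is simplified (label syntax plus the numeric-TLD rule only), and using '-' as the replacement character is an assumption, since the comment only says "replacing the dots".

```python
import re

# One DNS label: 1-63 chars, alphanumeric at both ends, hyphens allowed inside.
LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_dns_name(name: str) -> bool:
    """Simplified check: every label is well formed and the TLD is not all digits."""
    trimmed = name[:-1] if name.endswith(".") else name
    labels = trimmed.split(".")
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return False
    return not (len(labels) > 1 and labels[-1].isdigit())


def hostname_for(name: str) -> str:
    """Keep dots when the name is already valid; otherwise replace them with '-'."""
    if is_valid_dns_name(name):
        return name
    # Note: the result can still be invalid for other reasons (e.g. length).
    return name.replace(".", "-")


print(hostname_for("web01.example.com"))            # web01.example.com
print(hostname_for("networking-ovn-ubuntu-16.04"))  # networking-ovn-ubuntu-16-04
```

Users who deliberately pass an FQDN keep it, while names that would be rejected anyway are flattened instead of failing the build.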

Christian Berendt (berendt) wrote :

Through the use of invalid hostnames it is currently possible to deactivate individual nova-compute services.

We are running the Rocky release in one environment. If we start an instance with an invalid hostname, the nova-compute service detects this and throws an exception.

The build failure weigher (enabled by default) then blocks the nova-compute service from receiving further instances. You have to restart the nova-compute service or explicitly start an instance on this node for the service to work again.

In other words, an unprivileged user can block an internal component with incorrect input. In principle, whole environments could be deactivated this way.

We have now temporarily solved this problem by setting build_failure_weight_multiplier to 0.

However, we think that invalid names should already be caught by the API when the instance is created and should not lead to unwanted behavior within the environment. Therefore we are reopening this report.

Changed in nova:
status: Invalid → New
information type: Public → Private Security
Christian Berendt (berendt) wrote :

I set this to Private Security because I think it is security-relevant if unprivileged user input can disrupt parts of an environment.

melanie witt (melwitt) wrote :

Hi Christian, thanks for reporting the security concern related to the BuildFailureWeigher. It is actually a known issue described in the following bug:

https://bugs.launchpad.net/nova/+bug/1818239

and if you check out the discussion there, the conclusion (thus far) [1] has been that while it is possible to de-prioritize a compute host by providing certain invalid inputs, it will not result in deactivating environments, because the build failure weigher only manipulates the scheduling weights of the compute hosts (decreasing their ranking) and does not disable or remove them from scheduling. That is, they are still available for scheduling, just in a de-prioritized state.

Now, this can still be undesirable in a deployment because it will affect how instances are spread amongst compute hosts.

Copied from a RHBZ where I have explained this before [2]:

"... This is why the BuildFailureWeigher can be problematic, because it does not differentiate between user-caused build failures vs compute node-related build failures. Any situation where a request goes to a compute node and fails to build the instance (even a reschedule) will cause a failed_build to be tracked by the BuildFailureWeigher. The failed_build counter is reset (cleared out) for a compute node when any successful build occurs on that compute node. So, it does do some self-healing, but will still result in inconsistent instance placement if any build failures occur. If the customer environment requires a consistent placement of instances on compute nodes, it is best to disable the BuildFailureWeigher by setting [filter_scheduler]build_failure_weight_multiplier = 0."
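The behavior described in the quote can be illustrated with a deliberately simplified, hypothetical model. It is not nova's BuildFailureWeigher code (which works through the scheduler's weigher framework); it only shows the three properties described: any failed build increments a per-host counter, any successful build resets it, and the counter lowers a host's ranking without removing it from the candidates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostState:
    name: str
    failed_builds: int = 0


def record_build(host: HostState, succeeded: bool) -> None:
    """Any failed build (user-caused or not) increments; any success resets."""
    host.failed_builds = 0 if succeeded else host.failed_builds + 1


def rank_hosts(hosts: List[HostState], multiplier: float) -> List[str]:
    """Sort hosts by weight, highest first; no host is ever filtered out."""
    def weight(host: HostState) -> float:
        return -multiplier * host.failed_builds

    # sorted() is stable, so equally weighted hosts keep their input order.
    return [h.name for h in sorted(hosts, key=weight, reverse=True)]


hosts = [HostState("compute1"), HostState("compute2")]
for _ in range(3):
    # e.g. builds rejected because of an invalid dns_name
    record_build(hosts[0], succeeded=False)

print(rank_hosts(hosts, multiplier=1000.0))  # ['compute2', 'compute1']
print(rank_hosts(hosts, multiplier=0.0))     # ['compute1', 'compute2']
record_build(hosts[0], succeeded=True)
print(hosts[0].failed_builds)                # 0
```

With a multiplier of 0 every host gets the same weight from this weigher, which is why setting build_failure_weight_multiplier = 0 neutralizes it.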

For background, the build failure behavior was introduced to address an operator pain point: if a compute host experienced, for example, a hardware failure and was consistently selected as the first host for scheduling, the cloud could effectively become non-operational. No user could boot an instance, because no instance could get past the compute host with failed hardware, and manual intervention from an admin was needed to take the broken compute host out of rotation.

So, initially a mechanism was added to completely disable compute services if they experienced a certain number of failed builds in a row without any successful builds, but this became an actual denial-of-service vector [3] and was changed into the BuildFailureWeigher as a result.

Finally, there was an attempt to "whitelist" certain types of failures to pick and choose which events result in an increment of the failed_build counter [4], but it stalled out and was abandoned because of the complexity and maintainability concerns around having a whitelist. Instead, it is recommended to set [filter_scheduler]build_failure_weight_multiplier = 0 if the BuildFailureWeigher is causing more problems than it is helping in a particular deployment.

[1] https://bugs.launchpad.net/nova/+bug/1818239/comments/21
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1701334#c17
[3] https://bugs.launchpad.net/nova/+bug/1742102
[4] https://review.opendev.org/568953

Jeremy Stanley (fungi) wrote :

Based on Melanie's feedback, I'm switching this bug back to public and marking it as a duplicate.

Jeremy Stanley (fungi) wrote :

Er, correction, not marking as a duplicate since the original report isn't about the weigher aspect, just switching back to public since the related security concern already has a public bug report.

information type: Private Security → Public
Balazs Gibizer (balazs-gibizer) wrote :

I can reproduce the problem in devstack with nova from master:

http://paste.openstack.org/show/779434/

Balazs Gibizer (balazs-gibizer) wrote :

Based on the comments above, neutron does proper validation of DNS names. Nova uses instance.hostname [1] as the dns_name, and instance.hostname is set based on instance.display_name [2]. Nova already sanitizes the hostname [3], but only considers hostname limitations, not this DNS rule. So we could enhance sanitize_hostname() [3] to replace a '\.([\d]+)$' postfix with '_$group1'.

[1] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/network/neutronv2/api.py#L1497
[2] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/compute/api.py#L1663
[3] https://github.com/openstack/nova/blob/207d2c22538ddec4d82fafbc01e756c9d25f6e36/nova/utils.py#L363
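The proposed substitution can be sketched as a standalone function. `rename_numeric_suffix` is hypothetical and is not nova's sanitize_hostname(). The comment proposes '_' as the replacement, but '_' is not in the RFC 952 alphabet quoted earlier in this thread, so the sketch defaults to '-' and takes the separator as a parameter.

```python
import re

# A trailing ".<digits>" component, e.g. ".04" in "networking-ovn-ubuntu-16.04".
NUMERIC_SUFFIX_RE = re.compile(r"\.([0-9]+)$")


def rename_numeric_suffix(hostname: str, sep: str = "-") -> str:
    """Replace a trailing '.<digits>' with '<sep><digits>'; other dots are kept."""
    # A function replacement avoids escaping issues if `sep` contains backslashes.
    return NUMERIC_SUFFIX_RE.sub(lambda m: sep + m.group(1), hostname)


print(rename_numeric_suffix("networking-ovn-ubuntu-16.04"))       # networking-ovn-ubuntu-16-04
print(rename_numeric_suffix("networking-ovn-ubuntu-16.04", "_"))  # networking-ovn-ubuntu-16_04
print(rename_numeric_suffix("vm1.example.com"))                   # vm1.example.com
```

Only the final dotted component is touched, so names that are valid FQDNs pass through unchanged.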

Changed in nova:
status: New → Triaged
importance: Undecided → Low
Joshua Huber (uberjay) wrote :

I ran into a funny variation on this: because the dns_name defaults to a *truncated* instance name, an instance with the following name fails to build:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy

But one more or one fewer "x" succeeds:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.1yyyyyyyyyyyy
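A plausible explanation, assuming the hostname is truncated to 63 characters (the RFC 1123 label limit) before trailing '.' and '-' are stripped, and assuming the failing name has 61 "x" characters. With 61, the cut lands right after ".1", leaving an all-numeric TLD. With 62, the cut leaves a trailing "." that gets stripped. With 60, the last label becomes "1y", which is not all numeric. Both helpers are hypothetical.

```python
def truncate_then_strip(name: str, limit: int = 63) -> str:
    """Hypothetical order of operations: cut to `limit`, then strip '.' and '-'."""
    return name[:limit].strip(".-")


def has_numeric_tld(hostname: str) -> bool:
    labels = hostname.split(".")
    return len(labels) > 1 and labels[-1].isdigit()


suffix = ".1" + "y" * 12
for count in (60, 61, 62):
    host = truncate_then_strip("x" * count + suffix)
    # Show the tail of the truncated name and whether its TLD is all numeric.
    print(count, host[-4:], has_numeric_tld(host))
# 60 x.1y False
# 61 xx.1 True
# 62 xxxx False
```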

sean mooney (sean-k-mooney) wrote :

Personally, I think we should close this as invalid.

This is either a feature request to allow setting a hostname different from the display name as part of nova boot, or a request to expand the allowed set of VM names to include '.', which is currently not allowed, and to transform it into some other value to generate a valid hostname.

This has never been supported, and it is a well-known requirement of the nova API that the VM name has to be a valid hostname, meaning it may not contain a '.'.

So I don't think this is a valid bug.

We could improve the documentation around this or make the API stricter to reject the request earlier, but anything beyond that would require a spec and an API microversion bump, as it would be a new feature.

Given the age of this bug, I'm going to update the triage status.

Changed in nova:
importance: Low → Wishlist
status: Triaged → Opinion
Stephen Finucane (stephenfinucane) wrote :

I disagree. We already sanitize the hostname and fall back to a hostname of 'Server-{instance.uuid}' if that returns an empty string. I think we should also use this fallback if the hostname is not a valid FQDN. Personally, I'd rather we provided a mechanism to set hostnames that was entirely decoupled from the instance name, like below, but that's a lot of work and I don't want to do it :)

  openstack server create --hostname foo.bar ...

Until someone puts in the effort to do that, extending what we have will do just fine.
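The suggested extension (use the 'Server-{instance.uuid}' fallback not only for an empty result but also for a name that would fail DNS validation) could look roughly like this. Both helpers are hypothetical, and the validity check is simplified to label syntax plus the all-numeric TLD rule; the UUID is the one from the log in the bug description.

```python
import re

# One DNS label: 1-63 chars, alphanumeric at both ends, hyphens allowed inside.
LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_dns_name(hostname: str) -> bool:
    """Simplified check: well-formed labels and a TLD that is not all digits."""
    labels = hostname.split(".")
    if not all(LABEL_RE.fullmatch(label) for label in labels):
        return False
    return not (len(labels) > 1 and labels[-1].isdigit())


def hostname_or_fallback(sanitized: str, instance_uuid: str) -> str:
    """Use the sanitized name if it is non-empty and valid, else 'Server-<uuid>'."""
    if sanitized and is_valid_dns_name(sanitized):
        return sanitized
    return f"Server-{instance_uuid}"


example_uuid = "c4333432-f0f8-4413-82e8-7f12cdf3b5c8"
print(hostname_or_fallback("networking-ovn-ubuntu-16.04", example_uuid))
# Server-c4333432-f0f8-4413-82e8-7f12cdf3b5c8
print(hostname_or_fallback("web01", example_uuid))  # web01
```

The trade-off is that the user gets a working but less readable hostname instead of a failed build.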

Changed in nova:
status: Opinion → Triaged
importance: Wishlist → Low
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.

David Hill (david-hill-ubisoft) wrote :

This breaks existing VMs that have "." in their names, as the dots are now replaced by "-". I'm not sure whether it's a new VM and cloud-init changed the hostname, or whether it's a new deployment, but users who explicitly set hostnames this way (like OCP on OSP) will break.
