CloudStack datasource: querying data-server does not work on Fedora 34

Bug #1942232 reported by Olivier Lemasle
This bug affects 1 person
Affects: cloud-init
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Cloud provider: CloudStack

The CloudStack datasource uses multiple ways to determine the address of the metadata server. One of them (sometimes the only one available) is a DNS query for the hostname "data-server".

However, "data-server" resolves only through the DNS search domain. Resolution with the trailing dot ("data-server.") works on Fedora 33 but not on Fedora 34.

In Python:

- With Fedora 33:

from socket import getaddrinfo
getaddrinfo("data-server.", 80) => works
getaddrinfo("data-server", 80) => works

- With Fedora 34:

from socket import getaddrinfo
getaddrinfo("data-server.", 80) => fails with "[Errno -2] Name or service not known"
getaddrinfo("data-server", 80) => works

This is not caused by a change in Python (Python 3.9.6 is used for both) but probably in the underlying getaddrinfo implementation.

As the final dot is normally used to prevent using the DNS search domain, the error is actually normal, and the dot should be removed for cloud-init to work.
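The comparison above can be scripted. This is a minimal sketch, assuming a helper named `probe` (hypothetical, not part of cloud-init) that tries each spelling of the name and records either the resolved addresses or the error text. The resolver is passed in as a parameter so the helper can be exercised without a CloudStack network.

```python
import socket


def probe(names, port=80, resolver=socket.getaddrinfo):
    """Try to resolve each name; map it to a sorted address list or an error string.

    `probe` is a hypothetical helper for illustration, not cloud-init code.
    """
    results = {}
    for name in names:
        try:
            infos = resolver(name, port)
        except socket.gaierror as exc:
            # Keep the resolver's own message, e.g. "[Errno -2] Name or service not known".
            results[name] = f"error: {exc}"
        else:
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address.
            results[name] = sorted({info[4][0] for info in infos})
    return results
```

Calling `probe(["data-server", "data-server."])` on an affected VM should show in one place which spelling resolves and which one raises.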

James Falcon (falcojr) wrote :

Hi Olivier. I think I need a little more context here. I'm not familiar with CloudStack.

What's the FQDN of the metadata service? If it is anything beyond 'data-server.', then why aren't we specifying the FQDN rather than specifying a relative domain? If the FQDN is simply 'data-server.', I would need to understand what changed underneath us to justify switching to a relative domain.

"This is not caused by a change in Python (Python 3.9.6 is used for both) but probably in the underlying getaddrinfo implementation."
That isn't really enough of a justification. Can you point to something specific that changed? I can currently launch a fedora 34 instance, start a service on port 80, and getaddrinfo('localhost.', 80) works fine. Similarly, I can hardcode a known server into /etc/hosts as 'test-server' and getaddrinfo('test-server.', 80) works.

Is there something else that needs to be done to reproduce this behavior?

Changed in cloud-init:
status: New → Incomplete
Olivier Lemasle (o-lemasle) wrote :

Hi James,

The metadata service is hosted by the "Virtual Router" of a domain, which is also a router and a DHCP server.

The FQDN of the metadata service is "data-server" + a customizable network domain. This domain name is provided by DHCP.

For example, in my CloudStack lab environment, it is "vm.apalia.lan". I've copy-pasted below some command outputs from this specific lab environment.
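To make the "data-server" + network domain construction concrete, here is a small sketch that derives the FQDN from resolv.conf-style text. The function names `search_domains` and `fqdn_candidates` are hypothetical, and reading the domain from the `search` line is an assumption based on the resolv.conf excerpts in this report.

```python
def search_domains(resolv_conf_text):
    """Return the domains from the last `search` line of resolv.conf-style text."""
    domains = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            # Per resolv.conf(5), the last `search` directive wins.
            domains = fields[1:]
    return domains


def fqdn_candidates(host, resolv_conf_text):
    """Build absolute names (with trailing dot) for `host` in each search domain."""
    return [
        f"{host}.{domain.rstrip('.')}."
        for domain in search_domains(resolv_conf_text)
    ]
```

With the lab's resolv.conf, `fqdn_candidates("data-server", text)` gives `["data-server.vm.apalia.lan."]`, which is the name that dig resolves in the outputs.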

With a Fedora 33 VM in my CloudStack lab environment, you can see that:
- dig needs the FQDN to find the data-server IP address,
- "host" succeeds with "data-server" but fails with "data-server."
- However, both curl and python's getaddrinfo return the correct IP address
  of the metadata service when requesting data-server or data-server.

[fedora@fed33 ~]$ grep search /etc/resolv.conf
# configured search domains.
search vm.apalia.lan

[fedora@fed33 ~]$ dig +short data-server
[fedora@fed33 ~]$ dig +short data-server.vm.apalia.lan
10.0.26.1

[fedora@fed33 ~]$ host data-server
data-server.vm.apalia.lan has address 10.0.26.1

[fedora@fed33 ~]$ host data-server.
Host data-server not found: 2(SERVFAIL)

[fedora@fed33 ~]$ curl http://data-server/latest/local-hostname
fed33

[fedora@fed33 ~]$ curl http://data-server./latest/local-hostname
fed33

[fedora@fed33 ~]$ python
Python 3.9.6 (default, Jul 16 2021, 00:00:00)
[GCC 10.3.1 20210422 (Red Hat 10.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import getaddrinfo
>>> getaddrinfo("data-server", 80)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.0.26.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.0.26.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.0.26.1', 80))]
>>> getaddrinfo("data-server.", 80)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.0.26.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.0.26.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.0.26.1', 80))]

With a Fedora 34 VM, you can see that both curl and python's getaddrinfo now
return the correct IP address when requested with "data-server" or the
FQDN (here "data-server.vm.apalia.lan") but fail with "data-server."

[fedora@fed34 ~]$ grep search /etc/resolv.conf
# configured search domains.
search vm.apalia.lan

[fedora@fed34 ~]$ dig +short data-server
[fedora@fed34 ~]$ dig +short data-server.vm.apalia.lan
10.0.26.1

[fedora@fed34 ~]$ host data-server
data-server.vm.apalia.lan has address 10.0.26.1

[fedora@fed34 ~]$ host data-server.
Host data-server not found: 2(SERVFAIL)

[fedora@fed34 ~]$ curl http://data-server/latest/local-hostname
fed34

[fedora@fed34 ~]$ curl http://data-server./latest/local-hostname
curl: (6) Could not resolve host: data-server.

[fedora@fed34 ~]$ curl http://data-server.vm.apalia.lan/latest/local-hostname
fed34

[fedora@fed34 ~]$ python
Python 3.9.6 (default, Jul 16 2021, 00:00:00)
[GCC 11.1.1 20210531 (Red Hat 11.1.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import getaddrinfo
>>> getaddrinfo("data-server", 80)
[(<AddressFamily.AF_I...


Gregor Riepl (onitake) wrote :

The trailing dot was added on request by @smoser.

See the discussion on the original merge request here: https://code.launchpad.net/~joschi36/cloud-init/+git/cloud-init/+merge/371807

(Click "Show diff comments" on the first comment by @smoser).

Quote:
> Can you put a trailing '.' on this ?
>
> So that it does not propogate across 'search' entries? We had a request to do that other places. It seems generally a good idea if possible.

Gregor Riepl (onitake) wrote :

If the trailing dot really was incorrect, you can drop it by reverting commit cfb75bbc71a5c620d2a65a749e1ba440c1c43837.

Chad Smith (chad.smith) wrote :

I'm not certain why the trailing dot is incorrect. It's an absolute and unambiguous hostname that avoids potential aliasing and trying to access a host given multiple search domains.

It makes me think that either:
 1. The DNS server is not configured properly to represent the top-level unambiguous domain
 2. Something in the Fedora 34 DNS lookup config has changed
 3. Something in libc /lib/x86_64-linux-gnu/libnss_dns.so.2 /lib/x86_64-linux-gnu/libresolv.so.2 or /lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2 has changed how it handles unambiguous DNS hostnames.

Some context I found on this was http://www.dns-sd.org/trailingdotsindomainnames.html to give background on the intent of the trailing dot.
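The intent of the trailing dot can be illustrated with a simplified model of the search-list rules described in resolv.conf(5). This is only a sketch: `lookup_order` is a hypothetical function, and real resolvers (glibc, systemd-resolved) differ in details that this thread does not pin down.

```python
def lookup_order(name, search, ndots=1):
    """Return the absolute names a stub resolver would try, in order.

    Simplified model of the search-list rules in resolv.conf(5); it is an
    illustration, not a description of any specific resolver.
    """
    if name.endswith("."):
        # Trailing dot: the name is already absolute, so the search list is skipped.
        return [name]
    with_search = [f"{name}.{domain.rstrip('.')}." for domain in search]
    as_is = [name + "."]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return as_is + with_search
    return with_search + as_is
```

Under this model, `lookup_order("data-server.", ["vm.apalia.lan"])` contains only `"data-server."`, so a lookup can succeed only if the DNS server has a record for that bare name. Without the dot, `"data-server.vm.apalia.lan."` is tried first.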

The only diffs I see from python when calling getaddrinfo with an unambiguous hostname are that python3 accesses the following C libs to ultimately try to resolve this unambiguous hostname:

+openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
+openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2", O_RDONLY|O_CLOEXEC) = 3
+openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
+openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
+openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = 3

I'm not certain we want this change given the CloudStack docs recommend this approach and the use of the unambiguous hostname.

James Falcon (falcojr) wrote :

Chad, did you see the notes from Olivier above yours? There's some kind of "virtual router" involved, and it seems it is intended to use the search domain of the environment.

"I'm not certain we want this change given the cloudstack docs recommend this approach and the use of the unambiguous hostname."
The documentation was submitted after the PR to cloud-init, so it seems it was based on what we do and not the other way around.

Is that true, Gregor?

James Falcon (falcojr)
Changed in cloud-init:
status: Incomplete → Fix Committed
James Falcon (falcojr) wrote : Fixed in cloud-init version 21.4.

This bug is believed to be fixed in cloud-init version 21.4. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released