No support for classless-static-routes on centos 7

Bug #1850642 reported by Harry Kominos on 2019-10-30
This bug affects 2 people
Affects: cloud-init
Importance: Medium
Assigned to: Unassigned

Bug Description

In a TripleO Rocky deployment I am seeing the behaviour below.

In the bootup logs I see that the metadata service could not be reached.

  File "/usr/lib/python2.7/site-packages/cloudinit/sources/DataSourceOpenStack.py", line 177, in _crawl_metadata
    'No active metadata service found')

However, the service is curlable from inside the VM, and I am also seeing some requests in the metadata service in OpenStack. Furthermore, I am getting SSH keys inserted, so I believe this might be a false warning.

Harry Kominos (hkominos) wrote :
Scott Moser (smoser) wrote :

I think this is a duplicate of bug 1801364.

Scott Moser (smoser) wrote :

un-dupe this if you think my diagnosis is wrong.

Scott Moser (smoser) wrote :

un-marked as a duplicate.
cloud-init WARNed twice in this log.
a.) openstack local datasource failed to read metadata service (timed out)
b.) openstack network datasource worked, but failed to persist instance-data.json. This is bug 1801364, I'm pretty sure.

I think the reason for 'a' is the fact that the dhcp lease gave an explicit
route to 169.254.169.254, but cloud-init doesn't know how to read that in its
EphemeralDhcp network. See:

+++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++++
+-------+-----------------+----------------+-----------------+-----------+-------+
| Route |   Destination   |    Gateway     |     Genmask     | Interface | Flags |
+-------+-----------------+----------------+-----------------+-----------+-------+
|   0   |     0.0.0.0     | 136.156.91.254 |     0.0.0.0     |    eth0   |   UG  |
|   1   |   136.156.90.0  |    0.0.0.0     |  255.255.254.0  |    eth0   |   U   |
|   2   | 169.254.169.254 | 136.156.90.11  | 255.255.255.255 |    eth0   |  UGH  |
+-------+-----------------+----------------+-----------------+-----------+-------+
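
To illustrate why the missing host route would matter, here is a minimal sketch of longest-prefix route selection using Python's standard `ipaddress` module. The `next_hop` helper and the two route lists are hypothetical and are not cloud-init code; the addresses are taken from the route table above and from the ephemeral setup in the logs. It assumes the ephemeral network only has the connected /23 and the default route.

```python
import ipaddress


def next_hop(routes, dest):
    """Return the gateway of the most specific route matching dest.

    routes is a list of (cidr, gateway) pairs; a gateway of None means
    the destination is on-link.
    """
    addr = ipaddress.ip_address(dest)
    matches = [
        (ipaddress.ip_network(cidr), gw)
        for cidr, gw in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        raise LookupError("no route to %s" % dest)
    # The longest prefix wins, as in the kernel routing table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Assumed state of the ephemeral network: no host route from the lease.
ephemeral = [
    ("0.0.0.0/0", "136.156.91.254"),
    ("136.156.90.0/23", None),
]
# State of the booted system, including the DHCP-provided host route.
booted = ephemeral + [("169.254.169.254/32", "136.156.90.11")]

print(next_hop(ephemeral, "169.254.169.254"))  # falls back to the default gateway
print(next_hop(booted, "169.254.169.254"))     # uses the metadata host route
```

If the default gateway does not answer for 169.254.169.254, the first case would time out, which is consistent with warning 'a'.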

Ryan Harper (raharper) wrote :

The logs do contain that bug, but I'm not sure that's the failure here.

Looking at the cloud-init logs, we can see the Ephemeral DHCP start and obtain a lease, but the endpoint does not respond:

2019-10-30 13:28:41,516 - dhcp.py[DEBUG]: Received dhcp lease on eth0 for 136.156.90.74/255.255.254.0
2019-10-30 13:28:41,516 - __init__.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 136.156.90.74/23 brd 136.156.91.255
2019-10-30 13:28:41,516 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '136.156.90.74/23', 'broadcast', '136.156.91.255', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,519 - util.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,522 - util.py[DEBUG]: Running command ['ip', 'route', 'show', '0.0.0.0/0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,525 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', '136.156.91.254', 'dev', 'eth0', 'src', '136.156.90.74'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:28:41,527 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '136.156.91.254', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2019-10-30 13:29:21,573 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.044 seconds
2019-10-30 13:29:21,574 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/openstack' with {'url': 'http://169.254.169.254/openstack', 'headers': {'User-Agent': 'Cloud-Init/18.5'}, 'allow_redirects': True, 'method': 'GET', 'timeout': 10.0} configuration
2019-10-30 13:29:31,608 - url_helper.py[DEBUG]: Calling 'http://169.254.169.254/openstack' failed [10/-1s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /openstack (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fdc5e5dc210>, 'Connection to 169.254.169.254 timed out. (connect timeout=10.0)'))]
2019-10-30 13:29:31,608 - DataSourceOpenStack.py[DEBUG]: Giving up on OpenStack md from ['http://169.254.169.254/openstack'] after 10 seconds
2019-10-30 13:29:31,609 - util.py[DEBUG]: Crawl of metadata service took 50.079 seconds

Cloud-init did not find the OpenStack datasource at local time:

2019-10-30 13:29:31,689 - main.py[DEBUG]: No local datasource found

So we write out a fallback network config (dhcp on eth0).

2019-10-30 13:29:31,722 - stages.py[INFO]: Applying network configuration from fallback bringup=False: {'version': 1, 'config': [{'subnets': [{'type': 'dhcp'}], 'type': 'physical', 'name': 'eth0', 'mac_address': 'fa:16:3e:f4:b5:1d'}]}
2019-10-30 13:29:31,723 - __init__.py[DEBUG]: Selected renderer 'sysconfig' from priority list: None
2019-10-30 13:29:31,725 - util.py[DEBUG]: Writing to /etc/sysconfig/network-scripts/ifcfg-eth0 - wb: [644] 159 bytes
2019-10-30 13:29:31,726 - util.py[DEBUG]: Restoring selinux mode for /etc/sysconfig/network-scripts/ifcfg-eth0 (recursive=False)
2019-10-30 13:29:31,726 - util.py[DEBUG]: Restor...


Harry Kominos (hkominos) wrote :

Attaching the lease file

Scott Moser (smoser) wrote :

Un-duped again.

Ryan was on the right path... this is very close to bug 1821102.
But the lease there has an "rfc3442-classless-static-routes" entry
while Harry's has a 'classless-static-routes' entry.
And they're unfortunately different formats. Fun.

Examples collected in bugs:
  rfc3442-classless-static-routes 32,169,254,169,254,130,56,248,255,0,130,56,240,1;
  option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254;

I guess possibly the centos 7 dhclient is just formatting the response
differently.
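
For contrast, the comma-separated form follows the RFC 3442 wire encoding: a prefix-length octet, then only the significant destination octets, then four gateway octets. Below is a minimal sketch of decoding it in Python. The function name `parse_rfc3442` is hypothetical and is not cloud-init's API.

```python
def parse_rfc3442(value):
    """Parse an 'rfc3442-classless-static-routes' value from a lease file.

    Returns a list of (destination_cidr, gateway) tuples.
    """
    octets = [int(tok) for tok in value.strip().rstrip(";").split(",")]
    routes = []
    i = 0
    while i < len(octets):
        width = octets[i]
        if not 0 <= width <= 32:
            raise ValueError("bad prefix length: %d" % width)
        # Only ceil(width / 8) destination octets are encoded.
        significant = (width + 7) // 8
        end = i + 1 + significant + 4
        if end > len(octets):
            raise ValueError("truncated route entry")
        dest = octets[i + 1:i + 1 + significant] + [0] * (4 - significant)
        gateway = octets[i + 1 + significant:end]
        routes.append((
            "%s/%d" % (".".join(map(str, dest)), width),
            ".".join(map(str, gateway)),
        ))
        i = end
    return routes


print(parse_rfc3442("32,169,254,169,254,130,56,248,255,0,130,56,240,1"))
```

Run on the first example above, this yields a /32 route to 169.254.169.254 and a default route (prefix length 0), each with its own gateway.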

The lease that Harry posted looks like below:

lease {
  interface "eth0";
  fixed-address 136.156.90.65;
  option subnet-mask 255.255.254.0;
  option routers 136.156.91.254;
  option dhcp-lease-time 86400;
  option dhcp-message-type 5;
  option domain-name-servers 136.156.81.192;
  option dhcp-server-identifier 136.156.90.10;
  option interface-mtu 1500;
  option dhcp-renewal-time 43200;
  option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254;
  option broadcast-address 136.156.91.255;
  option dhcp-rebinding-time 75600;
  option host-name "host-136-156-90-65";
  option domain-name "openstacklocal";
  renew 4 2019/10/31 02:14:16;
  rebind 4 2019/10/31 12:15:38;
  expire 4 2019/10/31 15:15:38;
}

Scott Moser (smoser) wrote :

cloud-init version used was cloud-init-18.5-3.el7.centos.x86_64

Ryan Harper (raharper) wrote :

The value:

option classless-static-routes 32.169.254.169.254 136.156.90.10,0 136.156.91.254;

doesn't look valid to me. The first period seems to be misplaced; it should be either all comma-separated, or a space-separated list of IP/mask values.

Can we get the dhcp client's hook scripts? That's usually where the classless-static-routes parsing is handled.

Changed in cloud-init:
status: New → Incomplete

Ryan Harper (raharper) wrote :

https://bugzilla.redhat.com/show_bug.cgi?id=1109949

This suggests that this is another format, internal to the dhclient.

Once the system is up, can you include the output from:

% ip -d route show

And any dhclient related config or log files?

Ryan Harper (raharper) wrote :

I *think* the format is:

NETMASK.DESTINATION GATEWAY

Comma is the record separator

0 is alias for default gateway.

--
With that, we'd have two routes:

ip route add 169.254.169.254/32 via 136.156.90.10
ip route add default via 136.156.91.254

But I'd really like to see some source code of dhcp client to confirm
that this is the format being written.

Scott Moser (smoser) wrote :

@Harry, can you please post:
 $ rpm -q dhclient
 $ cat /usr/sbin/dhclient-script

then set the bug back to new.

summary: - Cloud init unable to find the metadata service but can CURL it
+ No support for classless-static-routes on centos 7

Harry Kominos (hkominos) wrote :

ip -d route show
unicast default via 136.156.91.254 dev ens3 proto boot scope global
unicast 136.156.90.0/23 dev ens3 proto kernel scope link src 136.156.90.31
unicast 169.254.169.254 via 136.156.90.11 dev ens3 proto static scope global

As for dhclient hooks, I see nothing in /etc/dhcp/. Is there some other location?

Harry Kominos (hkominos) wrote :

rpm -qa |grep dhclient
dhclient-4.2.5-77.el7.centos.x86_64

I have also attached the dhclient hook.

Changed in cloud-init:
status: Incomplete → New

Ryan Harper (raharper) wrote :

Thanks Harry,

With the dhclient-script, I've found the parsing logic. Cloud-init will need to handle this format as well.

Changed in cloud-init:
importance: Undecided → Medium
status: New → Triaged

Ryan Harper (raharper) wrote :

Looking at the hook script, I see the record separator is ', |'

 if [ -n "${new_classless_static_routes}" ]; then
                IFS=', |'

And then, each entry is either 0 (default route)
or, it's split on '.'

if [ ${target} = "0" ]; then
   new_routers="${static_routes[$i+1]}"
   continue
else
   prefix=${target%%.*}
   target=${target#*.}
   IFS="." target_arr=(${target})
   unset IFS
   ((pads=4-${#target_arr[@]}))
   for j in $(seq $pads); do
       target="${target}.0"
   done
fi
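
As a rough illustration of what handling this format could look like, here is a Python sketch that mirrors the hook-script excerpt above: split on the same separators, treat a bare 0 as the default route, and pad truncated destinations with zero octets. The function name `parse_dhclient_routes` is hypothetical. It is based only on the excerpt shown here and is not the fix that shipped in cloud-init.

```python
import re


def parse_dhclient_routes(value):
    """Parse the 'classless-static-routes' form seen in the CentOS 7 lease.

    Returns a list of (destination_cidr, gateway) tuples.
    """
    # Same separators as IFS=', |' in the hook script.
    tokens = [t for t in re.split(r"[, |]+", value.strip().rstrip(";")) if t]
    if len(tokens) % 2:
        raise ValueError("expected target/gateway pairs: %r" % value)
    routes = []
    for target, gateway in zip(tokens[0::2], tokens[1::2]):
        if target == "0":
            # A bare 0 stands for the default route.
            routes.append(("0.0.0.0/0", gateway))
            continue
        # Text before the first '.' is the prefix length, the rest is
        # the (possibly truncated) destination.
        prefix, _, dest = target.partition(".")
        octets = dest.split(".") if dest else []
        octets += ["0"] * (4 - len(octets))
        routes.append(("%s/%s" % (".".join(octets), prefix), gateway))
    return routes


lease_value = "32.169.254.169.254 136.156.90.10,0 136.156.91.254;"
for dest, gw in parse_dhclient_routes(lease_value):
    print(dest, "via", gw)
```

On the value from Harry's lease this produces the same two routes as the `ip route add` commands suggested earlier in the thread.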

Hi all, first time commenting, so please do critique :).

My opinion on the matter is that if we use dhclient to do the leasing, maybe we should use its interface to get the DNS option instead. That way we would avoid having to maintain a mapping of different naming conventions.

Now, I understand that dhclient doesn't expose such mapping and that it's decided at compile time;
elafontaine@centos7# strings /sbin/dhclient | grep -i classless
classless static routes option has wrong size or there's some garbage in format
classless-static-routes

Maybe we could find a way to interface with it by requesting our own set of options, to avoid that trouble?

For example, a dhclient.conf that has an include statement for the standard dhclient.conf?

Thoughts?

BTW, I fixed this issue on my side by having dhclient.conf just rename the classless-static-routes option to rfc3442-classless-static-routes.

My bad, rfc3442-classless-static-routes was working due to another hook. Just setting the option doesn't solve the issue.

This bug is believed to be fixed in cloud-init in version 19.4. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Triaged → Fix Released

Harry Kominos (hkominos) wrote :

Hi again.
I have rebuilt a CentOS image and integrated cloud-init 19.4+514.g4a4e26a4-1.el7 from https://copr.fedorainfracloud.org/coprs/g/cloud-init/el-testing/packages/

Again, I believe the same issue is present. No metadata source is found on the network.

Logs are provided from both of these bootups.

Harry Kominos (hkominos) wrote :

As a workaround, I force the metadata source to come as a disk, and that works as expected.
I have attached a log to show how I would expect cloud-init to work when the network is the default metadata source.

Harry Kominos (hkominos) wrote :

Bug re-opened

Changed in cloud-init:
status: Fix Released → New

Harry Kominos (hkominos) wrote :

Attaching the leases file that I get when I force the metadata source to be the network.

Harry Kominos (hkominos) wrote :

I am going to close this. The issue appears to be resolved. The logs above do not represent accurate findings, since the metadata service was not replying (for some reason).

Changed in cloud-init:
status: New → Fix Released