cloud-init

120 seconds timeout on OpenNebula running subp.py

Bug #2007149 reported by Ron on 2023-02-13

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	cloud-init	Invalid	Undecided	Unassigned

Bug Description

Hi,

I'm using the latest AlmaLinux 9 GenericCloud image. This has cloud-init 22.1-5.el9.alma and works with OpenNebula. Booting is very slow on OpenNebula because of a 120 seconds timeout.

systemd-analyze blame give me:
2min 3.083s cloud-init-local.service

In de cloud-init.log I see:

<snip>
2023-02-09 11:30:10,244 - util.py[DEBUG]: Reading from /run/cloud-init/tmp/tmpac937qy0/context.sh (quiet=False)
2023-02-09 11:30:10,245 - util.py[DEBUG]: Read 3674 bytes from /run/cloud-init/tmp/tmpac937qy0/context.sh
2023-02-09 11:30:10,245 - subp.py[DEBUG]: Running command ['sudo', '-u', 'nobody', 'bash', '-e'] with allowed return codes [0] (shell=False, capture=True)
2023-02-09 11:32:10,667 - util.py[DEBUG]: Reading from /sys/class/net/eth0/addr_assign_type (quiet=False)
2023-02-09 11:32:10,668 - util.py[DEBUG]: Read 2 bytes from /sys/class/net/eth0/addr_assign_type
2023-02-09 11:32:10,669 - util.py[DEBUG]: Reading from /sys/class/net/eth0/uevent (quiet=False)
<snip>

See the two minute timeout after subp.py command.

When running the cloud-init-local.service manually when the machine is started I get the same logging but without the 120 seconds timeout.

See original description

Ron (rblom) on 2023-02-13

description:

updated

Revision history for this message

Ron (rblom) wrote on 2023-02-13:

cloud-init.tar.gz Edit (96.6 KiB, application/x-tar)

description:

updated

Revision history for this message

Scott Moser (smoser) wrote on 2023-02-13:

OpenNebula passes network information ("Context" [1]) in shell syntax.
Cloud-init drops permissions to the 'nobody' user when parsing that content.

It drops the permissions using 'sudo'. I suspect that sudo is timing out on a dns lookup as networking is not configured at this point in boot. I've never felt it terribly useful for sudo to rely on dns, but it can depending on configuration [3].

The easy solutions for you are:
* figure out why sudo is doing that and configure it to *not* do that.
* hack cloud-init to not use 'sudo' there

The right solution is probably to implement a parser in cloud-init that does not involve executing 'sh' at all. (Sorry, it was me that wrote that the first time).

--
[1] https://archives.opennebula.org/documentation:archives:rel4.2:cong
[2] https://canonical-cloud-init.readthedocs-hosted.com/en/latest/reference/datasources/opennebula.html#opennebula
[3] https://serverfault.com/questions/38114/why-does-sudo-command-take-long-to-execute

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-02-15:

Looking over journalctl logs as well as your network config in the attached cloud-init.log it looks like one of the following could lead to that sudo timeout on DNS resolution Scott mentioned.

1. you have 3 DNS server IPs referenced in the context.sh content provided to this instance. It's possible that one of those IPs is down or hitting a timeout due to either routing problems on the configured network or one of those IPs not responding.
2. it's possible that your sudo config doesn't have hostname-specific sections defined. If so, you may be able to get away with USERDATA providing the "manage_etc_hosts: true" which should add 127.0.0.1 <fqdn> <hostname> thereby short-circuiting some of the DNS lookups for your defined <...cloud> hostname

Two other things I note in your attached logs.
1. cloud-init complains about cc_ntp failing (check cloud-init status --long).
With error
Stderr: Failed to reload-or-restart chrony.service: Unit chrony.service not found.

This is due to misconfiguration I believe in user-data provided in context.sh.
From the instance user-data provided to this VM the ntp-related schema is incorrect:

ntp:
  enabled: true
  ntp_client: chrony
  service_name: chronyd
  servers:
    - ntp1.rinis.cloud
    - ntp2.rinis.cloud
    - ntp3.rinis.cloud

The service_name and services keys should live under a "config:" key like this:

ntp:
  enabled: true
  ntp_client: chrony
  config:
    service_name: chronyd
    servers:
      - ntp1.blah.cloud
      - ntp2.blah.cloud
      - ntp3.blah.cloud

2. runcmd is restarting chronyd directly (probably as a work around for the failed cc_ntp module run in #1)?

sudo cloud-init schema --system --annotate # on this system should help highlight some of those schema problems in the future. Though I don't think it handles non-ASCII character sets currently.

Revision history for this message

Chad Smith (chad.smith) wrote on 2023-02-15:

Agreed that there could be a better parsing of context.sh instead of shelling out. That said, this has been in place since 2013, I'd rather avoid the risk if this is a network config problem.

Revision history for this message

Ron (rblom) wrote on 2023-02-17:

Hi,

I hacked the cloud-init not to use sudo and then there's no timeout. So your first suggestion seems the right one. Will try to figure out why sudo is waiting for 120 seconds.

Tnx

Revision history for this message

Alberto Contreras (aciba) wrote on 2023-02-23:

Per the discussion, it looks like it is a cloud misconfiguration and I am going to mark it as invalid for the time being.