120 seconds timeout on OpenNebula running subp.py

Bug #2007149 reported by Ron
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Invalid
Undecided
Unassigned

Bug Description

Hi,

I'm using the latest AlmaLinux 9 GenericCloud image. This has cloud-init 22.1-5.el9.alma and works with OpenNebula. Booting is very slow on OpenNebula because of a 120 seconds timeout.

systemd-analyze blame give me:
2min 3.083s cloud-init-local.service

In de cloud-init.log I see:

<snip>
2023-02-09 11:30:10,244 - util.py[DEBUG]: Reading from /run/cloud-init/tmp/tmpac937qy0/context.sh (quiet=False)
2023-02-09 11:30:10,245 - util.py[DEBUG]: Read 3674 bytes from /run/cloud-init/tmp/tmpac937qy0/context.sh
2023-02-09 11:30:10,245 - subp.py[DEBUG]: Running command ['sudo', '-u', 'nobody', 'bash', '-e'] with allowed return codes [0] (shell=False, capture=True)
2023-02-09 11:32:10,667 - util.py[DEBUG]: Reading from /sys/class/net/eth0/addr_assign_type (quiet=False)
2023-02-09 11:32:10,668 - util.py[DEBUG]: Read 2 bytes from /sys/class/net/eth0/addr_assign_type
2023-02-09 11:32:10,669 - util.py[DEBUG]: Reading from /sys/class/net/eth0/uevent (quiet=False)
<snip>

See the two minute timeout after subp.py command.

When running the cloud-init-local.service manually when the machine is started I get the same logging but without the 120 seconds timeout.

Ron (rblom)
description: updated
Revision history for this message
Ron (rblom) wrote :
description: updated
Revision history for this message
Scott Moser (smoser) wrote :

OpenNebula passes network information ("Context" [1]) in shell syntax.
Cloud-init drops permissions to the 'nobody' user when parsing that content.

It drops the permissions using 'sudo'. I suspect that sudo is timing out on a dns lookup as networking is not configured at this point in boot. I've never felt it terribly useful for sudo to rely on dns, but it can depending on configuration [3].

The easy solutions for you are:
 * figure out why sudo is doing that and configure it to *not* do that.
 * hack cloud-init to not use 'sudo' there

The right solution is probably to implement a parser in cloud-init that does not involve executing 'sh' at all. (Sorry, it was me that wrote that the first time).

--
[1] https://archives.opennebula.org/documentation:archives:rel4.2:cong
[2] https://canonical-cloud-init.readthedocs-hosted.com/en/latest/reference/datasources/opennebula.html#opennebula
[3] https://serverfault.com/questions/38114/why-does-sudo-command-take-long-to-execute

Revision history for this message
Chad Smith (chad.smith) wrote :

Looking over journalctl logs as well as your network config in the attached cloud-init.log it looks like one of the following could lead to that sudo timeout on DNS resolution Scott mentioned.

 1. you have 3 DNS server IPs referenced in the context.sh content provided to this instance. It's possible that one of those IPs is down or hitting a timeout due to either routing problems on the configured network or one of those IPs not responding.
 2. it's possible that your sudo config doesn't have hostname-specific sections defined. If so, you may be able to get away with USERDATA providing the "manage_etc_hosts: true" which should add 127.0.0.1 <fqdn> <hostname> thereby short-circuiting some of the DNS lookups for your defined <...cloud> hostname

Two other things I note in your attached logs.
 1. cloud-init complains about cc_ntp failing (check cloud-init status --long).
With error
Stderr: Failed to reload-or-restart chrony.service: Unit chrony.service not found.

  This is due to misconfiguration I believe in user-data provided in context.sh.
From the instance user-data provided to this VM the ntp-related schema is incorrect:

ntp:
  enabled: true
  ntp_client: chrony
  service_name: chronyd
  servers:
    - ntp1.rinis.cloud
    - ntp2.rinis.cloud
    - ntp3.rinis.cloud

The service_name and services keys should live under a "config:" key like this:

ntp:
  enabled: true
  ntp_client: chrony
  config:
    service_name: chronyd
    servers:
      - ntp1.blah.cloud
      - ntp2.blah.cloud
      - ntp3.blah.cloud

2. runcmd is restarting chronyd directly (probably as a work around for the failed cc_ntp module run in #1)?

sudo cloud-init schema --system --annotate # on this system should help highlight some of those schema problems in the future. Though I don't think it handles non-ASCII character sets currently.

Revision history for this message
Chad Smith (chad.smith) wrote :

Agreed that there could be a better parsing of context.sh instead of shelling out. That said, this has been in place since 2013, I'd rather avoid the risk if this is a network config problem.

Revision history for this message
Ron (rblom) wrote :

Hi,

I hacked the cloud-init not to use sudo and then there's no timeout. So your first suggestion seems the right one. Will try to figure out why sudo is waiting for 120 seconds.

Tnx

Revision history for this message
Alberto Contreras (aciba) wrote :

Per the discussion, it looks like it is a cloud misconfiguration and I am going to mark it as invalid for the time being.

Changed in cloud-init:
status: New → Invalid
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.