hotplug causing cloud-init to spike CPU usage

Bug #1946003 reported by James Falcon
This bug affects 8 people
Affects              Status        Importance  Assigned to  Milestone
cloud-init           Fix Released  Critical    Unassigned
cloud-init (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

In 21.3, we added udev rules to enable the cloud-init hotplug functionality. When a new device is detected, we call into cloud-init to check whether hotplug is supported and enabled, then proceed based on the result. Some cloud users create and destroy Docker containers at a very high rate, which creates and removes many virtual Ethernet adapters. Each of these triggers a cloud-init event, and at high volume this consumes significant CPU. Even with hotplug disabled, the check for whether hotplug is enabled is enough to cause the CPU spikes.

The path taken is:
https://github.com/canonical/cloud-init/blob/main/udev/10-cloud-init-hook-hotplug.rules
to
https://github.com/canonical/cloud-init/blob/main/tools/hook-hotplug
to
https://github.com/canonical/cloud-init/blob/main/cloudinit/cmd/devel/hotplug_hook.py#L158

For more context, see IRC conversations from 10/1/2021 and 10/4/2021:
https://irclogs.ubuntu.com/2021/10/01/%23cloud-init.html
https://irclogs.ubuntu.com/2021/10/04/%23cloud-init.html

James Falcon (falcojr) wrote :

As far as a fix goes, I'm leaning towards not including the udev rule during install, then installing it during our normal boot process if we detect that hotplug has been enabled.

Another possible solution is to modify the script called by the udev event to only trigger if we detect a PCI device, but IIRC that won't work on all clouds as some clouds expose their new devices as virtual devices.
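
A minimal sketch of the device-filtering idea, assuming the hook has access to the udev DEVPATH of the event and that container veth interfaces show up under /devices/virtual/. The function name and the sample paths are hypothetical, not taken from cloud-init. As noted above, a filter like this would also drop real hotplugged NICs on clouds that expose them as virtual devices.

```python
def is_virtual_device(devpath: str) -> bool:
    """Return True when a udev DEVPATH sits under /devices/virtual/."""
    return devpath.startswith("/devices/virtual/")


# Hypothetical DEVPATH values, for illustration only.
events = [
    "/devices/virtual/net/veth1a2b3c4",           # container veth pair
    "/devices/pci0000:00/0000:00:05.0/net/eth1",  # PCI-attached NIC
]

# Only non-virtual devices would be passed on to the hotplug handler.
to_handle = [path for path in events if not is_virtual_device(path)]
```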

Changed in cloud-init:
status: New → Triaged
importance: Undecided → Critical
Ryan Harper (raharper) wrote :

> As far as a fix goes, I'm leaning towards not including the udev rule during
> install, then installing it during our normal boot process if we detect that
> hotplug has been enabled.

I think this makes a lot of sense. IIRC, we don't attempt to handle hotplug
events on firstboot, so it's reasonable to write the new udev rule if enabled
and reload rules (udevadm control --reload).
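
A rough sketch of that boot-time step, assuming cloud-init already knows whether hotplug is enabled. The function name, the placeholder rule text, and the injectable reload command are hypothetical; this is not the code from the eventual fix.

```python
import subprocess
from pathlib import Path

# Placeholder text; the real rule is the one shipped in the repo's udev/ directory.
RULE_TEXT = "# placeholder for the hotplug udev rule\n"


def install_hotplug_rule(
    enabled: bool,
    rule_path: Path,
    reload_cmd=("udevadm", "control", "--reload"),
) -> bool:
    """Write the udev rule and reload udev, but only when hotplug is enabled.

    Returns True if the rule was (re)written and udev was reloaded.
    """
    if not enabled:
        # Hotplug is off: leave udev untouched so device events cost nothing.
        return False
    if rule_path.exists() and rule_path.read_text() == RULE_TEXT:
        # Rule already in place; skip the reload.
        return False
    rule_path.parent.mkdir(parents=True, exist_ok=True)
    rule_path.write_text(RULE_TEXT)
    subprocess.run(list(reload_cmd), check=True)
    return True
```

With the rule absent on systems where hotplug is disabled, udev never invokes the hook, so high rates of veth creation cost nothing on the cloud-init side.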

Another optimization for the rule would be to have it not invoke cloud-init
directly to determine if hotplug is enabled (python3 is a heavy exec).

When cloud-init checks for hotplug config, it can serialize the current status
of hotplug into /run/cloud-init. Like cloud-init.disabled, we could also have a
marker file that the hook can check in the shell script, to avoid any exec of
python at all.
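
A minimal sketch of the cloud-init side of that marker-file idea. The file name "hotplug.enabled" and the function name are assumptions for illustration; the suggestion above does not name them.

```python
from pathlib import Path


def write_hotplug_marker(enabled: bool, run_dir: Path) -> Path:
    """Create or remove a marker file that reflects the hotplug setting."""
    marker = run_dir / "hotplug.enabled"  # hypothetical file name
    if enabled:
        run_dir.mkdir(parents=True, exist_ok=True)
        marker.touch()
    else:
        try:
            marker.unlink()
        except FileNotFoundError:
            # Already absent, which is the state we want.
            pass
    return marker
```

The shell hook could then test for the file (for example with `[ -e ... ]`) and exit early when it is missing, so python3 is only started when hotplug is actually enabled.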

Geoffrey Goodman (ggoodman) wrote :

As one of the users affected by this performance regression, I like the proposed solutions.

Certainly avoiding the udev rule entirely when `cloud-init` is otherwise configured to disregard hotplug events seems like the best long-term solution. However, I can also appreciate a less invasive short-term fix that might be scoped to avoiding the heavy python exec via shell scripting.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu):
status: New → Confirmed
Chad Smith (chad.smith) wrote :
James Falcon (falcojr)
Changed in cloud-init:
status: Triaged → Fix Committed
Chad Smith (chad.smith) wrote :

Upstream commit has landed addressing this issue:
https://github.com/canonical/cloud-init/commit/1d01da5d9916d97ef463ba61a36b3f98f8911419

Expect this to be available in cloud-init version 21.4.

James Falcon (falcojr) wrote : Fixed in cloud-init version 21.4.

This bug is believed to be fixed in cloud-init in version 21.4. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 21.4-0ubuntu1~22.04.1

---------------
cloud-init (21.4-0ubuntu1~22.04.1) jammy; urgency=medium

  * d/upstream/metadata: Change contact to James Falcon
  * d/cloud-init.templates: Add LXD to default datasource_list with
    translations
  * drop the following cherry-picks now included:
    + cpick-28e56d99-Azure-Retry-dhcp-on-timeouts-when-polling
    + cpick-e69a8874-Set-Azure-to-only-update-metadata-on-BOOT_NEW_INSTANCE
    + cpick-612e3908-Add-connectivity_url-to-Oracle-s-EphemeralDHCPv4-988
    + cpick-dc227869-Set-Azure-to-apply-networking-config-every-BOOT-1023
    + cpick-9c147e83-Allow-disabling-of-network-activation-SC-307-1048
  * New upstream release.
    - Release 21.4 (#1091) (LP: #1949405)
    - Azure: fallback nic needs to be reevaluated during reprovisioning
      (#1094) [Anh Vo]
    - azure: pps imds (#1093) [Anh Vo]
    - testing: Remove calls to 'install_new_cloud_init' (#1092)
    - Add LXD datasource (#1040)
    - Fix unhandled apt_configure case. (#1065) [Brett Holman]
    - Allow libexec for hotplug (#1088)
    - Add necessary mocks to test_ovf unit tests (#1087)
    - Remove (deprecated) apt-key (#1068) [Brett Holman] (LP: #1836336)
    - distros: Remove a completed "TODO" comment (#1086)
    - cc_ssh.py: Add configuration for controlling ssh-keygen output (#1083)
      [dermotbradley]
    - Add "install hotplug" module (SC-476) (#1069) (LP: #1946003)
    - hosts.alpine.tmpl: rearrange the order of short and long hostnames
      (#1084) [dermotbradley]
    - Add max version to docutils
    - cloudinit/dmi.py: Change warning to debug to prevent console display
      (#1082) [dermotbradley]
    - remove unnecessary EOF string in
      disable-sshd-keygen-if-cloud-init-active.conf (#1075) [Emanuele
      Giuseppe Esposito]
    - Add module 'write-files-deferred' executed in stage 'final' (#916)
      [Lucendio]
    - Bump pycloudlib to fix CI (#1080)
    - Remove pin in dependencies for jsonschema (#1078)
    - Add "Google" as possible system-product-name (#1077) [vteratipally]
    - Update Debian security suite for bullseye (#1076) [Johann Queuniet]
    - Leave the details of service management to the distro (#1074)
      [Andy Fiddaman]
    - Fix typos in setup.py (#1059) [Christian Clauss]
    - Update Azure _unpickle (SC-500) (#1067) (LP: #1946644)
    - cc_ssh.py: fix private key group owner and permissions (#1070)
      [Emanuele Giuseppe Esposito]
    - VMware: read network-config from ISO (#1066) [Thomas Weißschuh]
    - testing: mock sleep in gce unit tests (#1072)
    - CloudStack: fix data-server DNS resolution (#1004)
      [Olivier Lemasle] (LP: #1942232)
    - Fix unit test broken by pyyaml upgrade (#1071)
    - testing: add get_cloud function (SC-461) (#1038)
    - Inhibit sshd-keygen@.service if cloud-init is active (#1028)
      [Ryan Harper]
    - VMWARE: search the deployPkg plugin in multiarch dir (#1061)
      [xiaofengw-vmware] (LP: #1944946)
    - Fix set-name/interface DNS bug (#1058) [Andrew Kutz] (LP: #1946493)
    - Use specified tmp location for growpart (#1046) [jshen28]
    - .gitignore: ignore tags file for ctags users (#1057) [Brett Holman]
 ...


Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Paolo Pettinato (p.pettinato) wrote :

Thank you @falcojr et al.
Is there any chance this fix/version will be backported to the "updates" stream of older LTS releases?

James Falcon (falcojr) wrote :

Yes, the fix will be backported to -updates in Bionic, Focal, Hirsute, and Impish. That could happen as soon as today or early next week. The tracking bug for that is https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1949521
