cloud-init is slow to complete init on minimized images

Bug #1727358 reported by Mathieu Trudel-Lapierre
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
cloud-init
Won't Fix
Undecided
Unassigned
cloud-init (Ubuntu)
Won't Fix
Wishlist
Unassigned
linux-kvm (Ubuntu)
Confirmed
Undecided
Unassigned
python3.6 (Ubuntu)
Triaged
Wishlist
Unassigned

Bug Description

http://paste.ubuntu.com/25816789/ for the full logs.

cloud-init is very slow to complete its initialization steps. Specifically, the 'init' takes over 150 seconds.

Cloud-init v. 17.1 running 'init-local' at Wed, 25 Oct 2017 13:22:07 +0000. Up 2.39 seconds.
2017-10-25 13:22:07,157 - util.py[WARNING]: did not find either path /sys/class/dmi/id or dmidecode command
Cloud-init v. 17.1 running 'init' at Wed, 25 Oct 2017 13:22:16 +0000. Up 11.37 seconds.
ci-info: ++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------------+---------------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+-------+-----------------+---------------+-------+-------------------+
ci-info: | ens3: | True | 192.168.100.161 | 255.255.255.0 | . | 52:54:00:bb:ad:fb |
ci-info: | ens3: | True | . | . | d | 52:54:00:bb:ad:fb |
ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . | . |
ci-info: | lo: | True | . | . | d | . |
ci-info: | sit0: | False | . | . | . | . |
ci-info: +--------+-------+-----------------+---------------+-------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++++
ci-info: +-------+---------------+---------------+-----------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+---------------+---------------+-----------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 192.168.100.1 | 0.0.0.0 | ens3 | UG |
ci-info: | 1 | 192.168.100.0 | 0.0.0.0 | 255.255.255.0 | ens3 | U |
ci-info: | 2 | 192.168.100.1 | 0.0.0.0 | 255.255.255.255 | ens3 | UH |
ci-info: +-------+---------------+---------------+-----------------+-----------+-------+
2017-10-25 13:24:38,187 - util.py[WARNING]: Failed to resize filesystem (cmd=('resize2fs', '/dev/root'))
2017-10-25 13:24:38,193 - util.py[WARNING]: Running module resizefs (<module 'cloudinit.config.cc_resizefs' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_resizefs.py'>) failed
Generating public/private rsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
The key fingerprint is:
SHA256:LKNlCqqOgPB8KBKGfPhFO5Rs6fDMnAvVet/W9i4vLxY root@cloudimg
The key's randomart image is:
+---[RSA 2048]----+
| |
| . + |
| . O . |
|o . % +. |
|++.o %=.S |
|+=ooo=+o. . .E |
|* +.+. . o o. |
|=. . . .=. |
|+. . B= |
+----[SHA256]-----+
Generating public/private dsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_dsa_key.
Your public key has been saved in /etc/ssh/ssh_host_dsa_key.pub.
The key fingerprint is:
SHA256:dNWNyBHqTUCl820/vL0dEhOVDFYJzqr1WeuqV1PAmjk root@cloudimg
The key's randomart image is:
+---[DSA 1024]----+
| .oo=X==o|
| =* *+.|
| . = .B . |
| . o =E.. .|
| S .oo+o..|
| o ..*+.|
| . +.=o|
| .o *|
| .o..++|
+----[SHA256]-----+
Generating public/private ecdsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_ecdsa_key.
Your public key has been saved in /etc/ssh/ssh_host_ecdsa_key.pub.
The key fingerprint is:
SHA256:N3RTlPa7KU5ryq6kJAO8Tiq90ub4P1DGSofn6jFkM3k root@cloudimg
The key's randomart image is:
+---[ECDSA 256]---+
| .o. |
| .o |
| o . o. . |
| +.* . . . .|
| .*XE S o .|
| oo++ . . . |
| oo= o . . . o|
|o.Oo. + o . .o.o |
|oB=+.. . .o++o. |
+----[SHA256]-----+
Generating public/private ed25519 key pair.
Your identification has been saved in /etc/ssh/ssh_host_ed25519_key.
Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub.
The key fingerprint is:
SHA256:B6B/1SyUB/TFPfmU1ADIotnD+wjISO5L2JZkWe2lDUA root@cloudimg
The key's randomart image is:
+--[ED25519 256]--+
| .E . o++.o+++|
| + ...++...++|
| o o=o.o.+ .o|
| .o oo=+o . .|
| o+o .+ So. |
| =o.o .... |
|..= . o |
| o. . . |
| .. |
+----[SHA256]-----+
sudo: unable to resolve host cloudimg: Resource temporarily unavailable
Cloud-init v. 17.1 running 'modules:config' at Wed, 25 Oct 2017 13:24:39 +0000. Up 155.13 seconds.
2017-10-25 13:24:40,289 - util.py[WARNING]: Failed to run command to import ubuntu ssh ids
2017-10-25 13:24:40,291 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['cyphermox']
2017-10-25 13:24:40,292 - util.py[WARNING]: Running module ssh-import-id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 17.1 running 'modules:final' at Wed, 25 Oct 2017 13:24:41 +0000. Up 156.51 seconds.
ci-info: no authorized ssh keys fingerprints found for user ubuntu.
Cloud-init v. 17.1 finished at Wed, 25 Oct 2017 13:24:41 +0000. Datasource DataSourceNoCloud [seed=/dev/vdb][dsmode=net]. Up 156.83 seconds

This does not seem to be caused by ssh key generation (I tried to disable it, with no keytypes enabled there is still > 150 s delay), not does it seem to be related to the resizing operation.

I suspect some other step requiring randomness to generate some kind of ID, but I don't know.

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

Could you run
  cloud-init collect-logs
And then attach the cloud-init.tar.gz

Also, Can you provide some information on what you were running?
 " cloud-init is slow to complete init on minimized images "
How can I recreate this?

One curious thing there is:
 2017-10-25 13:22:07,157 - util.py[WARNING]: did not find either path /sys/class/dmi/id or dmidecode command

I suspect you have a kernel without CONFIG_DMI, which seems unfortunate or
possibly you're not on intel or arm64.

For large jumps in your log, it took ~ 9 seconds (13:22:07,337 -> 13:22:16,112) from the exit of cloud-init-local.service to get to 'cloud-init.service' printing its hello message. That is basically the time it took the network to come up.

Then we have a big jump (122 seconds):
 13:22:16,264 main.py[DEBUG]: no di_report found in config.
 13:24:38,088 stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>

Those two lines in a vm I have on a openstack look like this:
2017-10-11 15:08:26,685 - main.py[DEBUG]: no di_report found in config.
2017-10-11 15:08:27,031 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.ubuntu.Distro'>
2017-10-11 15:08:27,032 - stages.py[DEBUG]: Running module migrator (<module 'cloudinit.config.cc_migrator' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_migrator.py'>) with frequency always

That is ~ .5 seconds, which is not fast, but not 120 seconds either.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

This is running on *minimized* images, with linux-kvm kernel; so there might well not be CONFIG_DMI. This is however not the issue (or most likely not). Tests are done in qemu, to simulate running the images in openstack for now (but they are meant to start in openstack without issues). It's simply more convient to test quickly.

An important point is that if I add a qemu RNG to the VM, the cloud-init 'init' steo takes only 2.071 seconds. It appears as though some part of the initialization might depend on sufficient entropy being available.

I will attach the log tarball shortly.

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :
Revision history for this message
Scott Moser (smoser) wrote :

Please explain how you did this.
What did you download, what qemu commands did you run.

Thanks.

Revision history for this message
Scott Moser (smoser) wrote :

possibly related, but was fixed:
 https://bugs.python.org/issue25420

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 1727358] Re: cloud-init is slow to complete init on minimized images

On Thu, Oct 26, 2017 at 01:43:12AM -0000, Scott Moser wrote:
> possibly related, but was fixed:
> https://bugs.python.org/issue25420

Does cloud-init itself consume entropy when it runs?

Revision history for this message
Scott Moser (smoser) wrote :
Download full text (6.9 KiB)

Only via 'import random' and light usage.

On Oct 25, 2017 11:21 PM, "Steve Langasek" <email address hidden>
wrote:

> On Thu, Oct 26, 2017 at 01:43:12AM -0000, Scott Moser wrote:
> > possibly related, but was fixed:
> > https://bugs.python.org/issue25420
>
> Does cloud-init itself consume entropy when it runs?
>
> --
> You received this bug notification because you are a member of Canonical
> Cloudware, which is subscribed to cloud-init.
> Matching subscriptions: cloud-init bugs
> https://bugs.launchpad.net/bugs/1727358
>
> Title:
> cloud-init is slow to complete init on minimized images
>
> Status in cloud-init:
> Incomplete
> Status in cloud-init package in Ubuntu:
> Incomplete
>
> Bug description:
> http://paste.ubuntu.com/25816789/ for the full logs.
>
> cloud-init is very slow to complete its initialization steps.
> Specifically, the 'init' takes over 150 seconds.
>
> Cloud-init v. 17.1 running 'init-local' at Wed, 25 Oct 2017 13:22:07
> +0000. Up 2.39 seconds.
> 2017-10-25 13:22:07,157 - util.py[WARNING]: did not find either path
> /sys/class/dmi/id or dmidecode command
> Cloud-init v. 17.1 running 'init' at Wed, 25 Oct 2017 13:22:16 +0000. Up
> 11.37 seconds.
> ci-info: ++++++++++++++++++++++++++++++++Net device
> info+++++++++++++++++++++++++++++++++
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: | Device | Up | Address | Mask | Scope |
> Hw-Address |
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: | ens3: | True | 192.168.100.161 | 255.255.255.0 | . |
> 52:54:00:bb:ad:fb |
> ci-info: | ens3: | True | . | . | d |
> 52:54:00:bb:ad:fb |
> ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . |
> . |
> ci-info: | lo: | True | . | . | d |
> . |
> ci-info: | sit0: | False | . | . | . |
> . |
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: ++++++++++++++++++++++++++++++++Route IPv4
> info++++++++++++++++++++++++++++++++
> ci-info: +-------+---------------+---------------+-----------------+-
> ----------+-------+
> ci-info: | Route | Destination | Gateway | Genmask |
> Interface | Flags |
> ci-info: +-------+---------------+---------------+-----------------+-
> ----------+-------+
> ci-info: | 0 | 0.0.0.0 | 192.168.100.1 | 0.0.0.0 |
> ens3 | UG |
> ci-info: | 1 | 192.168.100.0 | 0.0.0.0 | 255.255.255.0 |
> ens3 | U |
> ci-info: | 2 | 192.168.100.1 | 0.0.0.0 | 255.255.255.255 |
> ens3 | UH |
> ci-info: +-------+---------------+---------------+-----------------+-
> ----------+-------+
> 2017-10-25 13:24:38,187 - util.py[WARNING]: Failed to resize filesystem
> (cmd=('resize2fs', '/dev/root'))
> 2017-10-25 13:24:38,193 - util.py[WARNING]: Running module resizefs
> (<module 'cloudinit.config.cc_resizefs' from '/usr/lib/python3/dist-
> packages/cloudinit/config/...

Read more...

Revision history for this message
Steve Langasek (vorlon) wrote :

On Thu, Oct 26, 2017 at 10:37:37AM -0000, Scott Moser wrote:
> Only via 'import random' and light usage.

Even "light" usage is going to block at this point in the boot; can you give
me a pointer into where this is in the code, so I can try to assess if this
should be avoided?

Revision history for this message
Scott Moser (smoser) wrote :
Download full text (7.4 KiB)

You can grep through, the onlyi real function taht uses it is 'rand_str'.
and i'm not sure really why that would be called at the point that you're
seeing the slow down.

I think it more valuable for you to figure out why your environment is
causing this issue and fixing that rather than fixing cloud-init (and
subsequently all users of entropy).

Also, information about how you've come across this bug would be good.

On Thu, Oct 26, 2017 at 11:24 AM, Steve Langasek <
<email address hidden>> wrote:

> On Thu, Oct 26, 2017 at 10:37:37AM -0000, Scott Moser wrote:
> > Only via 'import random' and light usage.
>
> Even "light" usage is going to block at this point in the boot; can you
> give
> me a pointer into where this is in the code, so I can try to assess if this
> should be avoided?
>
> --
> You received this bug notification because you are a member of Canonical
> Cloudware, which is subscribed to cloud-init.
> Matching subscriptions: cloud-init bugs
> https://bugs.launchpad.net/bugs/1727358
>
> Title:
> cloud-init is slow to complete init on minimized images
>
> Status in cloud-init:
> Incomplete
> Status in cloud-init package in Ubuntu:
> Incomplete
>
> Bug description:
> http://paste.ubuntu.com/25816789/ for the full logs.
>
> cloud-init is very slow to complete its initialization steps.
> Specifically, the 'init' takes over 150 seconds.
>
> Cloud-init v. 17.1 running 'init-local' at Wed, 25 Oct 2017 13:22:07
> +0000. Up 2.39 seconds.
> 2017-10-25 13:22:07,157 - util.py[WARNING]: did not find either path
> /sys/class/dmi/id or dmidecode command
> Cloud-init v. 17.1 running 'init' at Wed, 25 Oct 2017 13:22:16 +0000. Up
> 11.37 seconds.
> ci-info: ++++++++++++++++++++++++++++++++Net device
> info+++++++++++++++++++++++++++++++++
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: | Device | Up | Address | Mask | Scope |
> Hw-Address |
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: | ens3: | True | 192.168.100.161 | 255.255.255.0 | . |
> 52:54:00:bb:ad:fb |
> ci-info: | ens3: | True | . | . | d |
> 52:54:00:bb:ad:fb |
> ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . |
> . |
> ci-info: | lo: | True | . | . | d |
> . |
> ci-info: | sit0: | False | . | . | . |
> . |
> ci-info: +--------+-------+-----------------+---------------+-------+
> -------------------+
> ci-info: ++++++++++++++++++++++++++++++++Route IPv4
> info++++++++++++++++++++++++++++++++
> ci-info: +-------+---------------+---------------+-----------------+-
> ----------+-------+
> ci-info: | Route | Destination | Gateway | Genmask |
> Interface | Flags |
> ci-info: +-------+---------------+---------------+-----------------+-
> ----------+-------+
> ci-info: | 0 | 0.0.0.0 | 192.168.100.1 | 0.0.0.0 |
> ens3 | UG |
> ci-info: | 1 | 192.168.100.0 | 0.0.0.0 | 255.255.255.0 |
> ens3 ...

Read more...

Revision history for this message
Scott Moser (smoser) wrote :

You can grep through, the onlyi real function taht uses it is 'rand_str'. and i'm not sure really why that would be called at the point that you're seeing the slow down.

I think it more valuable for you to figure out why your environment is causing this issue and fixing that rather than fixing cloud-init (and subsequently all users of entropy).

Also, information about how you've come across this bug would be good.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Thu, Oct 26, 2017 at 03:56:24PM -0000, Scott Moser wrote:
> You can grep through, the onlyi real function taht uses it is
> 'rand_str'. and i'm not sure really why that would be called at the
> point that you're seeing the slow down.

> I think it more valuable for you to figure out why your environment is
> causing this issue and fixing that rather than fixing cloud-init (and
> subsequently all users of entropy).

We know that Ubuntu Cloud images must function in entropy-poor environments;
that's why we ship pollinate by default. If this particular delay is due to
cloud-init (or something it invoked) consuming entropy in early boot and
blocking, that is a problem, even if we do resolve the lack of an in-kernel
entropy source for this particular image and use case.

> Also, information about how you've come across this bug would be good.

I don't think I have ever seen the boot delay Mathieu is reporting, even
using the same image. So we're dependent on him to provide a reproducer.

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

You can easily reproduce this by running the following image in virt-manager with a default setup (and optionally add a RNG device if necessary):

https://launchpad.net/~cyphermox/+livefs/ubuntu/artful/ubuntu-cpc/+build/113428/+files/livecd.ubuntu-cpc.img

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

You'll want to add an additional disk to provide the metadata for cloud-init too. The exact iso file I have used is attached.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Thu, Oct 26, 2017 at 06:07:57PM -0000, Mathieu Trudel-Lapierre wrote:
> You can easily reproduce this by running the following image in virt-
> manager with a default setup (and optionally add a RNG device if
> necessary):

The defaults in both virt-manager and libvirt change over time. For
reproducibility, it's helpful to have either a libvirt xml, or for maximum
reproducibility, a full qemu commandline from a running instance.

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

It really is *defaults* though, because you don't want to have any extra devices included that could throw off the boot (such as a RNG, which should not be added by default). In any case, I'll attach the xml I use as well, but I reproduce the build on a typical artful install.

The only important piece is to make sure the disks are VirtIO instead of IDE.

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

Libvirt XML for the VM.

Revision history for this message
Scott Moser (smoser) wrote :

this does also reproduce:
 qemu-system-x86_64 -enable-kvm \
    -device virtio-scsi-pci,id=virtio-scsi-xkvm \
    -device virtio-net-pci,netdev=net00 -netdev type=user,id=net00 \
    -drive file=my.img,id=disk00,if=none,format=qcow2,index=0 \
    -device virtio-blk,drive=disk00,serial=my.img \
    -drive file=my-seed.img,id=disk01,if=none,format=raw,index=1 \
    -device virtio-blk,drive=disk01,serial=my-seed.img \
    -snapshot -nographic -m 768

I don't plan on spending any more time on this unless there is some indication that cloud-init is doing something wrong. It really does seem to me to be entropy related, and some optimization in this image has broken what was functional elsewhere.

Revision history for this message
Scott Moser (smoser) wrote :

I've taken your kernel (vmlinuz-4.4.0-1009-kvm) and booted a "normal" cloud image with that kernel.
It hangs similarly (but for noticibly shorter time).

http://cloud-images.ubuntu.com/daily/server/artful/20171026.1/artful-server-cloudimg-amd64.img

I did have to disable the UEFI partition (vfat) since there were modules for your kernel in /lib.

$ qemu-img create -f qcow2 -b artful-cloud-image.img my.img
$ sudo mount-image-callback -vv my.img -- mchroot sed -i "/LABEL=UEFI/s/^/#/" /etc/fstab

$ xkvm --netdev=user --disk=my.img --disk=my-seed.img
-- -snapshot -m 768 -nographic -echr 0x05 -smp 2 -kernel vmlinuz-4.4.0-1009-kv
m -append "root=PARTUUID=9697fbdd-a35d-46d6-a51c-08bfab833468 console=ttyS0 syst
emd.log_target=console systemd.log_level=info systemd.log_target=console"

 qemu-system-x86_64 -enable-kvm -device virtio-scsi-pci,id=virtio-scsi-xkvm -device virtio-net-pci,netdev=net00 -netdev type=user,id=net00 -drive file=my.img,id=disk00,if=none,format=qcow2,index=0 -device virtio-blk,drive=disk00,serial=my.img -drive file=my-seed.img,id=disk01,if=none,format=raw,index=1 -device virtio-blk,drive=disk01,serial=my-seed.img -snapshot -m 768 -nographic -echr 0x05 -smp 2 -kernel vmlinuz-4.4.0-1009-kvm -append "root=PARTUUID=9697fbdd-a35d-46d6-a51c-08bfab833468 console=ttyS0 systemd.log_target=console systemd.log_level=info systemd.log_target=console"

...

log looks like this, and hangs there.
 http://paste.ubuntu.com/25832616/

Heres a diff of configs for the not working kernel and the working kernel.
$ diff -u config-4.4.0-59-generic config-4.4.0-1009-kvm |
   pastebinit -f diff
http://paste.ubuntu.com/25825403/

Revision history for this message
Steve Langasek (vorlon) wrote :
Download full text (10.2 KiB)

I've tracked down the blocking use of randomness in cloud-init's python module initialization. There are several nonblocking calls to getrandom(), which I believe should be fine; then a blocking call in the multiprocessing module.

# PYTHONHASHSEED=1 gdb --args /usr/bin/python3-dbg /usr/bin/cloud-init init
GNU gdb (Ubuntu 8.0.1-0ubuntu1) 8.0.1
[...]
(gdb) break getrandom
Breakpoint 1 at 0x420b90
(gdb) python
>import sys
>sys.path.append('/usr/share/doc/python3.6/examples/gdb')
>import libpython
>end
(gdb) run
Starting program: /usr/bin/python3-dbg /usr/bin/cloud-init init
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, getrandom (buffer=buffer@entry=0x7fffffff8140,
    length=length@entry=2496, flags=flags@entry=1)
    at ../sysdeps/unix/sysv/linux/getrandom.c:30
[...]
(gdb) cont 4
Will ignore next 3 crossings of breakpoint 1. Continuing.
[...]
-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface |
 Flags |
ci-info: +-------+---------------+---------------+-----------------+-----------+
-------+
ci-info: | 0 | 0.0.0.0 | 192.168.122.1 | 0.0.0.0 | ens3 |
   UG |
ci-info: | 1 | 192.168.122.0 | 0.0.0.0 | 255.255.255.0 | ens3 |
   U |
ci-info: | 2 | 192.168.122.1 | 0.0.0.0 | 255.255.255.255 | ens3 |
   UH |
ci-info: +-------+---------------+---------------+-----------------+-----------+
-------+
/usr/lib/python3.6/logging/config.py:85: ResourceWarning: unclosed file <_io.Tex
tIOWrapper name='/var/log/cloud-init.log' mode='a' encoding='UTF-8'>
  _install_loggers(cp, handlers, disable_existing_loggers)

Breakpoint 1, getrandom (buffer=buffer@entry=0x7ffff21b7340,
    length=length@entry=32, flags=flags@entry=0)
    at ../sysdeps/unix/sysv/linux/getrandom.c:30
30 in ../sysdeps/unix/sysv/linux/getrandom.c
(gdb) py-bt
Traceback (most recent call first):
  0x7ffff6b66b20
  File "/usr/lib/python3.6/multiprocessing/process.py", line 307, in __init__
    self._config = {'authkey': AuthenticationString(os.urandom(32)),
  File "/usr/lib/python3.6/multiprocessing/process.py", line 320, in <module>
    _current_process = _MainProcess()
  <built-in method exec of module object at remote 0x7ffff6bb9cd8>
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  <built-in method __import__ of module object at remote 0x7ffff6bb9cd8>
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1017, in _handle_fromlist
  File "/usr/lib/python3.6/multiprocessing/context.py", line 5, in <module>
    from . import process
  <built-in method exec of module object at remote 0x7ffff6bb9cd8>
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line...

Changed in python3.6 (Ubuntu):
status: New → Triaged
Changed in cloud-init (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Steve Langasek (vorlon) wrote :

This looks like a chain of decisions that are all locally reasonable but when put together may not give a sensible result. cloud-init uses jinja2 as a sensible templating engine. jinja2 has async handling, which relies on asyncio. asyncio uses the multiprocessing module, which has both thread and process models; and the process model sets up a secure (true random) key to use for IPC.

Possible options for eliminating this use of urandom() include:
 - making jinja2 configurable to avoid asyncio and changing cloud-init to select this.
 - adjusting asyncio to only lazily import concurrent.futures.
 - adjusting asyncio to import concurrent.futures.thread.ThreadPoolExecutor directly, avoiding the side-effect of importing concurrent.futures.process.ProcessPoolExecutor.
 - make multiprocessing configurable to knowingly preseed an insecure auth key, and have cloud-init initialize multiprocessing directly before importing jinja2.

It may or may not be worth making any changes to at all, if the problem only manifests with current linux-kvm kernel config and can be fixed there by enabling in-kernel entropy sources. This should be measured on other clouds to see if there's any benefit to boot speed at large.

Revision history for this message
Steve Langasek (vorlon) wrote :

With python3-dbg and python3.6-examples installed, here's a commandline for getting to the problematic getrandom() call.

# PYTHONPATH=/usr/share/doc/python3.6/examples/gdb PYTHONHASHSEED=1 gdb -ex 'break getrandom' -ex run -ex 'cont 4' -ex "python import libpython" -ex py-bt --args /usr/bin/python3-dbg /usr/bin/cloud-init init

Revision history for this message
Steve Langasek (vorlon) wrote :

So far I've tested in GCE and Azure, and both platforms show no delays at all due to lack of entropy at boot. Unless someone identifies a platform where this actually causes boot delays other than linux-kvm, which is presumed due to a missing kernel option, there's no point in changing anything in cloud-init or python.

Changed in python3.6 (Ubuntu):
importance: Undecided → Wishlist
Changed in cloud-init (Ubuntu):
importance: Undecided → Wishlist
Revision history for this message
David Britton (dpb) wrote :

Marking as wont-fix for cloud-init for now as that would be a workaround for the base problem that is not desired right now. Adding linux-kvm to evaluate why this difference exists between linux-generic and linux-kvm

Changed in cloud-init:
status: Incomplete → Won't Fix
Changed in cloud-init (Ubuntu):
status: Triaged → Won't Fix
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-kvm (Ubuntu):
status: New → Confirmed
Revision history for this message
Eric Lafontaine (elafontaine) wrote :

In Openstack Newton (linux-kvm) with Centos 7 (latest cloud-image) with cloud-init 19.3, the delay is observed. I think this is due to the version that Centos ships with for python (2.7.5).

The problem is at the init-local stage. https://github.com/canonical/cloud-init/pull/103

here is the attached cloud-init.log

Revision history for this message
Eric Lafontaine (elafontaine) wrote :

Ok, I got a question for all the interested folks out there;

What is we were to have a test case that run init-local and fails if random gets imported?

I'm looking at how this could be done, but that test could link to this bug and have people realise that they need to be careful about the random import.

Is it a good idea or a bad idea? I honestly cannot tell and I have the tendency to think it's a good idea...

Revision history for this message
Dan Watkins (oddbloke) wrote :

Eric, this bug is about the linux-kvm kernel _in Ubuntu_. If you're seeing a cloud-init issue on Centos, then please file a separate bug.

Thanks!

Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.