cloud-init doesn't retry metadata lookups and hangs forever if metadata is down

Bug #1667735 reported by Christopher Horrell on 2017-02-24
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
cloud-init
Medium
Mike Gerdts
cloud-init (Ubuntu)
Medium
Unassigned
Precise
Medium
Unassigned
Trusty
Medium
Unassigned

Bug Description

If a host SmartOS server is rebooted and the metadata service is not available, a KVM VM instance that use cloud-init (via the SmartOS datasource) will fail to start.

If the metadata agent on the host server is not available the python code for cloud-init gets blocked forever waiting for data it will never receive. This causes the boot process for an instance to hang on cloud-init.

This is problematic if there happens to be some reason the metadata agent is not available for any reason while a SmartOS KVM VM that relies on cloud-init is booting.

From the engineer that worked on this (not the svadm command is run on the host SmartOS server):

You can reproduce this by disabling the metadata service SmartOS host:

svcadm disable metadata

and then boot a KVM VM running an Ubuntu Certified Cloud image such as:

c864f104-624c-43d2-835e-b49a39709b6b (ubuntu-certified-14.04 20150225.2)

when you do this, the VM's boot process will hang at cloud-init. If you then start the metadata service, cloud-init will not recover.

On of our engineers who looked at this was able to cause forward progress by applying this patch:

--- /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py.ori 2017-02-23 01:28:28.405885775 +0000
+++ /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py 2017-02-23 01:35:51.281885775 +0000
@@ -286,7 +286,7 @@
     if not seed_device:
         raise AttributeError("seed_device value is not set")

- ser = serial.Serial(seed_device, timeout=seed_timeout)
+ ser = serial.Serial(seed_device, timeout=10)
     if not ser.isOpen():
         raise SystemError("Unable to open %s" % seed_device)

which causes the following strace output:

[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET user-script\n", 16) = 16
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0
[pid 2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 2119] write(5, "GET iptables_disable\n", 21) = 21
[pid 2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid 2119] close(5) = 0

instead of:

[pid 1977] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid 1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid 1977] write(5, "GET base64_keys\n", 16) = 16
[pid 1977] select(6, [5], [], [], NULL

which you get without the patch (notice the NULL for the timeout parameter). The code that gets blocked in this version of cloud-init is:

    ser.write("GET %s\n" % noun.rstrip())
    status = str(ser.readline()).rstrip()

in cloudinit/sources/DataSourceSmartOS.py. The ser.readline() documentation says

(https://pyserial.readthedocs.io/en/latest/shortintro.html#readline):

Be careful when using readline(). Do specify a timeout when opening the serial port otherwise it could block forever if no newline character is received. Also note that readlines() only works with a timeout. readlines() depends on having a timeout and interprets that as EOF (end of file). It raises an exception if the port is not opened correctly.

which is exactly the situation we've hit here.

It might be better to have a timeout but when the timeout is hit, the GET should be retried if there's no answer rather than moving on to the next key. A negative answer (NOTFOUND for example) should not be retried, only when there's no answer (because metadata is unavailable).

Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time.

Related branches

Scott Moser (smoser) on 2017-02-24
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Changed in cloud-init:
status: New → Confirmed
importance: Undecided → Medium
Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Precise):
status: New → Confirmed
Changed in cloud-init (Ubuntu Trusty):
status: New → Confirmed
Changed in cloud-init (Ubuntu Precise):
importance: Undecided → Medium
Changed in cloud-init (Ubuntu Trusty):
importance: Undecided → Medium
Scott Moser (smoser) wrote :

"Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time."

That doesn't really make sense. And re-trying timing out in 10 seconds and re-trying wouldn't really change anything.

Either you block the boot waiting for the other end of the read() or you don't. "have the boot process complete at that time" is I think what we have in precise and trusty right now.

boot will hang until the read() came back. I'm not sure if upstart boot will actually wait forever on cloud-init-local to return , it may well do that. I suspect that systemd will kill cloud-init after like 90 seconds.

Its very arguable that the *right* thing to do is wait forever on the metadata service. Cloud-init only knows what it should do next based on information from the metadata service. If this is the first boot of an instance, then some things will happen, if it is a reboot, then things such as ssh key generation or user creation will not occur.

So this is *kind of* working by design. Sure we can change the setting to give up, but that will just result in cloud-init not finding a datasource. If that was a instance's first boot, then the user wouldn't be able to access the system. If it was any other boot, then things would be fine, but cloud-init has no way of knowing that this is "second boot" other than by the metadata service.

Josh Wilsdon (jwilsdon) wrote :

"That doesn't really make sense. And re-trying timing out in 10 seconds and re-trying wouldn't really change anything."

Unfortunately that's not the case for cloud-init in 14.04 (retrying there would fix the problem but miss keys) and only correct for cloud-init in 16.10 because that version has other problems.

With the Ubuntu 14.04 image's version of cloud-init (0.7.5-0ubuntu1.3) the read() is blocking with no timeout. If there was a timeout and if the request was correctly retried every 10 seconds, eventually things would proceed once the metadata service became available. As it stands now, these instances need to be externally rebooted when they become hung as they will never recover on their own.

I've just tested 16.10 as well with cloud-init 0.7.8-68-gca3ae67-0ubuntu1~16.10.1. With that version the code has changed significantly, but the problem is the same. Even though the timeout has been added in the new version, its usefulness has been negated since the new code gets itself stuck in an infinite loop when we hit this case. You can also see some more details in:

https://smartos.org/bugview/IMAGE-1014

In my testing on 16.10, systemd did not kill cloud-init in the 30 minutes I waited.

"Its very arguable that the *right* thing to do is wait forever on the metadata service."

I agree with this. However, that's not what cloud-init is doing. In part because cloud-init's implementation of the metadata specification (https://eng.joyent.com/mdata/protocol.html) is incomplete. In particular:

 * It uses V2 without doing NEGOTIATE
 * It uses the KVM serial port without reading all the data from the buffer before writing
 * It does not write '\n' and wait for 'invalid command\n'
 * When a read() times out, it tries a read() again instead of starting over

What happens with cloud-init in 16.10 is that if metadata is unavailable when the instance boots, cloud-init will:

 1) write data into the socket (nobody's listening)
 2) do a select() on the socket looking for readable data (and timeout after 10 seconds)
 3) goto 2

the loop between 2 and 3 becomes infinite because even if metadata is enabled at this point, cloud-init never attempts to send any commands to it.

If instead it were to:

 1) open the socket
 2) read on the socket (with a timeout) and discard any data
 3) write '\n'
 4) read on the socket for 'invalid command\n' (with a timeout, on timeout close socket and go to 1 )
 5) NEGOTIATE V2

before making any queries, it would be able to recover when the metadata service became available if it is unavailable initially. If you disable the metadata service and run mdata-get under strace, and then enable metadata, you'll see that that's how it is able to recover in this case.

So in summary: I think we're in agreement that cloud-init should wait forever for the metadata service, but both 14.04 and 16.10 have different but related problems in their implementation which prevent cloud-init from ever actually knowing when metadata has become available. The consequence of this is that in both versions if metadata is unavailable when cloud-init is first run, the VMs will hang until rebooted externally.

Scott Moser (smoser) wrote :

Thanks for clarifying.
Note there is also a socket client for your container service there. we'd ideally fix that too.

Scott Moser (smoser) on 2017-03-02
Changed in cloud-init (Ubuntu):
status: Fix Released → Confirmed
Scott Moser (smoser) on 2018-02-05
Changed in cloud-init (Ubuntu Precise):
status: Confirmed → Won't Fix
Mike Gerdts (mgerdts) on 2018-02-07
Changed in cloud-init:
assignee: nobody → Mike Gerdts (mgerdts)
Scott Moser (smoser) wrote :

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4c573d0e

Changed in cloud-init:
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 18.2-14-g6d48d265-0ubuntu1

---------------
cloud-init (18.2-14-g6d48d265-0ubuntu1) bionic; urgency=medium

  * New upstream snapshot.
    - net: Depend on iproute2's ip instead of net-tools ifconfig or route
    - DataSourceSmartOS: fix hang when metadata service is down
      [Mike Gerdts] (LP: #1667735)
    - DataSourceSmartOS: change default fs on ephemeral disk from ext3 to
      ext4. [Mike Gerdts] (LP: #1763511)
    - pycodestyle: Fix invalid escape sequences in string literals.
    - Implement bash completion script for cloud-init command line

 -- Chad Smith <email address hidden> Wed, 18 Apr 2018 15:25:53 -0600

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released

This bug is believed to be fixed in cloud-init in version 18.3. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers