cloud-init fails when rebooting EC2 i3.metal instances

Bug #1841182 reported by Paride Legovini on 2019-08-23

This bug report will be marked for expiration in 59 days if no further activity occurs. (find out why)

8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
High
Unassigned

Bug Description

In order to collect boot-speed metrics I deploy/reboot/terminate several EC2 instances per day. At a high level this is that the jobs do:

1. Deploy an instance and wait for cloud-init
   to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance

This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.

By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/cloud/data/set-hostname being empty:

2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/cloud/data/set-hostname
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 361, in main_init
    _maybe_set_hostname(init, stage='local', retry_stage='network')
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 709, in _maybe_set_hostname
    cc_set_hostname.handle('set-hostname', init.cfg, cloud, LOG, None)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_set_hostname.py", line 67, in handle
    prev_hostname = util.load_json(util.load_file(prev_fn))
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1586, in load_json
    decoded = json.loads(decode_binary(text))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?

The full logs are attached.

Paride Legovini (legovini) wrote :
Ryan Harper (raharper) wrote :

Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot.

Changed in cloud-init:
importance: Undecided → High
status: New → Incomplete

Ryan Harper wrote on 28/08/2019:
> Thanks for submitting this. Is it possible to collect-logs from the
> initial boot of the instance before the reboot? In particular I'd like
> to see a cloud-init collect-logs and a tar of /var/lib/cloud before the
> reboot.

Here are the first-boot logs from an instance that when rebooted hit the
problem. The post-reboot-console-log-fail.log file is the console log
collected after the reboot and that shows the cloud-init error message.
It has been collected using `aws ec2 get-console-output`.

Paride

Changed in cloud-init:
status: Incomplete → New
Dan Watkins (daniel-thewatkins) wrote :

OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after the reboot, which is what causes the issue. We should handle that case gracefully, assuming that it isn't cloud-init that's causing it to be removed.

To help confirm what's going on, are you able to capture /var/lib/cloud and logs post-failure (perhaps by attaching the root disk elsewhere)?

Changed in cloud-init:
status: New → Incomplete
Paride Legovini (legovini) wrote :

Dan Watkins wrote on 03/09/2019:
> OK, it looks like /var/lib/cloud/data/set-hostname ends up empty after
> the reboot, which is what causes the issue. We should handle that case
> gracefully, assuming that it isn't cloud-init that's causing it to be
> removed.
>
> To help confirm what's going on, are you able to capture /var/lib/cloud
> and logs post-failure (perhaps by attaching the root disk elsewhere)?

Here are the first-boot and post-reboot logs collected from an instance
that encountered the problem.

Changed in cloud-init:
status: Incomplete → New
Dan Watkins (daniel-thewatkins) wrote :

OK, to clarify my previous comment, the issue is not that /var/lib/cloud/data/set-hostname is _missing_, the issue is specifically that it is present but empty. Unfortunately, we don't have any logging from the code path used to write set-hostname (it uses cloudinit.atomic_helper.write_json which uses atomic_helper.write_file), so we can't be 100% sure that we aren't incorrectly writing out this empty file.

So, I think this breaks down into three things:

1) We should improve the logging situation there (just filed bug 1843276)
2) We should work out why the file ends up empty in this case
3) We should gracefully treat an empty file as an absent file (probably with a WARNING?), because I think that's the best we can do once we find ourselves in that situation

(2) and (3) should _probably_ be separate bugs, but without (1) it's not 100% clear that that's the case so I haven't filed anything separate yet.

Dan Watkins (daniel-thewatkins) wrote :

Paride, are you able to test this again using the cloud-init from the daily PPA[0]? That has some additional logging added, which should make working out (2) easier.

[0] https://launchpad.net/~cloud-init-dev/+archive/ubuntu/daily

Changed in cloud-init:
status: New → Incomplete
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers