cloud-init fails when rebooting EC2 i3.metal instances
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| cloud-init |
High
|
Unassigned |
Bug Description
In order to collect boot-speed metrics I deploy/
1. Deploy an instance and wait for cloud-init
to finish using `cloud-init status --wait`
2. Collect and retrieve some logs via SSH/SFTP
3. Reboot the instance using boto3's reboot()
4. Collect some more logs
5. Terminate the instance
This works in a fairly reliable way, but on i3.metal instances the instance often fails to survive the reboot step. After a failed reboot the instance state appears as "running", but it's unreachable via SSH.
By detaching the root volume and attaching it to another instance in the same availability zone I've been able to inspect the logs, and problem is a cloud-init failure. At a first glance of the logs it looks like cloud-init doesn't like /var/lib/
2019-08-23 11:31:27,585 - util.py[DEBUG]: Reading from /var/lib/
2019-08-23 11:31:27,585 - util.py[DEBUG]: Read 0 bytes from /var/lib/
2019-08-23 11:31:27,585 - util.py[WARNING]: failed stage init-local
2019-08-23 11:31:27,586 - util.py[DEBUG]: failed stage init-local
Traceback (most recent call last):
File "/usr/lib/
ret = functor(name, args)
File "/usr/lib/
_maybe_
File "/usr/lib/
cc_
File "/usr/lib/
prev_hostname = util.load_
File "/usr/lib/
decoded = json.loads(
File "/usr/lib/
return _default_
File "/usr/lib/
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/
raise JSONDecodeError
json.decoder.
I'm not sure on where the actual problem is here. Is set-hostname supposed to always contain something? Should cloud-init be able to handle an empty set-hostname? Could the fact that the instance is rebooted shortly after being deployed affect this?
The full logs are attached.
Paride Legovini (paride) wrote : | #1 |
Ryan Harper (raharper) wrote : | #2 |
Changed in cloud-init: | |
importance: | Undecided → High |
status: | New → Incomplete |
Paride Legovini (paride) wrote : Re: [Bug 1841182] Re: cloud-init fails when rebooting EC2 i3.metal instances | #3 |
Ryan Harper wrote on 28/08/2019:
> Thanks for submitting this. Is it possible to collect-logs from the
> initial boot of the instance before the reboot? In particular I'd like
> to see a cloud-init collect-logs and a tar of /var/lib/cloud before the
> reboot.
Here are the first-boot logs from an instance that when rebooted hit the
problem. The post-reboot-
collected after the reboot and that shows the cloud-init error message.
It has been collected using `aws ec2 get-console-
Paride
Changed in cloud-init: | |
status: | Incomplete → New |
Dan Watkins (oddbloke) wrote : | #4 |
OK, it looks like /var/lib/
To help confirm what's going on, are you able to capture /var/lib/cloud and logs post-failure (perhaps by attaching the root disk elsewhere)?
Changed in cloud-init: | |
status: | New → Incomplete |
Paride Legovini (paride) wrote : | #5 |
Dan Watkins wrote on 03/09/2019:
> OK, it looks like /var/lib/
> the reboot, which is what causes the issue. We should handle that case
> gracefully, assuming that it isn't cloud-init that's causing it to be
> removed.
>
> To help confirm what's going on, are you able to capture /var/lib/cloud
> and logs post-failure (perhaps by attaching the root disk elsewhere)?
Here are the first-boot and post-reboot logs collected from an instance
that encountered the problem.
Changed in cloud-init: | |
status: | Incomplete → New |
Dan Watkins (oddbloke) wrote : | #6 |
OK, to clarify my previous comment, the issue is not that /var/lib/
So, I think this breaks down into three things:
1) We should improve the logging situation there (just filed bug 1843276)
2) We should work out why the file ends up empty in this case
3) We should gracefully treat an empty file as an absent file (probably with a WARNING?), because I think that's the best we can do once we find ourselves in that situation
(2) and (3) should _probably_ be separate bugs, but without (1) it's not 100% clear that that's the case so I haven't filed anything separate yet.
Dan Watkins (oddbloke) wrote : | #7 |
Paride, are you able to test this again using the cloud-init from the daily PPA[0]? That has some additional logging added, which should make working out (2) easier.
[0] https:/
Changed in cloud-init: | |
status: | New → Incomplete |
Launchpad Janitor (janitor) wrote : | #8 |
[Expired for cloud-init because there has been no activity for 60 days.]
Changed in cloud-init: | |
status: | Incomplete → Expired |
Changed in cloud-init: | |
status: | Expired → Incomplete |
Changed in cloud-init: | |
status: | Incomplete → Expired |
Paride Legovini (paride) wrote : | #10 |
This didn't happen again in quite some time, not sure what fixed it. I'll reopen this bug report if I encounter the issue again.
Thanks for submitting this. Is it possible to collect-logs from the initial boot of the instance before the reboot? In particular I'd like to see a cloud-init collect-logs and a tar of /var/lib/cloud before the reboot.