eth0 lost carrier / down after restart and IP change on older EC2-classic instance
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| cloud-images | Invalid | Undecided | Unassigned | |
| cloud-init | Invalid | Undecided | Unassigned | |
Bug Description
I'm experiencing a consistent issue where older EC2 instance types (e.g. c3.large) launched in EC2-Classic from the bionic AMI lose network connectivity if they're stopped and subsequently restarted.
They work fine on the first boot, but after a restart both SSH connections and EC2's status checks time out. The instances also appear to have no outbound connectivity, e.g. to the metadata service. Rebooting does not resolve the issue, nor does stopping and starting again.
On one occasion when testing, I restarted the instance very quickly and Amazon allocated it the same IP address as before; that time the instance booted with no problems. Normally, however, the instance gets a new IP address, so the issue appears related to the address change.
This is happening consistently with ami-08d658f84a6.
It does not happen if launching a newer instance type into EC2-VPC.
Steps to reproduce:
* Launch ami-08d658f84a6
* Wait for instance to boot, SSH to instance and observe all working normally. Wait for EC2 status checks to initialise and observe they pass.
* Stop instance
* Wait a minute or two - if restarted very rapidly AWS may reallocate the previous IP
* Start instance and observe it has been allocated a new IP address
* Wait a few minutes
* Attempt to SSH to the instance and observe the connection times out
* Observe that the EC2 instance reachability status check is failing
* Use the EC2 console to take an instance screenshot and observe that the console is showing the login prompt
By attaching the root volume from the broken instance to a new instance, I was able to capture and compare the syslog for the two boots. Both appear broadly similar at first, DHCP works as expected over eth0.
In both boots, systemd-networkd then reports "eth0: lost carrier".
On the successful boot, systemd-networkd almost immediately afterwards then reports "eth0: gained carrier" and "eth0: IPv6 successfully enabled". However on the failed boot these entries never appear.
Shortly afterwards cloud-init runs; on the successful boot it shows eth0 up with both IPv4 and IPv6 addresses and valid routing tables. On the failed boot it shows eth0 down, no IPv4 routing table, and an empty IPv6 routing table.
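The carrier-event difference between the two boots can be spotted mechanically when diffing the captured syslogs. A minimal sketch (the helper name and sample log lines are illustrative, not taken from the actual logs):

```python
import re

def carrier_events(syslog_text, iface="eth0"):
    """Extract systemd-networkd carrier transitions for one interface."""
    pattern = re.compile(
        r"systemd-networkd\[\d+\]: (\S+): (lost carrier|gained carrier)"
    )
    events = []
    for line in syslog_text.splitlines():
        m = pattern.search(line)
        if m and m.group(1) == iface:
            events.append(m.group(2))
    return events

# Illustrative fragments shaped like the entries described above:
good = (
    "Jan 1 00:00:01 host systemd-networkd[913]: eth0: lost carrier\n"
    "Jan 1 00:00:02 host systemd-networkd[913]: eth0: gained carrier\n"
)
bad = "Jan 1 00:00:01 host systemd-networkd[902]: eth0: lost carrier\n"

# On the successful boot the carrier comes back; on the failed boot it never does.
print(carrier_events(good))  # ['lost carrier', 'gained carrier']
print(carrier_events(bad))   # ['lost carrier']
```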
Also later on in the log from the failed boot amazon-
One thing I did notice is that the images don't appear to have been configured to disable Predictable Network Interface Names. I've tried changing that but it didn't resolve the issue. On reflection I think that's perhaps unrelated, since presumably the interface names don't change between a stop and start of the same instance on the same EC2 instance type, and the first boot works happily. Also the logs are all consistently showing eth0 rather than one of the newer interface names.
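For reference, the change I tried amounts to the usual Ubuntu method of disabling Predictable Network Interface Names via the kernel command line (a sketch of the standard approach; the file path is the Ubuntu default):

```shell
# In /etc/default/grub, add to the kernel command line:
#   net.ifnames=0 disables predictable interface names,
#   biosdevname=0 disables the biosdevname naming scheme.
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"

# Then regenerate the GRUB config and reboot:
sudo update-grub
sudo reboot
```

As noted, this made no difference here, which is consistent with the logs always showing eth0 rather than a predictable name.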
Hello Andrew, thank you for reporting this bug. I have added the cloud-init project, as the issue appears specific to that code, so they can look into it.