cloud-images

eth0 lost carrier / down after restart and IP change on older EC2-classic instance

Bug #1817035 reported by Andrew Coulton on 2019-02-21

This bug report is a duplicate of: Bug #1802073: No network in AWS (EC-Classic) after stopping and starting instance.

This bug affects 1 person

Affects       Status   Importance  Assigned to  Milestone
cloud-images  Invalid  Undecided   Unassigned
cloud-init    Invalid  Undecided   Unassigned

Bug Description

I'm experiencing a consistent issue where older EC2 instance types (e.g. c3.large) launched in EC2-Classic from the bionic AMI lose network connection if they're stopped and subsequently restarted.

They work fine on the first boot, but after a stop and start both SSH connections and EC2's status checks time out. They also appear to have no outbound connectivity, e.g. to the metadata service. Rebooting does not resolve the issue, nor does stopping and starting again.

On one occasion when testing, I restarted the instance very quickly and Amazon allocated it the same IP address as before; the instance booted with no problems. Normally, however, the instance gets a new IP address, so the failure may be related to the IP change.

This is happening consistently with ami-08d658f84a6d84a80 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190212.1) and I've also reproduced with ami-0c21eb76a5574aa2f (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190210)

It does not happen if launching a newer instance type into EC2-VPC.

Steps to reproduce:

* Launch ami-08d658f84a6d84a80 on a c3.large in EC2-Classic, with a security group allowing port 22 from anywhere and all other configuration left at the AWS defaults
* Wait for instance to boot, SSH to instance and observe all working normally. Wait for EC2 status checks to initialise and observe they pass.
* Stop instance
* Wait a minute or two - if restarted very rapidly AWS may reallocate the previous IP
* Start instance and observe it has been allocated a new IP address
* Wait a few minutes
* Attempt to SSH to the instance and observe the connection times out
* Observe that the EC2 instance reachability status check is failing
* Use the EC2 console to take an instance screenshot and observe that the console is showing the login prompt
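
The stop/start part of the steps above could be scripted roughly as follows. This is only a sketch, not something I ran for this report: it assumes `ec2` is a boto3 EC2 client (or anything with the same methods), the helper name `stop_start_and_compare_ip` and the 120 second pause are made up for illustration, and the SSH and status-check observations still have to be done separately.

```python
import time


def stop_start_and_compare_ip(ec2, instance_id, pause_seconds=120):
    """Stop then start an instance and return (old_ip, new_ip).

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    def public_ip():
        resp = ec2.describe_instances(InstanceIds=[instance_id])
        # PublicIpAddress is absent while the instance has no public IP.
        return resp["Reservations"][0]["Instances"][0].get("PublicIpAddress")

    old_ip = public_ip()
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    # Pause so AWS is less likely to hand back the previous address.
    time.sleep(pause_seconds)
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return old_ip, public_ip()
```

If the two returned addresses differ, the instance is in the state where the failure was observed; if they match, the run is not a valid reproduction attempt.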

By attaching the root volume from the broken instance to a new instance, I was able to capture and compare the syslog for the two boots. Both appear broadly similar at first: DHCP works as expected over eth0.

In both boots, systemd-networkd then reports "eth0: lost carrier".

On the successful boot, systemd-networkd reports "eth0: gained carrier" and "eth0: IPv6 successfully enabled" almost immediately afterwards. On the failed boot, these entries never appear.
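
For anyone wanting to make the same comparison on their own logs, here is a rough sketch of how the carrier messages could be pulled out of a syslog. It assumes the usual syslog layout of `systemd-networkd[PID]: eth0: ...` and matches the message text case-insensitively; the function name `final_carrier_state` is made up for illustration.

```python
import re

# Assumed line layout: "... systemd-networkd[PID]: <iface>: Lost carrier"
CARRIER_RE = re.compile(
    r"systemd-networkd\[\d+\]:\s+(\S+): (lost|gained) carrier",
    re.IGNORECASE,
)


def final_carrier_state(lines, interface="eth0"):
    """Return "lost", "gained", or None for the last carrier event seen."""
    state = None
    for line in lines:
        match = CARRIER_RE.search(line)
        if match and match.group(1) == interface:
            state = match.group(2).lower()
    return state
```

Feeding it the lines of each boot's syslog (for example `final_carrier_state(open(path))`) should give "gained" for a boot like the successful one and "lost" for one like the failed one described here.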

Shortly afterwards cloud-init runs. On the successful boot it shows eth0 up with both IPv4 and IPv6 addresses and valid routing tables. On the failed boot it shows eth0 down, no IPv4 routing table and an empty IPv6 routing table.

Later in the log from the failed boot, amazon-ssm-agent.amazon-ssm-agent also reports that it cannot contact the metadata service (dial tcp 169.254.169.254:80: connect: network is unreachable).
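
The same check the agent is failing on can be approximated from a script on the instance (e.g. run at boot and logged). This is a minimal sketch that only tests whether a TCP connection to the metadata address can be opened; the helper name `can_connect` and the 2 second timeout are arbitrary choices for illustration.

```python
import socket


def can_connect(host="169.254.169.254", port=80, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "network is unreachable", refused connections and timeouts.
        return False
```

On a boot in the broken state this would be expected to return False, since the connect call fails with the same "network is unreachable" error.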

One thing I did notice is that the images don't appear to have been configured to disable Predictable Network Interface Names. I've tried changing that but it didn't resolve the issue. On reflection I think that's perhaps unrelated, since presumably the interface names don't change between a stop and start of the same instance on the same EC2 instance type, and the first boot works happily. Also the logs are all consistently showing eth0 rather than one of the newer interface names.