eth0 lost carrier / down after restart and IP change on older EC2-classic instance

Bug #1817035 reported by Andrew Coulton
Affects       Status   Importance  Assigned to  Milestone
cloud-images  Invalid  Undecided   Unassigned
cloud-init    Invalid  Undecided   Unassigned

Bug Description

I'm experiencing a consistent issue where older EC2 instance types (e.g. c3.large) launched in EC2-Classic from the bionic AMI lose network connection if they're stopped and subsequently restarted.

They work fine on the first boot, but when restarted they time out both for things like SSH and for EC2's status checks. They also appear to have no outbound connectivity, e.g. to the metadata service. Rebooting does not resolve the issue, nor does stopping and starting again.

On one occasion when testing, I resumed the instance very quickly and Amazon allocated it the same IP address as before - the instance booted with no problems. Normally however the instance gets a new IP address - so it appears this may be related.

This is happening consistently with ami-08d658f84a6d84a80 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190212.1) and I've also reproduced with ami-0c21eb76a5574aa2f (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190210)

It does not happen if launching a newer instance type into EC2-VPC.

Steps to reproduce:

* Launch ami-08d658f84a6d84a80 on a c3.large in EC2-Classic, with a security group allowing port 22 from anywhere and all other configuration left at the AWS defaults
* Wait for instance to boot, SSH to instance and observe all working normally. Wait for EC2 status checks to initialise and observe they pass.
* Stop instance
* Wait a minute or two - if restarted very rapidly AWS may reallocate the previous IP
* Start instance and observe it has been allocated a new IP address
* Wait a few minutes
* Attempt to SSH to the instance and observe the connection times out
* Observe that the EC2 instance reachability status check is failing
* Use the EC2 console to take an instance screenshot and observe that the console is showing the login prompt
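
The SSH check in the steps above can be automated with a small probe. This is a minimal sketch using only the Python standard library, not part of the original report. The function name `tcp_port_open` and its default timeout are illustrative assumptions. It only tests whether a TCP connection to port 22 can be opened and does not attempt an SSH login.

```python
import socket


def tcp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration: a refused connection, a timeout,
    or an unreachable network all count as "not open".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.timeout and ConnectionRefusedError are both OSError subclasses
        return False
```

Run against the instance's new public IP after the stop and start, a result of `False` would match the SSH timeout described above.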

By attaching the root volume from the broken instance to a new instance, I was able to capture and compare the syslog for the two boots. Both appear broadly similar at first: DHCP works as expected over eth0.

In both boots, systemd-networkd then reports "eth0: lost carrier".

On the successful boot, systemd-networkd reports "eth0: gained carrier" and "eth0: IPv6 successfully enabled" almost immediately afterwards. On the failed boot, however, these entries never appear.
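
The comparison between the two boots can be reduced to finding the last carrier message for the interface. This is a minimal sketch, not taken from the report. The function name `carrier_state` is hypothetical, and the message texts are assumed to match the ones quoted above apart from letter case, so matching is case-insensitive.

```python
def carrier_state(lines, iface="eth0"):
    """Return 'up', 'down', or 'unknown' from the last carrier message seen.

    `lines` is any iterable of log lines (for example an open syslog file).
    The matched phrases are assumptions based on the messages quoted in
    this report.
    """
    lost = f"{iface}: lost carrier".lower()
    gained = f"{iface}: gained carrier".lower()
    state = "unknown"
    for line in lines:
        lower = line.lower()
        if lost in lower:
            state = "down"
        elif gained in lower:
            state = "up"
    return state
```

Applied to the syslog from each boot, the successful boot should end in `'up'` and the failed boot in `'down'`, given the behaviour described above.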

Shortly afterwards cloud-init runs, and on the successful boot it shows eth0 up with both IPv4 and IPv6 addresses and valid routing tables. On the failed boot it shows eth0 down, no IPv4 routing table and an empty IPv6 routing table.

Also later on in the log from the failed boot amazon-ssm-agent.amazon-ssm-agent reports that it cannot contact the metadata service (dial tcp 169.254.169.254:80: connect: network is unreachable).
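
The metadata service failure can be checked from inside an instance with a short probe. This is a sketch under stated assumptions, not something from the report. The function name `metadata_reachable` and the `/latest/meta-data/` path are illustrative choices. Any HTTP response, including an error status, is treated as proof that the network path works.

```python
import urllib.error
import urllib.request


def metadata_reachable(url="http://169.254.169.254/latest/meta-data/", timeout=2.0):
    """Return True if any HTTP response comes back from the metadata URL.

    Hypothetical helper: proxies are bypassed on purpose, since the
    link-local metadata address should be contacted directly.
    """
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means the service answered.
        return True
    except OSError:
        # Covers URLError, timeouts, and "network is unreachable".
        return False
```

On a boot in the failed state described above, this would be expected to return `False`, matching the "network is unreachable" error from amazon-ssm-agent.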

One thing I did notice is that the images don't appear to have been configured to disable Predictable Network Interface Names. I've tried changing that but it didn't resolve the issue. On reflection I think that's perhaps unrelated, since presumably the interface names don't change between a stop and start of the same instance on the same EC2 instance type, and the first boot works happily. Also the logs are all consistently showing eth0 rather than one of the newer interface names.

Robert C Jennings (rcj) wrote :

Hello Andrew, thank you for reporting this bug. I have added the cloud-init project as the issue is specific to that code and they can look into your issue.

Chad Smith (chad.smith) wrote :

Thank you very much for filing this bug and making Ubuntu and cloud-init better.

We have recently landed a fix for this issue in tip of cloud-init per bug: #1802073. Our plan is to update Xenial, Bionic and Cosmic in our upcoming Stable Release Update.

I have added series-specific tasks to #1802073 which we will close as each series is publicly available.

Chad Smith (chad.smith) wrote :

Marking this cloud-init task as Invalid in favor of tracking our SRU to each Ubuntu series in LP: #1802073.

Changed in cloud-init:
status: New → Invalid
Andrew Coulton (acoulton) wrote :

Thanks very much Robert and Chad. Sorry - I had googled and searched the cloud-images tracker for an existing issue but it didn't occur to me to look at cloud-init. Great news that a fix is on the way.

Robert C Jennings (rcj) wrote :

I am going to close out the cloud-images side of this bug as well. The daily image for a release will contain the fix as soon as it is released.

Changed in cloud-images:
status: New → Invalid
Dan Watkins (oddbloke) wrote :

As both bug tasks are Invalid now, we can mark this as a dupe without losing any info.
