No network in AWS (EC-Classic) after stopping and starting instance

Bug #1802073 reported by Jani Ollikainen on 2018-11-07
This bug affects 3 people
Affects              Status        Importance  Assigned to
cloud-init (Ubuntu)  Fix Released  High        Guilherme G. Piccoli
  Xenial             Fix Released  Low         Unassigned
  Bionic             Fix Released  High        Unassigned
  Cosmic             Fix Released  High        Unassigned

Bug Description

I don't know whether this is cloud-init, netplan, or something else, but this is not good.

Background:
# lsb_release -rd
Description: Ubuntu 18.04.1 LTS
Release: 18.04
# apt-cache policy cloud-init
cloud-init:
  Installed: 18.4-0ubuntu1~18.04.1
  Candidate: 18.4-0ubuntu1~18.04.1
  Version table:
 *** 18.4-0ubuntu1~18.04.1 500
        500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     18.2-14-g6d48d265-0ubuntu1 500
        500 http://eu-west-1.ec2.archive.ubuntu.com/ubuntu bionic/main amd64 Packages

1. Get newest image to use

$ aws --region eu-west-1 ec2 describe-images --owners 099720109477 --filters Name=root-device-type,Values=ebs Name=architecture,Values=x86_64 Name=name,Values='*hvm-ssd/ubuntu-bionic-18.04*' --query 'sort_by(Images, &Name)[-1].ImageId'

"ami-08596fdd2d5b64915"

2. Launch an instance in EC2-Classic from that image.

3. Try to SSH. Everything is OK.

# cat /var/log/cloud-init-output.log
Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'init-local' at Wed, 07 Nov 2018 08:12:16 +0000. Up 10.51 seconds.
Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'init' at Wed, 07 Nov 2018 08:12:21 +0000. Up 15.50 seconds.
ci-info: +++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
ci-info: | eth0 | True | 10.74.200.25 | 255.255.255.192 | global | 22:00:0a:4a:c8:19 |
ci-info: | eth0 | True | fe80::2000:aff:fe4a:c819/64 | . | link | 22:00:0a:4a:c8:19 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +--------+------+-----------------------------+-----------------+--------+-------------------+
...
Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Wed, 07 Nov 2018 08:12:41 +0000. Up 35.63 seconds.
Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:final' at Wed, 07 Nov 2018 08:12:44 +0000. Up 38.98 seconds.
Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Wed, 07 Nov 2018 08:12:45 +0000. Datasource DataSourceEc2Local. Up 39.38 seconds

4. Stop the instance.

5. Start the instance.

6. Try to SSH.
Expected to happen: Instance has network and is working.
What happens: Instance has no working network.

The instance console log shows:
[ 11.342357] cloud-init[412]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'init-local' at Wed, 07 Nov 2018 08:21:07 +0000. Up 10.77 seconds.
[ OK ] Started Initial cloud-init job (pre-networking).
[ OK ] Reached target Network (Pre).
         Starting Network Service...
[ OK ] Started Network Service.
         Starting Network Name Resolution...
         Starting Wait for Network to be Configured...
[ OK ] Started Wait for Network to be Configured.
         Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
[ 13.036207] cloud-init[637]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'init' at Wed, 07 Nov 2018 08:21:08 +0000. Up 12.55 seconds.
[ 13.052849] cloud-init[637]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 13.100325] cloud-init[637]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.121790] cloud-init[637]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 13.129189] cloud-init[637]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.144839] cloud-init[637]: ci-info: | eth0 | False | . | . | . | 22:00:0b:0a:cb:2d |
[ OK ] Started Initial cloud-init job (metadata service crawler).
[ OK ] Reached target System Initialization.
[ 13.158694] cloud-init[637]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[ OK ] Started Daily apt download activities.
[ OK ] Started Message of the Day.
[ 13.179053] cloud-init[637]: ci-info: | lo | True | ::1/128 | . | host | . |
[ OK ] Started ACPI Events Check.
[ OK ] Reached target Paths.
[ 13.201012] cloud-init[637]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ OK ] Listening on ACPID Listen Socket.
[ OK ] Listening on Open-iSCSI iscsid Socket.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ 13.213993] cloud-init[637]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
         Starting Socket activation for snappy daemon.
[ 13.229707] cloud-init[637]: ci-info: +-------+-------------+---------+-----------+-------+
[ OK ] Started Daily apt upgrade and clean activities.
[ OK ] Listening on UUID daemon activation socket.
[ 13.244949] cloud-init[637]: ci-info: | Route | Destination | Gateway | Interface | Flags |
         Starting LXD - unix socket.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ 13.256281] cloud-init[637]: ci-info: +-------+-------------+---------+-----------+-------+
[ OK ] Started Discard unused blocks once a week.
[ OK ] Reached target Timers.
[ OK ] Reached target Cloud-config availability.
[ OK ] Reached target Network is Online.
[ 13.286424] cloud-init[637]: ci-info: +-------+-------------+---------+-----------+-------+
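
The symptom in the console log above is the "Net device info" row where eth0 has Up = False and no address. As a rough aid for spotting this in a saved console log, here is a minimal Python sketch. The function name `down_devices` and the regular expression are assumptions made for illustration; they are not part of cloud-init, and the pattern only relies on the `ci-info: | <device> | <True/False> |` row layout seen in this report.

```python
import re

# Matches a "Net device info" data row such as:
#   ci-info: | eth0 | False | . | . | . | 22:00:0b:0a:cb:2d |
# Header rows ("| Device | Up |") and route-table rows do not match,
# because the second column must be the literal True or False.
ROW_RE = re.compile(r"ci-info: \| +(\S+) +\| +(True|False) +\|")


def down_devices(console_log):
    """Return device names that cloud-init reported as not up (hypothetical helper)."""
    down = []
    for line in console_log.splitlines():
        match = ROW_RE.search(line)
        if match and match.group(2) == "False" and match.group(1) not in down:
            down.append(match.group(1))
    return down


sample = """\
[ 13.144839] cloud-init[637]: ci-info: | eth0 | False | . | . | . | 22:00:0b:0a:cb:2d |
[ 13.158694] cloud-init[637]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[ 13.179053] cloud-init[637]: ci-info: | lo | True | ::1/128 | . | host | . |
"""
print(down_devices(sample))  # ['eth0']
```

Because the pattern is searched anywhere in the line, it also tolerates console lines where systemd status messages are interleaved with the cloud-init output.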

It would be nice if instances still worked after a stop and start, as they used to.
My speculation is that the problem comes from /etc/netplan/50-cloud-init.yaml, which has:
            match:
                macaddress: 22:00:0a:66:16:17

The MAC address changes on stop and start, and things are handled in the wrong order: the file is not regenerated before the network is brought up, so there is no device with that macaddress. But without console access and internal knowledge of how netplan, cloud-init and systemd work together, it's kind of hard to pinpoint the problematic part.
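
One way to check this speculation on a mounted volume or a running instance is to compare the MAC addresses pinned in the netplan file with the MAC addresses the kernel actually reports. The Python sketch below does that under a few assumptions: the helper names (`configured_macs`, `stale_macs`, `present_macs`) are made up for this example, the file is scanned with a regular expression rather than a YAML parser, and `/sys/class/net/*/address` is assumed to be readable. The MAC values in the example are copied from this report purely as sample input.

```python
import re
from pathlib import Path

# A "macaddress:" key followed by a 17-character colon-separated MAC.
MAC_RE = re.compile(r"macaddress:\s*['\"]?([0-9A-Fa-f:]{17})['\"]?")


def configured_macs(netplan_text):
    """MAC addresses that the netplan text matches on, lower-cased."""
    return {mac.lower() for mac in MAC_RE.findall(netplan_text)}


def present_macs(sys_class_net="/sys/class/net"):
    """MAC addresses of the interfaces the kernel currently exposes."""
    return {p.read_text().strip().lower()
            for p in Path(sys_class_net).glob("*/address")}


def stale_macs(netplan_text, present):
    """Configured MACs that no present interface has."""
    return configured_macs(netplan_text) - {mac.lower() for mac in present}


netplan_text = """\
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true
            match:
                macaddress: 22:00:0a:66:16:17
            set-name: eth0
"""
# The interface came back with a different MAC after stop and start.
print(stale_macs(netplan_text, ["22:00:0b:0a:cb:2d"]))  # {'22:00:0a:66:16:17'}
```

On a live system, `stale_macs(Path("/etc/netplan/50-cloud-init.yaml").read_text(), present_macs())` would return a non-empty set when the file still pins a MAC that no interface has.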

But I did a small test. I edited /usr/lib/python3/dist-packages/cloudinit/net/netplan.py:

            if if_type == 'physical':
                # required_keys = ['name', 'mac_address']
                eth = {
                    'set-name': ifname,
                    'match': ifcfg.get('match', None),
                }
                if eth['match'] is None:
                    macaddr = ifcfg.get('mac_address', None)
                    if macaddr is not None:
                        eth['match'] = {'macaddress': macaddr.lower()}
                    else:
                        del eth['match']
                        del eth['set-name']
+               del eth['match']
+               del eth['set-name']
                _extract_addresses(ifcfg, eth, ifname)
                ethernets.update({ifname: eth})
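
The effect of this test patch can be shown in isolation. The sketch below is not cloud-init's code: `render_physical`, the `match_on_mac` switch, and the `dhcp4` key on the input dict are hypothetical stand-ins (the real code delegates address handling to `_extract_addresses`). It builds the entry conditionally instead of deleting keys afterwards, which also avoids the KeyError that two unconditional `del` statements would raise if the `else` branch had already removed the keys.

```python
def render_physical(ifname, ifcfg, match_on_mac=True):
    """Build a netplan-style ethernet entry for one physical interface.

    match_on_mac=True mirrors the unpatched behavior (pin the config to a MAC);
    match_on_mac=False mirrors the reporter's test patch (no match, no set-name).
    """
    eth = {}
    if match_on_mac:
        match = ifcfg.get("match")
        if match is None and ifcfg.get("mac_address") is not None:
            match = {"macaddress": ifcfg["mac_address"].lower()}
        if match is not None:
            eth["match"] = match
            eth["set-name"] = ifname
    # Hypothetical simplification of the address handling.
    if ifcfg.get("dhcp4"):
        eth["dhcp4"] = True
    return {ifname: eth}


cfg = {"mac_address": "22:00:0A:66:16:17", "dhcp4": True}
print(render_physical("eth0", cfg))
print(render_physical("eth0", cfg, match_on_mac=False))  # {'eth0': {'dhcp4': True}}
```

With `match_on_mac=False` the entry applies to whatever device is named eth0, regardless of its MAC address, which corresponds to the generated file shown after the patch.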

And then ran:
cloud-init clean
cloud-init init

# cat /etc/netplan/50-cloud-init.yaml

# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true

Then I stopped the instance and started it. It gets the network and works. I stopped it again just to be sure it wasn't one-time magic, started it, and it works.

So the problem really seems to be the match/macaddress block, but how to fix that properly I'll leave to the people who made it misbehave like this.

But I think there may be some pretty scared and annoyed people who find their instance unreachable after stopping and starting it. Recovery also depends on their skills to troubleshoot the problem, mount the volume on another instance, and fix it (if it's EBS-backed; if not, sorry, make a new instance).

Jani Ollikainen (bestis) wrote :

A little more testing. It really seems that cloud-init does not generate the file again unless cloud-init clean has been run first. So the file is not regenerated after a stop and start, the macaddress differs, and there is no network.

Let's test it by editing the file by hand:

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
# EDITED
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true

$ sudo cloud-init init

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
# EDITED
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true

$ sudo cloud-init clean

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
# EDITED
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true

$ sudo cloud-init init

$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by
# the datasource. Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true

So clean followed by init is needed for the file to be generated again. A normal boot probably does just the init, so the file keeps the wrong macaddress and there is no network for you.
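
The behavior observed in this test can be summarized with a toy model. This is not how cloud-init is implemented; the class `NetworkConfigRenderer`, its attributes, and the `render_every_boot` flag are all hypothetical names invented for illustration. The model only assumes what the test shows (init alone keeps the existing file, clean followed by init rewrites it) plus the fact that an EC2 instance keeps its instance ID across stop and start.

```python
class NetworkConfigRenderer:
    """Toy model: write the network config once per instance unless told otherwise."""

    def __init__(self, render_every_boot=False):
        self.render_every_boot = render_every_boot
        self.rendered_for = None  # instance ID the file was last written for
        self.written_mac = None   # MAC address pinned in the written file

    def clean(self):
        """Forget that the config was already written (like 'cloud-init clean')."""
        self.rendered_for = None

    def init(self, instance_id, current_mac):
        """Return True if the file was (re)written on this run."""
        if self.render_every_boot or self.rendered_for != instance_id:
            self.written_mac = current_mac
            self.rendered_for = instance_id
            return True
        return False


renderer = NetworkConfigRenderer()
renderer.init("i-example", "22:00:0a:4a:c8:19")          # first boot: file written
print(renderer.init("i-example", "22:00:0b:0a:cb:2d"))   # False: stale MAC is kept
renderer.clean()
print(renderer.init("i-example", "22:00:0b:0a:cb:2d"))   # True: file regenerated
```

In this model, constructing the renderer with `render_every_boot=True` makes every `init` rewrite the file, so the pinned MAC always follows the current interface.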

Jani Ollikainen (bestis) on 2018-11-19
summary: - No network in AWS (EC-Classic) after stopping and starging instance
+ No network in AWS (EC-Classic) after stopping and starting instance
Marcos Martinez (frommelmak) wrote :

same problem here!

[ 12.935611] cloud-init[640]: Cloud-init v. 18.3-9-g2e62cb8a-0ubuntu1~18.04.2 running 'init' at Mon, 21 Jan 2019 07:38:47 +0000. Up 12.29 seconds.
[ 12.958782] cloud-init[640]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[ 12.965639] cloud-init[640]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 12.980218] cloud-init[640]: ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
[ 12.992808] cloud-init[640]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[ 13.015538] cloud-init[640]: ci-info: | eth0 | False | . | . | . | 22:00:0b:2c:9a:2b |
[ 13.022671] cloud-init[640]: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
[ 13.036164] cloud-init[640]: ci-info: | lo | True | ::1/128 | . | host | . |
[ 13.048743] cloud-init[640]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+

Guilherme G. Piccoli (gpiccoli) wrote :

Thank you very much, Jani Ollikainen, for the great bug report, with instructions on how to reproduce. Even more, you basically investigated the issue and figured it out yourself.

I'll discuss internally whether the best option is to have cloud-init generate the netplan file on every boot (which I think makes sense) or to drop that "mac match" block.

Cheers,

Guilherme

Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
tags: added: sts
Jani Ollikainen (bestis) wrote :

Hi,

Yes, regenerating on every boot is probably better, but without understanding the internals it was easier for me to patch manually and remove those keys from the config than to figure out how to get the file regenerated on every boot.

I wanted to manually patch a couple of instances before putting them into use.

Now I just won't update them until this is fixed :)

This bug is fixed with commit 0bb4c74e to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=0bb4c74e

Chad Smith (chad.smith) on 2019-02-21
Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in cloud-init (Ubuntu Bionic):
status: New → Confirmed
Changed in cloud-init (Ubuntu Cosmic):
status: New → Confirmed
Changed in cloud-init (Ubuntu Xenial):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 18.5-45-g3554ffe8-0ubuntu1

---------------
cloud-init (18.5-45-g3554ffe8-0ubuntu1) disco; urgency=medium

  * New upstream snapshot.
    - cloud-init-per: POSIX sh does not support string subst, use sed
      (LP: #1819222)

 -- Daniel Watkins <email address hidden> Fri, 08 Mar 2019 17:42:34 -0500

Changed in cloud-init (Ubuntu):
status: Fix Committed → Fix Released

SRU release status LP: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1819067

Thanks Odd_Bloke, for the pointer!

Changed in cloud-init (Ubuntu Bionic):
importance: Undecided → High
Changed in cloud-init (Ubuntu Xenial):
importance: Undecided → Low
Changed in cloud-init (Ubuntu Cosmic):
importance: Undecided → High

The problem is resolved in Bionic's latest version of cloud-init, released yesterday:

$ dpkg -l | grep cloud-init
ii cloud-init 18.5-45-g3554ffe8-0ubuntu1~18.04.1

I manually upgraded the package after bringing up my EC2-Classic instance,
so note that the AWS image doesn't have the latest cloud-init version yet.

Thanks,

Guilherme

Changed in cloud-init (Ubuntu Bionic):
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Xenial):
status: Confirmed → Fix Released
Changed in cloud-init (Ubuntu Cosmic):
status: Confirmed → Fix Released
Thorsten Meinl (sithmein) wrote :

The same problem also happens in EC2-VPC mode if the network interface changes. This happens if you create a custom AMI, e.g. using Packer, from the official Ubuntu 18.04 LTS AMI. The first boot under Packer writes the MAC address into /etc/netplan/50-cloud-init.yaml, which becomes part of the custom AMI. If you then launch an instance from that AMI, it gets a new network interface with a new MAC address and the instance has no network connectivity.
I tried calling "cloud-init clean" at the end of the Packer build, but that didn't have any positive effect.
It seems the current approach is kind of broken in general.

Hi Thorsten, thanks for your report. I discussed it with the cloud-init team, and we're not clear about the steps you're taking to observe the issue.
It's an interesting problem, so we'd like to ask you to report a new bug against cloud-init, with the following information:

1) The most detailed steps you can provide so that we can reproduce the issue, especially detailed information about the changes/configs you're making to NICs.

2) A tarball created by running the command "cloud-init collect-logs" after you have reproduced the issue.

Please comment here with the number of the new LP bug; we'll follow up from there.
Thanks in advance,

Guilherme
