cloud-init persists in running state on Kali in AWS

Bug #1869430 reported by ananke
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Invalid
Medium
Unassigned

Bug Description

Hello,

We're trying to customize published Kali AMIs using packer & cloud-init. The entire process works with Ubuntu, CentOS, and Amazon Linux 2 targets, but seemingly breaks with Kali. We've tried it with both the 2020.01 and 2019.03.

We're also experiencing a long timeout for ec2 data source:

root@kali:~# cloud-init status --long
status: running
time: Fri, 27 Mar 2020 20:06:54 +0000
detail:
DataSourceEc2Local

root@kali:~# cloud-init analyze blame
-- Boot Record 01 --
     51.20500s (init-local/search-Ec2Local)
     00.91700s (init-network/config-users-groups)
     00.67200s (init-network/config-growpart)
     00.27400s (init-network/config-resizefs)
     00.24800s (init-network/config-ssh)
     00.00600s (init-network/consume-user-data)
     00.00300s (init-network/check-cache)

Attached is the log tarball produced by cloud-init. We'd appreciate any hints as to what may be happening. It's worth noting that these targets are starting in a VPC without direct connection to the outside world, but there's a squid proxy available for web traffic. We have relevant parts set up to use that proxy.

Thanks!

Revision history for this message
ananke (ananke) wrote :
Revision history for this message
Ryan Harper (raharper) wrote :

Thanks for filing the bug.

Looking into the logs cloud-init-local.service and cloud-init.service run correctly and complete. cloud-config.service blocks until network-online.target is reached in systemd.

Typically this is after either 'networking.service' 'systemd-networkd-wait-online.service' or 'network-manager-wait-online.service' have completed.

the systemd journal logs the following output if network-online.target is reached

Mar 26 18:32:15 f4 systemd[1]: Reached target Network is Online.
Mar 26 18:32:15 f4 systemd[1]: Reached target System Initialization.

These are not present in the journal.

Do you have the AMI you tried? It's likely there are some changes needed to the service units in the image to ensure that network-online.target is reached. Then cloud-config modules/final can run to completion.

Changed in cloud-init:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
ananke (ananke) wrote :

AMI used to generate the logs: ami-0a3803e4b51dabb6d (2019.03, us-east-1). We also saw the same thing with 2020.01: ami-0c11557d0e4e9c896

Revision history for this message
Ryan Harper (raharper) wrote :

The long timeout looks like networking ... maybe VPC related? not sure:

2020-03-27 20:05:56,061 - util.py[DEBUG]: Running command ['ip', '-4', 'route', 'add', 'default', 'via', '192.168.251.129', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2020-03-27 20:06:36,198 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.134 seconds
2020-03-27 20:06:46,208 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 10.010 seconds

Notice the timestamps; we tried to resolve the the standard metadata via IP;
that shouldn't happen so possibly DNS resolver entry isn't correct.

Let's see what's inside the AMI and go from there

Revision history for this message
Ryan Harper (raharper) wrote :

I'm interested in:

ls -al /etc/resolv*
cat /etc/resolv.conf
ls -al /etc/network/*
find /etc/systemd/system
ls -al /lib/systemd/system/*network*
ifquery --state --all
nmcli conn
systemctl status --no-pager
systemclt list-jobs
ip --brief route
ip --brief a

Revision history for this message
ananke (ananke) wrote :
Download full text (9.4 KiB)

Sorry for the delay, here's the relevant information:

kali@kali:~$ ls -la /etc/resolv*
-rw-r--r-- 1 root root 65 Mar 30 12:23 /etc/resolv.conf

/etc/resolvconf:
total 20
drwxr-xr-x 3 root root 4096 Sep 2 2019 .
drwxr-xr-x 159 root root 12288 Mar 30 12:23 ..
drwxr-xr-x 2 root root 4096 Sep 2 2019 update-libc.d

kali@kali:~$ cat /etc/resolv.conf
domain ec2.internal
search ec2.internal
nameserver 192.168.251.2

kali@kali:~$ ls -al /etc/network/*
-rw-r--r-- 1 root root 272 Sep 2 2019 /etc/network/interfaces

/etc/network/if-down.d:
total 12
drwxr-xr-x 2 root root 4096 Sep 2 2019 .
drwxr-xr-x 7 root root 4096 Sep 2 2019 ..
-rwxr-xr-x 1 root root 372 Feb 20 2019 openvpn
lrwxrwxrwx 1 root root 32 Sep 2 2019 wpasupplicant -> ../../wpa_supplicant/ifupdown.sh

/etc/network/if-post-down.d:
total 20
drwxr-xr-x 2 root root 4096 Sep 2 2019 .
drwxr-xr-x 7 root root 4096 Sep 2 2019 ..
lrwxrwxrwx 1 root root 23 Sep 2 2019 avahi-daemon -> ../if-up.d/avahi-daemon
-rwxr-xr-x 1 root root 1714 Sep 27 2016 ifenslave
lrwxrwxrwx 1 root root 28 Sep 2 2019 macchanger -> ../../macchanger/ifupdown.sh
-rwxr-xr-x 1 root root 1433 Feb 4 2019 vlan
-rwxr-xr-x 1 root root 1409 Jun 5 2018 wireless-tools
lrwxrwxrwx 1 root root 32 Sep 2 2019 wpasupplicant -> ../../wpa_supplicant/ifupdown.sh

/etc/network/if-pre-up.d:
total 36
drwxr-xr-x 2 root root 4096 Sep 2 2019 .
drwxr-xr-x 7 root root 4096 Sep 2 2019 ..
-rwxr-xr-x 1 root root 344 Jun 30 2016 ethtool
-rwxr-xr-x 1 root root 6298 Sep 27 2016 ifenslave
lrwxrwxrwx 1 root root 28 Sep 2 2019 macchanger -> ../../macchanger/ifupdown.sh
-rwxr-xr-x 1 root root 4224 Feb 21 2019 vlan
-rwxr-xr-x 1 root root 4191 Sep 15 2018 wireless-tools
lrwxrwxrwx 1 root root 32 Sep 2 2019 wpasupplicant -> ../../wpa_supplicant/ifupdown.sh

/etc/network/if-up.d:
total 32
drwxr-xr-x 2 root root 4096 Sep 2 2019 .
drwxr-xr-x 7 root root 4096 Sep 2 2019 ..
-rwxr-xr-x 1 root root 484 Apr 27 2018 avahi-daemon
-rwxr-xr-x 1 root root 1685 Jun 30 2016 ethtool
-rwxr-xr-x 1 root root 1741 Sep 27 2016 ifenslave
-rwxr-xr-x 1 root root 677 Feb 4 2019 ip
-rwxr-xr-x 1 root root 729 Jul 26 2019 miredo
-rwxr-xr-x 1 root root 385 Feb 20 2019 openvpn
lrwxrwxrwx 1 root root 32 Sep 2 2019 wpasupplicant -> ../../wpa_supplicant/ifupdown.sh

/etc/network/interfaces.d:
total 12
drwxr-xr-x 2 root root 4096 Mar 27 20:06 .
drwxr-xr-x 7 root root 4096 Sep 2 2019 ..
-rw-r--r-- 1 root root 367 Mar 27 20:06 50-cloud-init.cfg

kali@kali:~$ find /etc/systemd/system
/etc/systemd/system
/etc/systemd/system/sockets.target.wants
/etc/systemd/system/sockets.target.wants/uuidd.socket
/etc/systemd/system/sockets.target.wants/dm-event.socket
/etc/systemd/system/sockets.target.wants/rpcbind.socket
/etc/systemd/system/sockets.target.wants/pcscd.socket
/etc/systemd/system/sshd.service
/etc/systemd/system/multi-user.target.wants
/etc/systemd/system/multi-user.target.wants/NetworkManager.service
/etc/systemd/system/multi-user.target.wants/pppd-dns.service
/etc/systemd/system/multi-user.target.wants/binfmt-support.service
/etc/systemd/system/multi-user.target.wants/rsync.service
/etc/systemd/system/multi-user.target....

Read more...

Revision history for this message
ananke (ananke) wrote :
Download full text (4.1 KiB)

Few additional data points:

1) Resolving URL wait time seemed to be very long during first boot, but then subsequent ones didn't have this issue:

/var/log/cloud-init.log:2020-03-27 20:06:36,198 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 40.134 seconds
/var/log/cloud-init.log:2020-03-27 20:06:46,208 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 10.010 seconds
/var/log/cloud-init.log:2020-03-30 13:10:13,660 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 0.030 seconds
/var/log/cloud-init.log:2020-03-30 13:10:13,661 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 0.001 seconds
/var/log/cloud-init.log:2020-03-30 13:17:44,571 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 0.022 seconds
/var/log/cloud-init.log:2020-03-30 13:17:44,572 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 0.001 seconds
/var/log/cloud-init.log:2020-03-30 13:20:33,384 - util.py[DEBUG]: Resolving URL: http://169.254.169.254 took 0.017 seconds
/var/log/cloud-init.log:2020-03-30 13:20:33,386 - util.py[DEBUG]: Resolving URL: http://instance-data.:8773 took 0.002 seconds

2) network-online.targets appears to have two dependencies: networking.service and NetworkManager-wait-online.service. The latter was in dead state, despite nm-online being able to connect. Disabling it did not solve the problem: cloud-init still shows running state after reboot

Before:

root@kali:~# systemctl list-dependencies network-online.target
network-online.target
● ├─networking.service
● └─NetworkManager-wait-online.service
root@kali:~# systemctl status NetworkManager-wait-online.service

After :

root@kali:~# systemctl list-dependencies network-online.target
network-online.target
● └─networking.service

3) cloud-init analyze blame shows long timeout during first boot on ec2local data source, but not on subsequent boots:

root@kali:~# cloud-init analyze blame
-- Boot Record 01 --
     51.20500s (init-local/search-Ec2Local)
     00.91700s (init-network/config-users-groups)
     00.67200s (init-network/config-growpart)
     00.27400s (init-network/config-resizefs)
     00.24800s (init-network/config-ssh)
     00.00600s (init-network/consume-user-data)
     00.00300s (init-network/check-cache)
     00.00200s (init-network/activate-datasource)
     00.00100s (init-network/config-write-files)
     00.00100s (init-network/config-update_hostname)
     00.00100s (init-network/config-update_etc_hosts)
     00.00100s (init-network/config-timezone)
     00.00100s (init-network/config-set_hostname)
     00.00100s (init-network/config-seed_random)
     00.00100s (init-network/config-rsyslog)
     00.00100s (init-network/config-migrator)
     00.00100s (init-network/config-ca-certs)
     00.00100s (init-local/check-cache)
     00.00000s (init-network/setup-datasource)
     00.00000s (init-network/consume-vendor-data)
     00.00000s (init-network/config-bootcmd)

-- Boot Record 02 --
     02.00900s (init-local/search-Ec2Local)
     00.07000s (init-network/config-growpart)
     00.03600s (init-network/config-resizefs)
     00.00700s (init-network/consume-user-data)
     00.00500s (init-local/check-cache)
     00.00200s...

Read more...

Revision history for this message
Ryan Harper (raharper) wrote :

Thanks. One thing that doesn't look correct is:

/etc/systemd/system/cloud-init.target.wants:

This should have all 4 services:

root@b1:/etc/systemd/system/cloud-init.target.wants# ls -al
total 22
drwxr-xr-x 2 root root 6 Mar 20 15:59 .
drwxr-xr-x 15 root root 19 Mar 20 16:00 ..
lrwxrwxrwx 1 root root 40 Mar 20 15:59 cloud-config.service -> /lib/systemd/system/cloud-config.service
lrwxrwxrwx 1 root root 39 Mar 20 15:59 cloud-final.service -> /lib/systemd/system/cloud-final.service
lrwxrwxrwx 1 root root 44 Mar 20 15:59 cloud-init-local.service -> /lib/systemd/system/cloud-init-local.service
lrwxrwxrwx 1 root root 38 Mar 20 15:59 cloud-init.service -> /lib/systemd/system/cloud-init.service

If you also have:

journalctl -o short-monotonic -b > journal.log
systemd-analyze critical-chain network-online.target
systemd-analyze critical-chain cloud-init.target

Revision history for this message
ananke (ananke) wrote :

Looks like the root cause was with how Kali AMI was built, it lacked few key things, such as having three of the cloud-init services enabled.

I've opened a bug with them and looks like they're adjusting things per recommendation: https://bugs.kali.org/view.php?id=6239

Thank you again!

Revision history for this message
Ryan Harper (raharper) wrote :

Excellent news! Thanks for following up with Kali upstream. I'm closing this issue as invalid for cloud-init. If you find out after the Kali changes there are still issues, either re-open this bug or file a new one if the issue/behavior is different.

Thanks!

Changed in cloud-init:
status: Incomplete → Invalid
Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.