bootcmd to write /etc/apt/apt.conf.d/95-juju-proxy-settings is not executed

Bug #1807615 reported by Ed Stewart on 2018-12-09
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack openstack-dashboard charm
Undecided
Unassigned
juju
Undecided
Dmitrii Shcherbakov

Bug Description

Installing openstack-dashboard (cs:openstack-dashboard-274) on lxd bionic container. About 80% of the time, this results in an error during the install hook with the following showing in the juju log:

2018-12-09 19:28:05 DEBUG install dpkg-query: package 'python-pip' is not installed and no information is available
2018-12-09 19:28:05 DEBUG install Use dpkg --info (= dpkg-deb --info) to examine archive files,
2018-12-09 19:28:05 DEBUG install and dpkg --contents (= dpkg-deb --contents) to list their contents.
2018-12-09 19:28:06 DEBUG install Reading package lists...
2018-12-09 19:28:06 DEBUG install Building dependency tree...
2018-12-09 19:28:06 DEBUG install Reading state information...
2018-12-09 19:28:06 DEBUG install E: Unable to locate package python-pip
2018-12-09 19:28:06 ERROR juju.worker.uniter.operation runhook.go:114 hook "install" failed: exit status 100

ssh'ing into the container and running sudo apt update resolves the issue so looks like something very similar to https://bugs.launchpad.net/charm-aodh/+bug/1765212.

Looking at the code, the main install() proc runs apt_update(fatal=True) which is great, however, the install hook calls a bash script first which is where pip gets installed:

declare -a DEPS=('apt' 'netaddr' 'netifaces' 'pip' 'yaml' 'dnspython')

check_and_install() {
    pkg="${1}-${2}"
    if ! dpkg -s ${pkg} 2>&1 > /dev/null; then
        apt-get -y install ${pkg}
    fi
}

...

exec ./hooks/install.real

install.real is where the apt update takes place, however, if pip isn't installed, the apt-get happens before the apt update leading to this issue.

Suggest adding an "apt update" before apt-get in /hooks/install

Ed Stewart (emcs2) wrote :
Download full text (31.7 KiB)

Post discussion with James P - this failed again this morning. Here is the the model config:

$ juju model-config
Attribute From Value
agent-metadata-url default ""
agent-stream default released
agent-version model 2.4.4
apt-ftp-proxy default ""
apt-http-proxy default ""
apt-https-proxy default ""
apt-mirror default ""
apt-no-proxy default ""
automatically-retry-hooks default true
backup-dir default ""
cloudinit-userdata default ""
container-image-metadata-url default ""
container-image-stream default released
container-inherit-properties default ""
container-networking-method model fan
default-series default bionic
development default false
disable-network-management default false
egress-subnets default ""
enable-os-refresh-update default true
enable-os-upgrade default true
fan-config model 10.142.0.0/20=252.0.0.0/8
firewall-mode default instance
ftp-proxy default ""
http-proxy default ""
https-proxy default ""
ignore-machine-addresses default false
image-metadata-url default ""
image-stream default released
juju-ftp-proxy default ""
juju-http-proxy default ""
juju-https-proxy default ""
juju-no-proxy default 127.0.0.1,localhost,::1
logforward-enabled default false
logging-config model <root>=WARNING;unit=DEBUG
max-action-results-age default 336h
max-action-results-size default 5G
max-status-history-age default 336h
max-status-history-size default 5G
net-bond-reconfigure-delay default 17
no-proxy default 127.0.0.1,localhost,::1
provisioner-harvest-mode default destroyed
proxy-ssh default false
resource-tags model {}
snap-http-proxy default ""
snap-https-proxy default ""
snap-store-assertions default ""
snap-store-proxy default ""
ssl-hostname-verification default true
storage-default-block-source model gce
test-mode default false
transmit-vendor-metrics default true
update-status-hook-interval default 5m

Here is the output of the machine-0-lxd-14.log:

2018-12-10 12:04:50 INFO juju.cmd supercommand.go:56 running jujud [2.4.4 gc go1.11]
2018-12-10 12:04:50 DEBUG juju.cmd supercommand.go:57 args: []string{"/var/lib/juju/tools/machine-0-lxd-14/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "0/lxd/14", "--debug"}
2018-12-10 12:04:50 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 8
2018-12-10 12:04:50 DEBUG juju.agent agent.go:542 read agent config, format "2.0"
2018-12-10 12:04:50 INFO juju.worker.upgradesteps worker.go:74 upgrade steps for 2.4.4 have already been run.
2018-12-10 12:04:50 DEBUG juju.worker.dependency en...

Ed Stewart (emcs2) wrote :

machine-0-lxd-14.log attached

James Page (james-page) wrote :

enable-os-refresh-update = true so units should get an update call before the charm is installed and executed; the referenced aodh bug may have the same root cause.

James Page (james-page) wrote :

Ed - any chance you can attach all of the log files in /var/log/juju and /var/log/cloud-init* to this bug report? that might tell us a bit more as to what's not happening correctly.

Changed in charm-openstack-dashboard:
status: New → Incomplete
Ed Stewart (emcs2) wrote :

Adding /var/log/juju/*.logs

Ed Stewart (emcs2) wrote :

/var/log/cloud-init* logs

please note these logs are from reproducing on charm version 274

Ed Stewart (emcs2) wrote :

is any more info required from us? Note this bug will be marked for expiry soon.

tags: added: atos
James Page (james-page) wrote :

OK so here is the root cause:

Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Reading package lists...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.

Juju does attempt to complete an update call via the cloud init tool; however it fails.

Changed in charm-openstack-dashboard:
status: Incomplete → New
James Page (james-page) wrote :

(from cloud-init-output.log)

2018-12-15 07:21:07,921 - stages.py[WARNING]: Failed to rename devices: Failed to apply network config names. Found bad network config version: None

AFAICT the container never actually got networked; somehow the agent binary was downloaded:

+ curl -sSfw agent binaries from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s --connect-timeout 20 --noproxy * --insecure -o /var/lib/juju/tools/2.4.4-bionic-amd64/tools.tar.gz https://10.142.0.4:17070/model/6146af0e-a943-4b69-8564-6d056025c5a7/tools/2.4.4-bionic-amd64
agent binaries from https://10.142.0.4:17070/model/6146af0e-a943-4b69-8564-6d056025c5a7/tools/2.4.4-bionic-amd64 downloaded: HTTP 200; time 0.318633s; size 26120071 bytes; speed 82138588.000 bytes/s + echo Agent binaries downloaded successfully.
Agent binaries downloaded successfully.

but the machine does not appear to be networked at the end of the cloud-init processing:

+ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
        inet 252.0.217.26 netmask 255.0.0.0 broadcast 252.255.255.255
        inet6 fe80::216:3eff:fea6:d719 prefixlen 64 scopeid 0x20<link>
        ether 00:16:3e:a6:d7:19 txqueuelen 1000 (Ethernet)
        RX packets 1439 bytes 26289010 (26.2 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 776 bytes 54278 (54.2 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 476 bytes 38184 (38.1 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 476 bytes 38184 (38.1 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

James Page (james-page) wrote :

Would it also be possible to get the contents of /var/lib/cloud ? that might give us a bit more of a clue as to what's going on here.

I'm raising a juju bug as well as it may well be that the issue lies in cloud-init data that juju is generating for a LXD container ontop of a GCE instance.

Changed in charm-openstack-dashboard:
status: New → Incomplete
summary: - Installing on lxd gives an error due to python-pip not being found
+ gce: network broken, no apt update/install with LXD container

I managed to reproduce this in an environment where there is no default gateway at all and package downloads are only doable via a proxy server. I do not recall the same happening when a default gateway was configured.

ubuntu@juju-6d9aac-3-lxd-1:~$ apt policy python-pip
N: Unable to locate package python-pip

cat /etc/apt/apt.conf.d/95-juju-proxy-settings
Acquire::http::Proxy "http://10.232.0.1:3128";
Acquire::https::Proxy "http://10.232.0.1:3128";
Acquire::http::Proxy::"127.0.0.1" "DIRECT";
Acquire::https::Proxy::"127.0.0.1" "DIRECT";
Acquire::ftp::Proxy::"127.0.0.1" "DIRECT";
Acquire::http::Proxy::"::1" "DIRECT";
Acquire::https::Proxy::"::1" "DIRECT";
Acquire::ftp::Proxy::"::1" "DIRECT";
Acquire::http::Proxy::"localhost" "DIRECT";
Acquire::https::Proxy::"localhost" "DIRECT";
Acquire::ftp::Proxy::"localhost" "DIRECT";

Doing `apt update` allows the installation to go forward.

Changed in charm-openstack-dashboard:
status: Incomplete → Confirmed
Dmitrii Shcherbakov (dmitriis) wrote :
Dmitrii Shcherbakov (dmitriis) wrote :
Dmitrii Shcherbakov (dmitriis) wrote :

Routing table for that container:

ubuntu@juju-6d9aac-3-lxd-1:~$ ip route
10.232.0.0/21 dev eth1 proto kernel scope link src 10.232.1.130
10.232.40.0/21 dev eth0 proto kernel scope link src 10.232.46.35

Ed Stewart (emcs2) wrote :

@James P - I'll set up an environment to recreate. Certainly half explains it, but adds the confusion why cloud-init isn't getting an IP from the host VM. Can't see why it's a GCE specific thing, sounds more like a delay in the fan network coming up?

Presume in your comment "but the machine does not appear to be networked at the end of the cloud-init processing:" - you meant 'DOES' appear to be networked by the end of cloud-init given that ifconfig output?

Dmitrii Shcherbakov (dmitriis) wrote :
Download full text (3.9 KiB)

Besides "no default gateway" I had this in cloudinit-userdata:

juju model-config cloudinit-userdata
write_files:
- content: |
    [Service]
    Environment=SNAPPY_STORE_NO_CDN=1
  owner: "root:root"
  path: /etc/systemd/system/snapd.service.d/proxy-nocdn.conf
  permissions: '0644'
packages:
  - squashfuse
snap:
  commands:
    "00": systemctl restart snapd

https://paste.ubuntu.com/p/sV9gqCYhVC/
================================================================

I removed this cloudinit-userdata model-config and retried without it (what's below is not present in archives uploaded above) - doesn't look like this model-config affects anything, however, I can see by crtime of 95-juju-proxy-settings in the container that apt proxy settings land in a container after apt update is done (22:11:48 for update vs 22:11:53 for crtime of the proxy settings file:

https://paste.ubuntu.com/p/w4qFpR6KJ6/

Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Wed, 13 Feb 2019 22:11:45 +0000. Up 27.00 seconds.
Ign:1 http://security.ubuntu.com/ubuntu bionic-security InRelease
Err:2 http://security.ubuntu.com/ubuntu bionic-security Release
  Cannot initiate the connection to security.ubuntu.com:80 (2001:67c:1360:8c01::19). - connect (101: Network is unreachable) Cannot initiate the connection to security
.ubuntu.com:80 (91.189.92.201). - connect (101: Network is unreachable)

# ...

2019-02-13 22:11:48,632 - util.py[WARNING]: Failed to install packages: ['curl', 'cpu-checker', 'bridge-utils', 'cloud-utils', 'tmux', 'ubuntu-fan']
2019-02-13 22:11:48,633 - cc_package_update_upgrade_install.py[WARNING]: 2 failed with exceptions, re-raising the last one
2019-02-13 22:11:48,634 - util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py'>) failed

debugfs -R 'stat /var/lib/lxd/storage-pools/default/containers/juju-6d9aac-3-lxd-2/rootfs/etc/apt/apt.conf.d/95-juju-proxy-settings' /dev/sda1 2>/dev/null | grep -P 'crtime|mtime'
 mtime: 0x5c64b32f:04de7d60 -- Thu Feb 14 00:15:43 2019
crtime: 0x5c649629:813b3c54 -- Wed Feb 13 22:11:53 2019

============================================

Although I can't see why that happens because bootcmd commands (including the one to write proxy apt settings) run before cc_package_update_upgrade_install

2019-02-13 22:11:21,350 - stages.py[DEBUG]: Running module bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_bootcmd.py'>) with frequency always
# ...
2019-02-13 22:11:21,679 - handlers.py[DEBUG]: finish: init-network/config-bootcmd: SUCCESS: config-bootcmd ran successfully

2019-02-13 22:11:46,314 - stages.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2019-02-13 22:11:46,314 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance

# ...
2019-02-13 22:11:47,029 - stages.py[DEBUG]: Running module package-update-upgrade-install (<m...

Read more...

Dmitrii Shcherbakov (dmitriis) wrote :
Download full text (3.3 KiB)

Back to the behavior in #16. I think there is something wrong in how bootcmd is generated by Juju or processed by cloud-init.

1) LXD container user.user-data config: https://paste.ubuntu.com/p/cXn75hS8J9/
2) user-data generated out of it /var/lib/cloud/seed/nocloud-net/user-data https://paste.ubuntu.com/p/6B4KXJ6xvd/

see https://github.com/lxc/lxd/blob/master/doc/cloud-init.md

Notice that the command to write out /etc/apt/apt.conf.d/95-juju-proxy-settings is the last bootcmd in both (1) and (2).

3) if I disable and mask Juju services

sudo systemctl mask jujud-unit-openstack-dashboard-4.service
sudo systemctl disable jujud-unit-openstack-dashboard-4.service

sudo systemctl disable jujud-machine-3-lxd-3.service
sudo systemctl mask jujud-machine-3-lxd-3.service

And then do

rm /etc/apt/apt.conf.d/95-juju-proxy-settings
reboot

95-juju-proxy-settings does not appear which means that this bootcmd never runs and the file never gets created.

3) If I move the bootcmd to write out the proxy file (95-juju-proxy-settings) to the top of the list of bootcmds (https://paste.ubuntu.com/p/tHMyrNWQYK/
) and run `rm -r /var/lib/cloud/instances/juju-6d9aac-3-lxd-3/ ; reboot`, then I get 95-juju-proxy-settings at the right place and time (crtime reported via debugfs is before bootcmd stage ends).

In other words, there is something wrong either in the format of one of the bootcmds or the way cloud-init processes this list of commands.

The behavior in #16 is due to the fact that the bootcmd related to writing 95-juju-proxy-settings never runs while a machine agent has its own code (in proxyupdater goroutine) to write 95-juju-proxy-settings (which obviously happens after bootcmd finishes):

https://github.com/juju/juju/blob/juju-2.5.0/worker/proxyupdater/proxyupdater.go#L250-L261
 AptProxyConfigFile = AptConfigDirectory + "/95-juju-proxy-settings"

https://github.com/juju/packaging/blob/ba21344fff207f4fa06b0a08bac386b179a2e6a1/config/apt_constants.go#L37-L40
  content := paccmder.ProxyConfigContents(w.aptProxy) + "\n"
  err = ioutil.WriteFile(config.AptProxyConfigFile, []byte(content), 0644)

I also explored if bootcmd code could be doing work with child processes incorrectly but it does wait for a child process to exit by calling "communicate" and this means that all file descriptors will be closed by the time parent gets unblocked (http://man7.org/linux/man-pages/man2/exit.2.html "... _exit() does close open file descriptors ...":

https://github.com/cloud-init/cloud-init/blob/ubuntu/18.4-0ubuntu1_18.04.1/cloudinit/config/cc_bootcmd.py#L79-L102
def handle(name, cfg, cloud, log, _args):
# ...
            cmd = ['/bin/sh', tmpf.name]
            util.subp(cmd, env=env, capture=False)
# ...

https://github.com/cloud-init/cloud-init/blob/ubuntu/18.4-0ubuntu1_18.04.1/cloudinit/util.py#L1919-L2033
def subp(args, data=None, rcs=None, env=None, capture=True,
         combine_capture=False, shell=False,
         logstring=False, decode="replace", target=None, update_env=None,
         status_cb=None):
# ...
    stdin = None
    stdout = None
    stderr = None
    if capture:
        stdout = subprocess.PIPE
        stderr = subprocess.PIPE

        sp = subpr...

Read more...

Dmitrii Shcherbakov (dmitriis) wrote :

Pull-request:

https://github.com/juju/juju/pull/9751

Looks like the 'exit 0' statement in the bootcmd above the one to write proxy settings is causing the premature bootcmd shell process exit.

Dmitrii Shcherbakov (dmitriis) wrote :

Subscribed ~field-high (Juju) to expedite this https://github.com/juju/juju/pull/9751

summary: - gce: network broken, no apt update/install with LXD container
+ bootcmd to write /etc/apt/apt.conf.d/95-juju-proxy-settings is not
+ executed
Pedro Guimarães (pguimaraes) wrote :

I am facing the same issue. Work-around at this moment with Juju 2.5 is:

cloudinit-userdata: |
  postruncmd:
    - apt update

James Page (james-page) on 2019-02-14
Changed in charm-openstack-dashboard:
status: Confirmed → Incomplete
Changed in juju:
status: New → Fix Committed
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
Dmitrii Shcherbakov (dmitriis) wrote :

By the way, Ed, in your cloud-init attachments https://bugs.launchpad.net/juju/+bug/1807615/comments/6 I can see the following:

  File "/usr/lib/python3/dist-packages/cloudinit/net/netplan.py", line 236, in _render_content
    if network_state.version == 2:
AttributeError: 'NoneType' object has no attribute 'version'

# ...

2018-12-15 07:21:08,859 - stages.py[WARNING]: Failed to rename devices: Failed to apply network config names. Found bad network config version: None

# ...
Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Sat, 15 Dec 2018 07:21:15 +0000. Up 9.00 seconds.

Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
  Temporary failure resolving 'archive.ubuntu.com'

This has since been worked around in Juju

https://bugs.launchpad.net/juju/+bug/1804493
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1798117

and is coming as an SRU for cloud-init (not yet in main for bionic at the time of writing):

https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1813346

Could you use Juju 2.4.7 instead 2.4.4?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers