install hooks fail because enable-os-upgrade/refresh-update ignored

Bug #1681005 reported by Curtis Hovey
4
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Invalid
Undecided
Unassigned
cloud-images
Invalid
Undecided
Unassigned
cloud-init
Expired
Undecided
Unassigned

Bug Description

As seen at
   http://reports.vapour.ws/releases/issue/5553b279749a56508ba88b45

There is a dramatic rise in hook failures since March 24. This is because juju is not running apt-get update or upgrade.

Joyent dominates, but rackspace and GCE have had spots of failures. The unit logs show apt fails to find packages. Joyent tends to have stale images so Juju CI tests always run with os-enable-refresh-update to have the correct list of packages.

Today I deployed the landscape bundle in joyent and repeated the postgresql install hook failures. The log confirm an apt issue

    E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-libc-dev_4.4.0-71.92_amd64.deb 404 Not Found [IP: 91.189.88.149 80]

juju model-config shows apt should have been updated and the host upgraded

    enable-os-refresh-update default true
    enable-os-upgrade default true

But /var/log/apt/history.log and /var/log/apt/term.log show apt has not been touched since the image was made in March.

I left the model untouched for 3 hours. It remained in error. I ran
    juju ssh 2 -- sudo apt-get update

3 minutes later the unit was installed.

This bug is in 2.1 and 2.2

Revision history for this message
Ian Booth (wallyworld) wrote :

Both enable-os-refresh-update and enable-os-upgrade are true by default.

As a first step, I tested bootstrapping on LXD and adding a machine. For both machines, I added extra tracing to Juju to show that the cloud init script rendering was being passed true for both config values as expected. We are correctly setting the "package_update" and "package_upgrade" attributes in cloud init.

I can see in cloud init output that apt-get update and upgrade appear to be run.

Next step to test on a public cloud.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Yes I know update and upgrade are enabled be default, which is why it took so long to discover this bug. We have never seen an issue with bootstrap. In fact, I think bootstrap has always worked. When health checks and tests cannot bootstrap, We know the cloud is ill or mirrors are stale. Apt issues in the log are brought to IS who control the mirrors. We occasionally see a mirror has OOMed.

Since cloud-init is involved, could we have provider specific problems? From my memory
* AWS has never broken like this and cloudware tests images there.
* Azure has never broken like this.
* GCE has broken recently, but the issue resolved itself in a few days. GCE might get images almost as often as AWS and Azure
* Rackspace has broken recently for short period and they have a history stale images
* Joyent is broken now and has a history of stale images.

Revision history for this message
Ian Booth (wallyworld) wrote :

Looking at machines run up in Joyent, the cloud init log clearly shows the package update/upgrade attributes are being correctly set:

<snip>
output:
  all: '| tee -a /var/log/cloud-init-output.log'
package_update: true
package_upgrade: true
packages:
<snip>

So juju is behaving correctly. There may be an upstream cloud-init change. The apt history shows a dist-upgrade is being done. But the cloud init output log does imply an apt update/upgrade is being done also.

Revision history for this message
Ian Booth (wallyworld) wrote :

I've checked the cloud init logs further and the correct apt-get commands are running.

cloud-init.log:Apr 12 01:14:45 13633ad8-e1f3-6a12-dd49-a5a992c29f5b [CLOUDINIT] util.py[DEBUG]: Running command ['eatmydata', 'apt-get', '--option=Dpkg::Options::=--force-confold', '--option=Dpkg::options::=--force-unsafe-io', '--assume-yes', '--quiet', 'update'] with allowed return codes [0] (shell=False, capture=False)
cloud-init.log:Apr 12 01:14:56 13633ad8-e1f3-6a12-dd49-a5a992c29f5b [CLOUDINIT] util.py[DEBUG]: apt-update [eatmydata apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet update] took 10.735 seconds
cloud-init.log:Apr 12 01:14:56 13633ad8-e1f3-6a12-dd49-a5a992c29f5b [CLOUDINIT] util.py[DEBUG]: Running command ['eatmydata', 'apt-get', '--option=Dpkg::Options::=--force-confold', '--option=Dpkg::options::=--force-unsafe-io', '--assume-yes', '--quiet', 'dist-upgrade'] with allowed return codes [0] (shell=False, capture=False)
cloud-init.log:Apr 12 01:15:24 13633ad8-e1f3-6a12-dd49-a5a992c29f5b [CLOUDINIT] util.py[DEBUG]: apt-upgrade [eatmydata apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet dist-upgrade] took 27.669 seconds

Revision history for this message
Ian Booth (wallyworld) wrote :

I can also see the expected apt-get output in the cloud init logs showing the package indices being updated from the expected sources, and the result of the dist-upgrade. So from what I can see:

- Juju is correctly configuring cloud init to run the apt commands
- cloud init is running the commands
- the cloud init logs show the commands being run

Maybe the archive mirrors are out of date?
The log output below shows the refresh happening based off a mirror.

Cloud-init v. 0.7.5 running 'modules:config' at Wed, 12 Apr 2017 01:14:36 +0000. Up 15.00 seconds.
Generating locales...
  en_US.UTF-8... up-to-date
Generation complete.
Get:1 http://security.ubuntu.com trusty-security InRelease [65.9 kB]
Get:2 http://security.ubuntu.com trusty-security/main Sources [129 kB]
Get:3 http://security.ubuntu.com trusty-security/universe Sources [50.6 kB]
Get:4 http://security.ubuntu.com trusty-security/main amd64 Packages [606 kB]
Get:5 http://security.ubuntu.com trusty-security/universe amd64 Packages [155 kB]
Ign http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty InRelease
Get:6 http://security.ubuntu.com trusty-security/main Translation-en [332 kB]
Get:7 http://security.ubuntu.com trusty-security/universe Translation-en [90.3 kB]
Get:8 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates InRelease [65.9 kB]
Hit http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-backports InRelease
Hit http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty Release.gpg
Get:9 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates/main Sources [395 kB]
Get:10 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates/restricted Sources [6327 B]
Get:11 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates/universe Sources [176 kB]
Get:12 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates/multiverse Sources [7759 B]
Get:13 http://eu-ams-1.joyent.clouds.archive.ubuntu.com trusty-updates/main amd64 Packages [973 kB]
<snip>

Revision history for this message
Ian Booth (wallyworld) wrote :

This looks like an issue between cloud-init 0.7.5 on trusty vs 0.7.9 on xenial.

Booting a trusty machine shows the cloud-init.log with the required apt-get commands.
On a xenial machine, there are no errors but all of the apt-get update/upgrade commands are missing from cloud-init.log

Revision history for this message
Ian Booth (wallyworld) wrote :

The xenial machine which doesn't run apt-get update/upgrade appears to have the correct config as shown in cloud-init.log

<snip>

output:
  all: '| tee -a /var/log/cloud-init-output.log'
package_update: true
package_upgrade: true
packages:
- curl
- cpu-checker
- bridge-utils
- cloud-utils
- tmux
- socat

<snip>

Revision history for this message
Ian Booth (wallyworld) wrote :

I've targetted to cloud-init to see if those folks can give some feedback as to what might be happening.

Revision history for this message
Ian Booth (wallyworld) wrote :

Looking at the cloud init log on xenial, I can't see that's it's importing /usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py

On trusty, it's a Python 2.7 dist package (which works). Not sure that makes a difference or not.

Changed in juju:
status: Triaged → Incomplete
milestone: 2.2-beta3 → none
Revision history for this message
Ian Booth (wallyworld) wrote :

This was tested on AWS and vSphere and apt-get update/upgrade run as expected.

So it appears the issue might be something to do with the Joyent images themselves?

Revision history for this message
Curtis Hovey (sinzui) wrote :

The "charms tests" section in any test results (http://reports.vapour.ws/releases/5121) show we have comprehensive testing across substrates. We know know joyent is problematic at this time.

Reviewing http://reports.vapour.ws/releases/issue/5553b279749a56508ba88b45 for specific substrates. We can see that rackspace was pretty busted Feb 1-8. GCE which, I thought has recent failures, not so (Mary 2016).

Given that this is likely a cloud-int or cloud-image problem, we are not likely to see this issue in images that we have 100% control like AWS, Azure, MAAS, vsphere, lxd. The three clouds that alter our images (and delay releases) are Joyent, Rackspace, and GCE.

I feel pretty powerless in this matter. I know the test will pass if I could force apt-get update. I presume juju tried and it was ignore. I could modify the charm or the bundle to trigger apt-get update. This act would undermine the landscape (not juju) in this test. Bad images are a cloud-health issue and I prefer to be catching this class on issue in cloud-heatlh checks, not juju revision tests.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking as Invalid for Juju - nothing that we can fix in our codebase.

Changed in juju:
status: Incomplete → Invalid
importance: Critical → Undecided
tags: added: id-58ee260e4c1fcd6e9adb77fe
Revision history for this message
Dan Watkins (oddbloke) wrote :

It's not entirely clear to me what the bug being reported against cloud-init here is. Given the age of the bug, I assume it no longer applies, but please do give us reproduction steps and move it back to New if I am mistaken.

Changed in cloud-init:
status: New → Incomplete
Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Incomplete → Expired
Éric St-Jean (esj)
Changed in cloud-images:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.