upgrade-juju 1.16.6 -> 1.18 (tip) fails

Bug #1299802 reported by Roger Peppe
This bug affects 1 person
Affects             Status        Importance  Assigned to    Milestone
juju-core           Fix Released  High        John A Meinel
  1.18              Fix Released  Critical    John A Meinel
juju-core (Ubuntu)  Fix Released  Critical    Unassigned
  Trusty            Fix Released  Critical    Unassigned

Bug Description

To reproduce (I've now reproduced the issue three times
in a row, although sometimes one or other of the
unit agents does manage to upgrade successfully):

# deploy a 1.16 environment:
$ pwd
/home/rog/src/go/src/launchpad.net/juju-core/
$ bzr update -r juju-1.16.4
$ godeps -u dependencies.tsv
$ go install ./...
$ juju bootstrap
$ juju deploy wordpress
$ juju deploy mysql
$ juju add-relation wordpress mysql
# wait for services to start successfully
$ bzr update
$ bzr revision-info --tree
2509 tarmac-20140328183634-d639ea10wem30lfv
# use 1.18 as version number to avoid development-version
# logic.
$ ed version/version.go
/version = /s/".*"/"1.18.0"/
w
q
$ godeps -u dependencies.tsv
$ go install ./...
$ juju upgrade-juju --upload-tools
$

Sample status from the resulting environment is appended at the end of this description;
all-machines.log is attached.

The wordpress/0 unit has failed to come up.
Looking at the logs, it seems that the agent upgrades to 1.18,
but then the DesiredVersion API call returns 1.16 again
(this is weird), so it downgrades, but then the agent
configuration file is not compatible, because 1.18 config
files are not readable by 1.16 agents.

$ juju status
WARNING unknown config field "public-bucket"
WARNING unknown config field "public-bucket-region"
environment: sparse
machines:
  "0":
    agent-state: started
    agent-version: 1.18.0.1
    dns-name: ec2-54-82-187-161.compute-1.amazonaws.com
    instance-id: i-e383d6b2
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "1":
    agent-state: started
    agent-version: 1.18.0.1
    dns-name: ec2-54-82-25-135.compute-1.amazonaws.com
    instance-id: i-db88dd8a
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "2":
    agent-state: started
    agent-version: 1.18.0.1
    dns-name: ec2-54-80-24-48.compute-1.amazonaws.com
    instance-id: i-348cd965
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
services:
  mysql:
    charm: cs:precise/mysql-38
    exposed: false
    relations:
      cluster:
      - mysql
      db:
      - wordpress
    units:
      mysql/0:
        agent-state: started
        agent-version: 1.18.0.1
        machine: "2"
        public-address: ec2-54-80-24-48.compute-1.amazonaws.com
  wordpress:
    charm: cs:precise/wordpress-21
    exposed: false
    relations:
      db:
      - mysql
      loadbalancer:
      - wordpress
    units:
      wordpress/0:
        agent-state: down
        agent-state-info: (started)
        agent-version: 1.18.0.1
        machine: "1"
        open-ports:
        - 80/tcp
        public-address: ec2-54-82-25-135.compute-1.amazonaws.com

Roger Peppe (rogpeppe) wrote :
John A Meinel (jameinel) wrote :

Right now this would appear to block 1.18.0, because it would prevent upgrading from 1.16.6.

I'll dig into this and see if I can reproduce.

Changed in juju-core:
importance: Undecided → Critical
milestone: none → 1.18.0
status: New → Triaged
description: updated
John A Meinel (jameinel) wrote :

This doesn't reproduce for me. Is it reliably reproducible for you? I made sure to use the same version of juju-core that you tested with (vs just tip of trunk).

John A Meinel (jameinel) wrote :

This is a successful upgrade, which was the first one I tried.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Incomplete
John A Meinel (jameinel)
Changed in juju-core:
importance: Critical → High
milestone: 1.19.0 → 1.19.1
Aaron Bentley (abentley) wrote :

Here is a log of a failed upgrade:
https://pastebin.canonical.com/107906/

It succeeds when --version is specified:
https://pastebin.canonical.com/107907/

John A Meinel (jameinel) wrote :

So we sorted out the logic of how it could fail.

Unit agents in 1.18 now watch the version of their associated Machine agent. So DesiredVersion returns the version of the Machine agent in 1.18.

However, 1.16 agents all just watch the global environment configuration value.

Which means there is a race. If

1) The Unit agent and API server both see the updated DesiredVersion and download and install their tools.
2) The Machine agent has not seen it yet (or is still downloading).
3) The API server and Unit agent upgrade themselves and start running the new code.
At this point, the Unit agent will check and see that it should be running the same version as its Machine agent, which is now older than its current version, so it downgrades.
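
The race can be sketched as a small model. This is only an illustration of the two lookup rules described above, not juju-core code: the Version type and the desiredVersion116 and desiredVersion118 functions are hypothetical names invented for this example.

```go
package main

import "fmt"

// Version is a simplified major.minor.patch triple.
// Hypothetical type for illustration; not juju-core's version package.
type Version struct{ Major, Minor, Patch int }

func (v Version) String() string {
	return fmt.Sprintf("%d.%d.%d", v.Major, v.Minor, v.Patch)
}

// Less reports whether v is older than o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Patch < o.Patch
}

// desiredVersion116 models the 1.16 rule: every agent follows the
// environment-wide agent version and ignores the machine agent.
func desiredVersion116(env, machine Version) Version { return env }

// desiredVersion118 models the 1.18 rule: a unit agent follows the
// version of its machine agent.
func desiredVersion118(env, machine Version) Version { return machine }

func main() {
	env := Version{1, 18, 0}     // set by upgrade-juju
	machine := Version{1, 16, 6} // machine agent has not upgraded yet

	// The unit agent, still running 1.16 code, follows the environment.
	unit := desiredVersion116(env, machine)
	fmt.Println("unit agent upgraded to", unit)

	// Now running 1.18 code, it asks again and is told to match its machine.
	want := desiredVersion118(env, machine)
	if want.Less(unit) {
		fmt.Println("unit agent told to downgrade to", want)
	}
}
```

If the machine agent had already upgraded before the unit agent asked again, both rules would agree and no downgrade would be requested, which fits the observation that the failure is intermittent.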

I'm not sure how to fix it, but at least we can understand *why* it is happening.

John A Meinel (jameinel) wrote :

The best way that we can think of to work around this is to just refuse to downgrade. The idea that we *could* downgrade has always been tempting, but all of our logic is around Upgrade and Upgrade steps, and we actually have 0 logic around Downgrade steps. (So things like rewriting the schema in the database or the agent.conf files are actually not reversible.)

Changed in juju-core:
importance: High → Critical
milestone: 1.19.1 → 1.19.0
status: Incomplete → Triaged
John A Meinel (jameinel) wrote :

This actually is critical, because if you do get into this situation it will be really hard to recover from it.

Changed in juju-core:
assignee: nobody → John A Meinel (jameinel)
John A Meinel (jameinel)
Changed in juju-core:
status: Triaged → In Progress
Go Bot (go-bot)
Changed in juju-core:
status: In Progress → Fix Committed
James Page (james-page)
Changed in juju-core (Ubuntu Trusty):
importance: Undecided → Critical
status: New → Triaged
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package juju-core - 1.18.1-0ubuntu1

---------------
juju-core (1.18.1-0ubuntu1) trusty; urgency=medium

  * New upstream point release, including fixes for:
    - Upgrading juju 1.16.6 -> 1.18.x fails (LP: #1299802).
    - Peer relation disappears during juju-upgrade (LP: #1303697).
    - public-address of units changes to internal bridge post upgrade
      (LP: #1303735).
    - Unable to deploy local charms without series (LP: #1303880).
    - juju scp no longer allows multiple extra arguments to be passed
      (LP: #1306208).
    - juju cannot downgrade to same major.minor version with earlier
      patch number (LP: #1306296).
 -- James Page <email address hidden> Sat, 12 Apr 2014 07:04:37 +0100

Changed in juju-core (Ubuntu Trusty):
status: Triaged → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Sacha Yunusic (sacha-m) wrote :

I just upgraded to 1.21.3 and all I get is a "config-changed" error on several services... I tried "juju resolved --retry landscape/0", with no luck. Help please!
