juju upgrade failures

Bug #1507867 reported by Ian Booth
Affects: juju-core
Status: Expired
Importance: Critical
Assigned to: Unassigned
Milestone: none

Bug Description

This is a meta bug to capture the results of the analysis for RT 85463:
https://rt.admin.canonical.com/Ticket/Display.html?id=85463

Separate bugs will likely be opened to cover individual fixes.

----------------------------

* Test upgrade path 1.20.14 -> 1.24.6

* Case 1: without ignore-machine-addresses

- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju get-env ignore-machine-addresses
ERROR key "ignore-machine-addresses" not found in
"bootstack-staging" environment.
$ juju set-env tools-url=https://streams.canonical.com/juju/tools
$ juju upgrade-juju --version="1.24.6"
- Upgrade did complete, but most agents were not upgraded, i.e. they
are left on 1.20.14 (see the version check after this list)
- Unrecoverable hook errors
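
A quick way to see which agents are still on the old version (a sketch; the
agent-version keys are assumed from juju 1.x yaml status output):

$ juju status --format=yaml | grep 'agent-version:' | sort | uniq -c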

Initial analysis:
- unit agents never received upgrade notification
- host machine never upgraded
- tools could not be retrieved

2015-10-15 12:33:03 ERROR juju.worker.upgrader upgrader.go:157 failed to fetch tools from "https://172.20.168.3:17070/environment/5c8be479-220d-49fd-851e-507221728dfc/tools/1.24.6-trusty-amd64": bad HTTP response: 400 Bad Request

Looking on the state server machine 0 where the above request is processed:

2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:257 <- [5B6] machine-0-lxc-14 {"RequestId":212,"Type":"Upgrader","Request":"Tools","Params":"'params redacted'"}
2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:271 -> [5B6] machine-0-lxc-14 2.633245ms {"RequestId":212,"Response":"'body redacted'"} Upgrader[""].Tools
2015-10-15 12:33:00 DEBUG juju.apiserver utils.go:71 validate env uuid: state server environment - 5c8be479-220d-49fd-851e-507221728dfc
2015-10-15 12:33:00 ERROR juju.apiserver tools.go:59 GET(/environment/5c8be479-220d-49fd-851e-507221728dfc/tools/1.24.6-trusty-amd64?%3Aenvuuid=5c8be479-220d-49fd-851e-507221728dfc&%3Aversion=1.24.6-trusty-amd64&) failed: failed to open GridFS file "abafdc81-364f-4b92-8565-3415328319a1": not found
2015-10-15 12:33:00 DEBUG juju.apiserver tools.go:119 sending error: 400 failed to open GridFS file "abafdc81-364f-4b92-8565-3415328319a1": not found
2015-10-15 12:33:00 DEBUG juju.apiserver apiserver.go:257 <- [5D1] machine-1-lxc-11 {"RequestId":210,"Type":"Upgrader","Request":"Tools","Params":"'params redacted'"}

This implies that the underlying Juju blobstore has become corrupt - somehow a previously stored tools blob is not there.

Looking further up the log file:

2015-10-15 12:19:30 DEBUG juju.apiserver apiserver.go:271 -> [1B] machine-0 1.451437574s {"RequestId":17,"Error":"cannot read settings: EOF","Response":"'body redacted'"} Environment[""].EnvironConfig
2015-10-15 12:19:30 DEBUG juju.cmd.jujud machine.go:1604 worker "certupdater" exited with retrieving initial server addesses: EOF
2015-10-15 12:19:30 INFO juju.worker runner.go:275 stopped "certupdater", err: retrieving initial server addesses: EOF
2015-10-15 12:19:30 DEBUG juju.worker runner.go:203 "certupdater" done: retrieving initial server addesses: EOF
2015-10-15 12:19:30 INFO juju.cmd.jujud util.go:139 error pinging *state.State: EOF
...
...
2015-10-15 12:19:33 ERROR juju.cmd.jujud util.go:217 closeWorker: close error: closing state failed: error stopping transaction watcher: watcher iteration error: EOF
2015-10-15 12:19:33 INFO juju.worker runner.go:275 stopped "state", err: retrieving initial server addesses: EOF
2015-10-15 12:19:33 DEBUG juju.worker runner.go:203 "state" done: retrieving initial server addesses: EOF
2015-10-15 12:19:33 ERROR juju.worker runner.go:223 exited "state": retrieving initial server addesses: EOF
2015-10-15 12:19:33 INFO juju.worker runner.go:261 restarting "state" in 3s
2015-10-15 12:19:33 DEBUG juju.storage managedstorage.go:239 resource catalog entry created with id "6c58819968776a22d1adc1c9f5a60927aebc6c57cdc284e9dd189f18b6893a32d18b9ca807dbf704197b91039096431d"
2015-10-15 12:19:33 ERROR juju.apiserver tools.go:59 GET(/environment/5c8be479-220d-49fd-851e-507221728dfc/tools/1.24.6-trusty-amd64?%3Aenvuuid=5c8be479-220d-49fd-851e-507221728dfc&%3Aversion=1.24.6-trusty-amd64&) failed: error fetching tools: error caching tools: cannot store tools tarball: cannot add resource "environs/5c8be479-220d-49fd-851e-507221728dfc/tools/1.24.6-trusty-amd64-a31004660b3d816789d6eaf2e16b7c9f553cd9098e2c37dc6499b8a4c1c29e4d" to store at storage path "2566ed00-a2b8-4609-8f59-caf009605236": failed to write data: EOF
2015-10-15 12:19:33 DEBUG juju.apiserver tools.go:119 sending error: 400 error fetching tools: error caching tools: cannot store tools tarball: cannot add resource "environs/5c8be479-220d-49fd-851e-507221728dfc/tools/1.24.6-trusty-amd64-a31004660b3d816789d6eaf2e16b7c9f553cd9098e2c37dc6499b8a4c1c29e4d" to store at storage path "2566ed00-a2b8-4609-8f59-caf009605236": failed to write data: EOF

So the entire Juju model and blobstore mongo databases appear to have become corrupt.

Need to look into why.
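
A sanity check on the blobstore can be run from the state server. This is a
sketch only: the port, database and collection names are assumptions based on
juju 1.24's blobstore layout (GridFS prefix "blobstore", resource catalog in
"storedResources"), and the admin password comes from machine 0's agent.conf:

$ mongo --ssl -u admin -p "$ADMIN_PASSWORD" localhost:37017/admin
> use blobstore
> // the GridFS file the apiserver failed to open, taken from the log above
> db.getCollection("blobstore.files").find({filename: "abafdc81-364f-4b92-8565-3415328319a1"})
> // each resource catalog entry should reference an existing GridFS file
> db.storedResources.count()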

As an aside, unit agents are unnecessarily bouncing:

ceilometer-hacluster/2 - last logged messages:

2015-10-15 12:33:26 INFO juju.worker.uniter uniter.go:144 unit "ceilometer-hacluster/2" shutting down: ModeAbide: cannot set invalid status "started"
2015-10-15 12:33:26 ERROR juju.worker.uniter.filter filter.go:116 tomb: dying
2015-10-15 12:33:26 DEBUG juju.worker.uniter runlistener.go:97 juju-run listener stopping
2015-10-15 12:33:26 DEBUG juju.worker.uniter runlistener.go:117 juju-run listener stopped
2015-10-15 12:33:26 ERROR juju.worker runner.go:218 exited "uniter": ModeAbide: cannot set invalid status "started"
2015-10-15 12:33:26 INFO juju.worker runner.go:252 restarting "uniter" in 3s

The restart is unfortunate: it is due to an unrecognised status value "started" being set, which should not cause an agent restart.

----------------------------

* Test upgrade path 1.22.6 -> 1.24.6 with ignore-machine-addresses

- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju set-env ignore-machine-addresses=True
$ juju upgrade-juju --version="1.24.6"
- After a few mins, upgrade is done
- Errors:
+ many hook errors
+ some units have not been upgraded
+ some units don't seem to have a public address set at all
$ juju ssh mysql/0
ERROR unit "mysql/0" has no internal address

Initial analysis:

- logs show units in error have failed to run config changed hook due to

2015-10-19 09:44:13 DEBUG worker.uniter.jujuc server.go:159 hook context id "ceilometer/0-config-changed-6086224491777868053"; dir "/var/lib/juju/agents/unit-ceilometer-0/charm"
2015-10-19 09:44:13 INFO config-changed error: private-address not set
2015-10-19 09:44:13 INFO config-changed Traceback (most recent call last):
2015-10-19 09:44:13 INFO config-changed File "/var/lib/juju/agents/unit-ceilometer-0/charm/hooks/config-changed", line 333, in <module>
2015-10-19 09:44:13 INFO config-changed hooks.execute(sys.argv)

When ignore-machine-addresses is set to true, the machiner worker clears the addresses recorded against the machine when the agent restarts after the upgrade. Because addresses were previously set, the preferred private address is also cleared. Thus when the unit asks for its private address, it gets back an empty value and the hooks fail.
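
The symptom can be checked from outside a hook context; a sketch, assuming
juju run is available and reusing the unit name from the traceback above:

$ juju run --unit ceilometer/0 'unit-get private-address'

An empty result here reproduces the "private-address not set" hook failure.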

* Test upgrade path 1.20.14 -> 1.24.6

* Case 2: with ignore-machine-addresses=true

- Redeploying staging with standard HA cloud, VIP in low IP range
- Locally upgrade to 1.24.6 via apt
- Upgrade agents
$ juju set-env ignore-machine-addresses=true
$ juju set-env tools-url=https://streams.canonical.com/juju/tools
$ juju upgrade-juju --version="1.24.6"
- Upgrade completes (i.e. the apiserver accepts connections again)
after a few mins
- Errors:
+ hook errors on 56 of 102 units
+ 5 units have not been upgraded
+ some units don't seem to have a public address set at all
$ juju ssh mysql/0
ERROR unit "mysql/0" has no internal address

Initial analysis:

Logs show the same root cause as the previous 1.22 -> 1.24 upgrade:
the machine private address is being reset.

Revision history for this message
Horacio Durán (hduran-8) wrote :

I had a long look at the logs and talked with the bootstack people who originally reported the error. To have certainty we will need the db (which will be provided tomorrow) and also some stats of the machine while running the upgrade, such as the free RAM.

I have a couple of working theories. From matching the EOF error lines against the files in the actual code, the agent seems to be 1.20 at that point. I think it could be one of:

- Not enough memory.
- The agent not being able to authenticate.
- Related to the previous one: the db is upgraded but the agent still runs the old version.

I cannot say more without a closer inspection of the db, given how old the version we are upgrading from is.

Revision history for this message
Michael Foord (mfoord) wrote :

I've done test upgrades, with a deployed unit of mysql, of 1.20 -> 1.24 and 1.22 -> 1.24 with "ignore-machine-addresses" on and 1.22 -> 1.24 with "ignore-machine-addresses" off. They all worked fine and, even after some time, continued to report the correct addresses.

Revision history for this message
Ian Booth (wallyworld) wrote :

Sadly that sort of test is not sufficient. The issues seen with machine addresses would be timing-dependent, and deploying an OpenStack bundle with many units in containers may well be the only way to reproduce them. The total clearing out of machine addresses will indeed cause a unit's get private-address call to return empty if made at an unfortunate time, before any machine addresses have been set again.

Revision history for this message
Michael Foord (mfoord) wrote :

On further investigation, with an upgrade from 1.20 -> 1.24.6 I can *usually* get a situation where the machine agent does *not* report the new version (so apparently hasn't been upgraded) but the unit agent does.

Revision history for this message
Horacio Durán (hduran-8) wrote :

Michael Foord managed to reproduce this and there was a relevant exchange:

<voidspace> perrito666: not really, I can reproduce an issue - when I upgrade from 1.20 to 1.24 *most* of the time (but not always) the machine agent fails to upgrade version
<voidspace> perrito666: the unit agent reports the correct new version, but not the machine agent
<voidspace> perrito666: however, I can't reproduce the bug as described (missing address or corrupted db)
<voidspace> perrito666: this is with a deployed mongo unit and ignore-machine-addresses on
<perrito666> voidspace: maybe you can help me a bit, from reading at the logs, It seems to me that the juju binary in use is in fact the old one
<voidspace> perrito666: it would be weird for the machine agent and unit agent to be from different binaries
<voidspace> but that's what status is reporting

Revision history for this message
Andrew Wilkins (axwalk) wrote :

There are a few things that jump out at me from the logs, but nothing conclusive. If machine-2.log is still available, it would be useful to see if there's anything in there that hasn't made its way into all-machines.log.

machine-2 is running the database-master upgrade steps. However, the last one it logs (in all-machines.log) is:
    "running upgrade step: change updated field on statushistory from time to int"

There should be a few more database-master ones after that, and then a bunch of all-machine steps. Furthermore, there's no logging to say that "All upgrade steps completed successfully" on machine-2, but there is on machine-0 and machine-1. Those two machines also claim that the database master (machine-2) finished upgrading.

The next odd thing is that the last upgrade step for machine-2 is logged at 2015-10-15 12:19:25. A handful of logs later, time moves along... and then goes back in time. And when it goes back in time, it's doing machine-provisioner things which suggests that upgrades *have* finished. Maybe time going backwards is due to NTP updates; I don't really know. In any case, there's missing logging.

Revision history for this message
Michael Foord (mfoord) wrote :

Note that the "preferred address" work landed in 1.24.7, which is why the "no address" failure message is different between 1.24.6 and 1.24.7; however, the underlying problem (no internal address) is the same, just reported differently. In the 1.24.7 logs the preferred address upgrade step runs ok.

"ignore-machine-addresses" only takes effect after the machine 0 upgrade has completed and the machiner starts. (This happens before the unit agent starts.)

From looking at the status output it seems that most (but not all) of the units with errors are containers. It's possible that "ignore-machine-addresses" is broken for containers. We will investigate.

Revision history for this message
Michael Foord (mfoord) wrote :

If I deploy a model with an lxc container and ignore-machine-addresses on, the upgrade (1.20.14 -> 1.24.6) fails, and ssh into the lxc container unit fails with "no internal address". So it does seem that "ignore-machine-addresses" is broken for lxc containers. This isn't the only problem, but it seems to be part of it. A minimal reproduction is sketched below.
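
A sketch of that reproduction; the commands all appear elsewhere in this
report, but the charm and container placement are illustrative:

$ juju deploy mysql --to lxc:0
$ juju set-env ignore-machine-addresses=true
$ juju upgrade-juju --version="1.24.6"
$ juju ssh mysql/0
ERROR unit "mysql/0" has no internal address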

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Let's file a separate bug for the (now confirmed) issue with ignore-machine-addresses: true inside containers. The problem is that ignoring the addresses the machine agent sees on the instance is fine only because we still have provider-reported addresses (e.g. node details from MAAS). For containers, which the provider (obviously) doesn't know about, the addresses we see on the machine should never be ignored.

Peter, we have a possible solution and we can provide a tarball with patched binaries you could try. However, we were having issues with both the 1.20 -> 1.24.8.1 and 1.22 -> 1.24.8.1 upgrades (version change due to --upload-tools) because of an 'invalid series "wily"' error from the apiserver. Trying apt-get upgrade and dist-upgrade on both the client machine and the bootstrap node, as well as using --series with upgrade-juju --upload-tools, didn't solve the problem, so we can't actually verify the fix yet. Since you don't seem to have that issue, though, the proposed patch might work for you.

Michael, can you push what you have in the WIP branch with the fix please?

Revision history for this message
Michael Foord (mfoord) wrote :

Containers are broken with "ignore-machine-addresses" (no upgrade needed to demonstrate it). Bug raised:
https://bugs.launchpad.net/juju-core/+bug/1509292

Revision history for this message
Michael Foord (mfoord) wrote :

Proposed fix for containers and ignore-machine-addresses here (targeting 1.24 initially): https://github.com/voidspace/juju/tree/1509292-ignore-machine-addresses-1.24

Revision history for this message
Horacio Durán (hduran-8) wrote :

I re-checked my steps and found the following:

* In the logs there is no mention of the upgrade to 1.24.x, so this error might even be happening beforehand on machine-0.

I bootstrapped an env, placed a copy of the /var/lib/juju folder obtained from bootstack in place of the one just bootstrapped, modified the db according to the steps from https://github.com/juju/juju/blob/1.25/cmd/plugins/juju-restore/restore.go#L159, and finally started juju and mongo.

In those conditions I could reproduce the errors from the logs (running 1.24.x in this case, even though the logs from bootstack don't seem to indicate the update ever took place).

So I ran:

* sudo start juju-db
* sudo start jujud-machine-0

The logs for juju machine 0 started outputting errors like:
2015-10-15 09:42:11 INFO juju.cmd.jujud agent.go:177 error pinging *state.State: EOF

Exactly like the reported issue

The logs for mongo, which we did not get from bootstack, had a lot of:

Oct 23 19:42:52 ip-10-9-140-230 mongod.37017[6702]: Fri Oct 23 19:42:52.543 [conn6] presence.presence.pings Deleted record list corrupted in bucket 8, link number 3, invalid link is 302816:49d20, throwing Fatal Assertion
Oct 23 19:42:52 ip-10-9-140-230 mongod.37017[6702]: Fri Oct 23 19:42:52.543 [conn6] presence.presence.pings Fatal Assertion 16469
Oct 23 19:42:52 ip-10-9-140-230 mongod.37017[6702]: Fri Oct 23 19:42:52.549 [conn6] #012#012***aborting after fassert() failure#012#012

Oct 23 19:44:56 ip-10-9-140-230 mongod.37017[6976]: Fri Oct 23 19:44:56.428 [conn10] juju.txns Deleted record list corrupted in bucket 4, link number 1, invalid link is 2425504:250020, throwing Fatal Assertion
Oct 23 19:44:56 ip-10-9-140-230 mongod.37017[6976]: Fri Oct 23 19:44:56.428 [conn10] juju.txns Fatal Assertion 16469
Oct 23 19:44:56 ip-10-9-140-230 mongod.37017[6976]: Fri Oct 23 19:44:56.434 [conn10] #012#012***aborting after fassert() failure#012#012

which caused mongo to crash and juju to get EOF as the connection dropped.

I ran a repairDatabase() on each of juju's dbs and the juju agent started correctly.
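
For reference, a sketch of that repair, run on the state server. The port and
credentials are assumptions from a standard juju 1.x setup (the admin password
lives in machine 0's agent.conf), and mongo may need to be started outside the
replica set for repairDatabase() to be allowed:

$ mongo --ssl -u admin -p "$ADMIN_PASSWORD" localhost:37017/admin --eval '
  db.getMongo().getDBNames().forEach(function (name) {
    // repairDatabase() rebuilds the data files; it needs free disk space
    // roughly equal to the size of the database being repaired
    print(name + ": " + tojson(db.getSiblingDB(name).repairDatabase()));
  });
'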

I recommended that bootstack obtain a copy of the syslog from the sample deploy so we can see exactly what happened to mongo before it became corrupted.

Changed in juju-core:
assignee: nobody → Michael Foord (mfoord)
Revision history for this message
Cheryl Jennings (cherylj) wrote :

For clarification - we are currently waiting on the syslog from a sample deploy from the bootstack team to help debug the mongo issues seen.

Changed in juju-core:
assignee: Michael Foord (mfoord) → Horacio Durán (hduran-8)
Revision history for this message
Michael Foord (mfoord) wrote :

Note that part of the problem they saw was that containers were broken with "ignore-machine-addresses". This is now fixed on 1.24+. Unfortunately, due to another bug (only fixed in 1.25+ by Ian Booth), we can't supply them with binaries to test, as "juju upgrade-juju" with "--upload-tools" is broken (because Wily is an unknown series):

    https://bugs.launchpad.net/juju-core/+bug/1403689

There's also another (possibly) relevant bug (upgrades broken - currently only blocking 1.22):

    https://bugs.launchpad.net/juju-core/+bug/1510952

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.26-alpha1 → 1.26-alpha2
Revision history for this message
Michael Foord (mfoord) wrote :

1.26-alpha1 is now in the devel channel, so it should be possible to test that.

The problems in the case you tested of 1.22 -> 1.24 with "ignore-machine-addresses" were consistent with the container machine address issue being the main problem. So it's possible that a 1.22 -> 1.26-alpha1 (and therefore 1.25.1 / 1.24.8, when they're released) upgrade will work.

Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

The Mongo corruption has not been addressed yet; we're waiting on a syslog to do so.

Changed in juju-core:
assignee: Horacio Durán (hduran-8) → Wayne Witzel III (wwitzel3)
Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Ian Booth (wallyworld) wrote :

RE: the issue with invalid binary version numbers when juju upgrade-juju is run

The flow is that the juju command asks the agent for all known tools so it can pick the ones to upgrade to. The agent will return all tools cached in state as well as those from simplestreams. For some reason, the tools in state were cached with an empty series in the metadata. When these records were shipped back to the upgrade command, it complained.

Clearing the cached tools metadata in state resolved the invalid binary version issue. The upgrade command gets further along, printing what tools it will use. The next step is for it to write the new agent-version setting to state to trigger the upgrade. But before it does this, the client exits with a connection closed error. This is the next issue to diagnose.
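
For reference, a sketch of inspecting and clearing that cache. The database
and collection names ("juju" / "toolsmetadata") are assumptions based on juju
1.24's toolstorage, so verify them against your version before removing
anything:

$ mongo --ssl -u admin -p "$ADMIN_PASSWORD" localhost:37017/admin
> use juju
> // look for cached entries whose version has an empty series
> db.toolsmetadata.find({}, { version: 1 })
> // clearing the cache forces tools to be re-fetched from simplestreams
> db.toolsmetadata.remove({})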

Revision history for this message
Ian Booth (wallyworld) wrote :

Running upgrade-juju --version 1.26-alpha1 again results in an api call to the state server agent to get the available tools metadata. Tailing the logs shows this does go out to simplestreams as expected. But the data logged does not match what is in the sjson files. The json files clearly have metadata for tools 1.26-alpha1 (and also 1.25), but the logged data shows only tools data up to 1.24 being read:

machine-1: 2015-11-18 05:19:18 DEBUG juju.environs.simplestreams simplestreams.go:923 finding products at path "streams/v1/com.ubuntu.juju-devel-tools.sjson"
machine-1: 2015-11-18 05:19:18 INFO juju.utils http.go:60 hostname SSL verification enabled
machine-1: 2015-11-18 05:19:18 DEBUG juju.environs.simplestreams simplestreams.go:961 metadata: &{map[com.ubuntu.juju:12.04:i386:{ 1.21-alpha1 i386 map[20151105:0xc213673f60]} com.ubuntu.juju:14.04:arm64:{ 1.21-alpha1 arm64 map[20151105:0xc2115b5300]} com.ubuntu.juju:14.10:i386:{ 1.21-alpha1 i386 map[20151105:0xc21107f060]} com.ubuntu.juju:15.04:arm64:{ 1.21-alpha3 arm64 map[20151105:0xc21107f6c0]} com.ubuntu.juju:win8:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc21077e780]} com.ubuntu.juju:15.04:ppc64el:{ 1.21-alpha3 ppc64el map[20151105:0xc211470900]} com.ubuntu.juju:15.10:ppc64:{ 1.24.1 ppc64 map[20151105:0xc211f00360]} com.ubuntu.juju:15.10:ppc64el:{ 1.24.1 ppc64el map[20151105:0xc211f00480]} com.ubuntu.juju:14.04:ppc64:{ 1.21-alpha1 ppc64 map[20151105:0xc2115b5960]} com.ubuntu.juju:15.04:armhf:{ 1.21-alpha3 armhf map[20151105:0xc21107fba0]} com.ubuntu.juju:15.10:i386:{ 1.24.1 i386 map[20151105:0xc211f00240]} com.ubuntu.juju:win2012r2:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc211f00ba0]} com.ubuntu.juju:win7:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc211f00cc0]} com.ubuntu.juju:14.10:amd64:{ 1.21-alpha1 amd64 map[20151105:0xc2115b5ba0]} com.ubuntu.juju:centos7:amd64:{ 1.24-beta5 amd64 map[20151105:0xc211f005a0]} com.ubuntu.juju:12.04:armhf:{ 1.21-alpha1 armhf map[20151105:0xc213673e40]} com.ubuntu.juju:14.04:amd64:{ 1.21-alpha1 amd64 map[20151105:0xc2115b51e0]} com.ubuntu.juju:14.10:arm64:{ 1.21-alpha1 arm64 map[20151105:0xc2115b5cc0]} com.ubuntu.juju:15.04:ppc64:{ 1.21-alpha3 ppc64 map[20151105:0xc21107ff60]} com.ubuntu.juju:15.10:amd64:{ 1.24.1 amd64 map[20151105:0xc211f00a80]} com.ubuntu.juju:14.04:i386:{ 1.21-alpha1 i386 map[20151105:0xc2115b5840]} com.ubuntu.juju:14.10:ppc64:{ 1.21-alpha1 ppc64 map[20151105:0xc21107f2a0]} com.ubuntu.juju:win2012hv:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc211f007e0]} com.ubuntu.juju:win81:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc21077e1e0]} com.ubuntu.juju:14.10:ppc64el:{ 1.21-alpha1 ppc64el map[20151105:0xc21107f3c0]} com.ubuntu.juju:15.10:arm64:{ 1.24.1 arm64 map[20151105:0xc211f00000]} com.ubuntu.juju:win2012hvr2:amd64:{ 1.21-alpha3 amd64 map[20151105:0xc211f00960]} com.ubuntu.juju:12.04:amd64:{ 1.21-alpha1 amd64 map[20151105:0xc213673d20]} com.ubuntu.juju:14.04:armhf:{ 1.21-alpha1 armhf map[20151105:0xc2115b5660]} com.ubuntu.juju:14.04:ppc64el:{ 1.21-alpha1 ppc64el map[20151105:0xc2115b5a80]} com.ubuntu.juju:14.10:armhf:{ 1.21-alpha1 armhf map[20151105:0xc2115b5f00]} com.ubuntu.juju:15.04:amd6...


Revision history for this message
Ian Booth (wallyworld) wrote :

The problem is that the simplestreams metadata on streams.canonical.com is incorrect. This is very serious, and from memory it has also happened once before. I'm not sure if it's the exact same problem as before.

Below is a snippet from https://streams.canonical.com/juju/tools/streams/v1/com.ubuntu.juju-devel-tools.sjson

The top level version attribute "version": "1.21-alpha1" masks all the versions in the contained maps, so that only version 1.21-alpha1 is visible to juju. This attribute must not be put there by the metadata generation scripts.

        "com.ubuntu.juju:12.04:amd64": {
            "version": "1.21-alpha1",
            "arch": "amd64",
            "versions": {
                "20151105": {
                    "items": {
                        "1.21-alpha1-precise-amd64": {
                            "release": "precise",
                            "version": "1.21-alpha1",
                            "arch": "amd64",
                            "size": 8509740,
                            "path": "devel/juju-1.21-alpha1-precise-amd64.tgz",
                            "ftype": "tar.gz",
                            "sha256": "0a8353fc8b6e99dccf04e67fdfc3834db47de673ff6a10249c18ff745af56def"
                        },
                        "1.21-alpha2-precise-amd64": {
                            "release": "precise",
                            "version": "1.21-alpha2",
                            "arch": "amd64",
                            "size": 8874880,
                            "path": "devel/juju-1.21-alpha2-precise-amd64.tgz",
                            "ftype": "tar.gz",
                            "sha256": "ffc0e56aff09933d37cee6d6af0029f353abcb399190ec85d2e255057a304275"
                        },

Revision history for this message
Ian Booth (wallyworld) wrote :

Trying to upgrade with --upload-tools results in a different problem. The connection between client and agent is shut down after writing the uploaded tools to the mongo blobstore.

Server:
machine-1: 2015-11-18 05:42:57 DEBUG juju.storage managedstorage.go:291 managed resource entry created with path "environs/d363cb36-36fb-4456-891a-60d21c30d171/tools/1.26-alpha1.1-saucy-amd64-e02aaa91cfadf0d5f809d9a01ed3b2cb9504b094bf829d0b395eb3498e1dc427" -> "d01ed56f4ba79e7e88e01bbbf24a7f143fe62358b509dfae996111f5f6ef496306bee0d11e1bbf3e74d703069913ca15"
machine-1: 2015-11-18 05:42:58 DEBUG juju.state.toolstorage tools.go:119 removed old tools blob
machine-1: 2015-11-18 05:42:58 INFO juju.apiserver apiserver.go:280 [2E8] user-admin@local API connection terminated after 21.647695548s
machine-1: 2015-11-18 05:43:08 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"

Client:
./juju upgrade-juju --debug --upload-tools
2015-11-18 05:42:58 INFO cmd cmd.go:129 available tools:
    1.26-alpha1.1-trusty-amd64
2015-11-18 05:42:58 INFO cmd cmd.go:129 best version:
    1.26-alpha1.1
2015-11-18 05:42:58 ERROR cmd supercommand.go:444 connection is shut down

Revision history for this message
Curtis Hovey (sinzui) wrote :

juju metadata generate-tools never updates the version. We take a snapshot of all the stream metadata before we release, in case we want to roll back. We have snapshots going back to 2014-11-14. We can see juju always selects the first and oldest version of agents found, and places that version as the version for the series-arch.

So there is no reason to roll back, because the streams are consistent, and validation of the streams confirms that the value doesn't change. I think the juju metadata plugin needs fixing.

As the QA team intends to stop using the metadata plugin, our replacement tool can place the *last* and most recent version of agents found in the version fields.

Revision history for this message
Aaron Bentley (abentley) wrote :

The top-level values function as *defaults*, not *overrides* in simplestreams:
$ sstream-query https://streams.canonical.com/juju/tools/streams/v1/com.ubuntu.juju-devel-tools.sjson item_name=1.21-alpha2-precise-amd64 --no-verify --output-format='%(version)s'
1.21-alpha2

Revision history for this message
Wayne Witzel III (wwitzel3) wrote :

If I use upload-tools with 1.22.8, like the staging environment has, I am then unable to find a matching tools version when trying to issue a juju upgrade-juju, even if I set the agent-stream and the tools-url.

    juju version
        1.22.8-trusty-amd64
    juju bootstrap --upload-tools

    juju status
        agent-version: 1.22.8.1

    juju set-env agent-stream=devel
    juju set-env tools-url=https://streams.canonical.com/juju/tools

    juju upgrade-juju --version 1.26-alpha1
        2015-11-18 17:03:58 ERROR juju.cmd supercommand.go:430 no matching tools available

Revision history for this message
Wayne Witzel III (wwitzel3) wrote :

I opened this bug for the upload-tools / upgrade-juju issue. https://bugs.launchpad.net/juju-core/+bug/1517632

Revision history for this message
Ian Booth (wallyworld) wrote :

Regarding the tools metadata logging above: the inclusion of the 1.21 version numbers in the tools data coming back from simplestreams is misleading but not the issue, as Aaron and Curtis have mentioned. Further investigation into the root cause is happening.

Revision history for this message
Ian Booth (wallyworld) wrote :

I've done some more digging, and there does appear to be an issue with simplestreams metadata retrieval. Below is the output of extra debugging that prints what juju actually retrieves from the devel stream; it is the result of juju receiving and parsing the metadata it has just fetched. There are no modern tools in that list.

machine-0: 2015-11-19 03:06:47 INFO juju.apiserver.common tools.go:279 SIMPLESTREAMS LIST: 1.21-alpha1-precise-amd64;1.21-alpha1-precise-armhf;1.21-alpha1-precise-i386;1.21-alpha1-trusty-amd64;1.21-alpha1-trusty-arm64;1.21-alpha1-trusty-armhf;1.21-alpha1-trusty-i386;1.21-alpha1-trusty-ppc64el;1.21-alpha1-utopic-amd64;1.21-alpha1-utopic-arm64;1.21-alpha1-utopic-armhf;1.21-alpha1-utopic-i386;1.21-alpha1-utopic-ppc64el;1.21-alpha2-precise-amd64;1.21-alpha2-precise-armhf;1.21-alpha2-precise-i386;1.21-alpha2-trusty-amd64;1.21-alpha2-trusty-arm64;1.21-alpha2-trusty-armhf;1.21-alpha2-trusty-i386;1.21-alpha2-trusty-ppc64el;1.21-alpha2-utopic-amd64;1.21-alpha2-utopic-arm64;1.21-alpha2-utopic-armhf;1.21-alpha2-utopic-i386;1.21-alpha2-utopic-ppc64el;1.21-alpha3-precise-amd64;1.21-alpha3-precise-armhf;1.21-alpha3-precise-i386;1.21-alpha3-trusty-amd64;1.21-alpha3-trusty-arm64;1.21-alpha3-trusty-armhf;1.21-alpha3-trusty-i386;1.21-alpha3-trusty-ppc64el;1.21-alpha3-utopic-amd64;1.21-alpha3-utopic-arm64;1.21-alpha3-utopic-armhf;1.21-alpha3-utopic-i386;1.21-alpha3-utopic-ppc64el;1.21-alpha3-vivid-amd64;1.21-alpha3-vivid-arm64;1.21-alpha3-vivid-armhf;1.21-alpha3-vivid-i386;1.21-alpha3-vivid-ppc64el;1.21-alpha3-win2012-amd64;1.21-alpha3-win2012hv-amd64;1.21-alpha3-win2012hvr2-amd64;1.21-alpha3-win2012r2-amd64;1.21-alpha3-win7-amd64;1.21-alpha3-win8-amd64;1.21-alpha3-win81-amd64;1.21-beta1-precise-amd64;1.21-beta1-precise-armhf;1.21-beta1-precise-i386;1.21-beta1-trusty-amd64;1.21-beta1-trusty-arm64;1.21-beta1-trusty-armhf;1.21-beta1-trusty-i386;1.21-beta1-trusty-ppc64el;1.21-beta1-utopic-amd64;1.21-beta1-utopic-arm64;1.21-beta1-utopic-armhf;1.21-beta1-utopic-i386;1.21-beta1-utopic-ppc64el;1.21-beta1-vivid-amd64;1.21-beta1-vivid-arm64;1.21-beta1-vivid-armhf;1.21-beta1-vivid-i386;1.21-beta1-vivid-ppc64el;1.21-beta1-win2012-amd64;1.21-beta1-win2012hv-amd64;1.21-beta1-win2012hvr2-amd64;1.21-beta1-win2012r2-amd64;1.21-beta1-win7-amd64;1.21-beta1-win8-amd64;1.21-beta1-win81-amd64;1.21-beta2-precise-amd64;1.21-beta2-precise-armhf;1.21-beta2-precise-i386;1.21-beta2-trusty-amd64;1.21-beta2-trusty-arm64;1.21-beta2-trusty-armhf;1.21-beta2-trusty-i386;1.21-beta2-trusty-ppc64el;1.21-beta2-utopic-amd64;1.21-beta2-utopic-arm64;1.21-beta2-utopic-armhf;1.21-beta2-utopic-i386;1.21-beta2-utopic-ppc64el;1.21-beta2-vivid-amd64;1.21-beta2-vivid-arm64;1.21-beta2-vivid-armhf;1.21-beta2-vivid-i386;1.21-beta2-vivid-ppc64el;1.21-beta2-win2012-amd64;1.21-beta2-win2012hv-amd64;1.21-beta2-win2012hvr2-amd64;1.21-beta2-win2012r2-amd64;1.21-beta2-win7-amd64;1.21-beta2-win8-amd64;1.21-beta2-win81-amd64;1.21-beta3-precise-amd64;1.21-beta3-precise-armhf;1.21-beta3-precise-i386;1.21-beta3-trusty-amd64;1.21-beta3-trusty-arm64;1.21-beta3-trusty-armhf;1.21-beta3-trusty-i386;1.21-beta3-trusty-ppc64el;1.21-beta3-utopic-amd64;1.21-beta3-utopic-arm64;1.21-beta3-utopic-armhf;1.21-beta3-utopic...


Revision history for this message
Ian Booth (wallyworld) wrote :

The above problematic behaviour is confirmed by extra logging in the upgrade-juju command - see https://bugs.launchpad.net/juju-core/+bug/1517632/comments/3

Changed in juju-core:
milestone: 1.26-alpha2 → 1.26-beta1
Revision history for this message
Cheryl Jennings (cherylj) wrote :

The bugs that have been spun off of this meta-bug are:

Bug #1509292 - "ignore-machine-addresses" broken for containers
Bug #1516150 - LXC containers getting HA VIP addresses after reboot
Bug #1517632 - juju upgrade-juju after upload-tools fails

The only remaining issue was the possible mongodb corruption, which was seen once. If corruption is encountered again, a new bug should be opened to track that issue separately.

Changed in juju-core:
status: In Progress → Incomplete
milestone: 1.26-beta1 → none
Curtis Hovey (sinzui)
Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired