upgrading from 1.18.1 to 1.19 (trunk) fails (API server stops responding)

Bug #1307450 reported by John A Meinel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
Critical
John A Meinel

Bug Description

As seen in CI:
http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/aws-upgrade/1139/console

 PATH=/mnt/jenkinshome/jobs/aws-upgrade/workspace/extracted-bin/usr/lib/juju-1.19.0/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
+ /var/lib/jenkins/ci-cd-scripts2/wait_for_agent_update.py test-release-aws
1.18.0: 1, 0, 2, mysql/0, wordpress/0
1.18.0: 1, 0, 2, mysql/0, wordpress/0
1.18.0: 1, 0, 2, mysql/0, wordpress/0
1.18.0: 1, 0, 2, mysql/0, wordpress/0
1.18.0: 1, 0, 2, mysql/0, wordpress/0
Timed out waiting for juju status to succeed: Command '('juju', '--show-log', 'status', '-e', 'test-release-aws')' returned non-zero exit status 1
2014-04-13 19:22:15 INFO juju.cmd supercommand.go:297 running juju-1.18.0-precise-amd64 [gc]
2014-04-13 19:22:15 INFO juju api.go:238 connecting to API addresses: [ec2-54-86-35-211.compute-1.amazonaws.com:17070]
2014-04-13 19:22:15 INFO juju apiclient.go:114 state/api: dialing "wss://ec2-54-86-35-211.compute-1.amazonaws.com:17070/"
2014-04-13 19:22:16 ERROR juju apiclient.go:119 state/api: websocket.Dial wss://ec2-54-86-35-211.compute-1.amazonaws.com:17070/: dial tcp 172.31.47.92:17070: connection refused
...

Tags: ci

Related branches

John A Meinel (jameinel)
summary: - upgrading from 1.18.0 to 1.19 (trunk) fails (API server stops
+ upgrading from 1.18.1 to 1.19 (trunk) fails (API server stops
responding)
Revision history for this message
John A Meinel (jameinel) wrote :

So I tried to reproduce this with:

juju-1.18 bootstrap -e amz-jam --debug
juju-1.18 upgrade-juju -e amz-jam --version 1.18.0
# add units, but this should actually be unrelated to the bugs I encountered
juju-1.19 upgrade-juju -e amz-jam --upload-tools

At this point, the server notices that it wants to upgrade, but fails to actually do so. It appears that the tools get upload to s3, but the URLs that are created to download them fail. I can't get them to work when I manually copy them into a web browser.

I do have the --debug machine-0.log and all-machines.log available, but I don't have time to sanitize them to put them on a bug report.

Revision history for this message
John A Meinel (jameinel) wrote :

failing machine-0.log

the upload starts around line 1737
2014-04-14 14:29:34 DEBUG juju.state.apiserver tools.go:90 request to upload tools 1.19.0.1-trusty-amd64 for series "trusty,precise"
It appears to succeed here:
2014-04-14 14:30:29 INFO juju.environs.tools simplestreams.go:377 Writing tools/streams/v1/index.json
2014-04-14 14:30:29 INFO juju.environs.tools simplestreams.go:377 Writing tools/streams/v1/com.ubuntu.juju:released:tools.json
2014-04-14 14:30:29 INFO juju.environs.sync sync.go:127 tools metadata written
2014-04-14 14:30:29 DEBUG juju.state.apiserver apiserver.go:120 <- [10] user-admin {"RequestId":6,"Type":"Client","Request":"SetEnvironAgentVersion","Params":{"Version":"1.19.0.1"}}

However, I see here:
2014-04-14 14:30:33 DEBUG juju.environs.simplestreams simplestreams.go:468 read metadata index at "https://s3.amazonaws.com/0f30e0d6cf4547df8b786e1dc2b19b4a/tools/streams/v1/index.json?AWSAccessKeyId=XXXXXX"
2014-04-14 14:30:33 DEBUG juju.environs.simplestreams simplestreams.go:476 index file has no data for product name(s) ["com.ubuntu.juju:12.04:i386"]

It is looking for i386 tools, when I uploaded amd64 tools.

So it looks like "juju-1.18 bootstrap" has started preferring i386 instances rather than amd64.... :(

Revision history for this message
John A Meinel (jameinel) wrote :

So it appears 1.19.0 is trying to connect to the database somehow differently:
2014-04-14 18:11:33 DEBUG juju.state.apiserver apiserver.go:127 -> [1] machine-0 40.970088016s {"RequestId":92,"Error":"watcher has been stopped","ErrorCode":"stopped","Response":{}} NotifyWatcher["16"].Next
2014-04-14 18:11:33 INFO juju.rpc server.go:295 error closing codec: write tcp 127.0.0.1:58053: broken pipe
2014-04-14 18:11:33 INFO juju.state.apiserver apiserver.go:135 [1] machine-0 API connection terminated after 6m45.651136547s
2014-04-14 18:11:33 INFO juju.worker.upgrader error.go:32 upgraded from 1.18.1-precise-amd64 to 1.19.0.1-precise-amd64 ("https://s3.amazonaws.com/b6bbc17f4f914a42840f3e5cf9fec7cf/tools/releases/juju-1.19.0.1-precise-amd64.tgz?AWSAccessKeyId=AKIAJQ4KQVZRZLUR6YWQ&Expires=1713118291&Signature=Ah33VHBEO41byrDWndjXB4plWvA%3D")
2014-04-14 18:11:33 ERROR juju.cmd supercommand.go:300 must restart: an agent upgrade is available
2014-04-14 18:11:34 INFO juju.cmd supercommand.go:296 running juju-1.19.0.1-precise-amd64 [gc]
2014-04-14 18:11:34 INFO juju.cmd.jujud machine.go:147 machine agent machine-0 start (1.19.0.1-precise-amd64 [gc])
2014-04-14 18:11:34 DEBUG juju.agent agent.go:365 read agent config, format "1.18"
2014-04-14 18:11:34 INFO juju runner.go:262 worker: start "api"
2014-04-14 18:11:34 INFO juju runner.go:262 worker: start "statestarter"
2014-04-14 18:11:34 INFO juju runner.go:262 worker: start "termination"
2014-04-14 18:11:34 INFO juju.state.api apiclient.go:201 dialing "wss://localhost:17070/"
2014-04-14 18:11:34 INFO juju runner.go:262 worker: start "state"
2014-04-14 18:11:34 INFO juju.agent.mongo mongo.go:189 Ensuring mongo server is running; dataDir /var/lib/juju; port 37017
2014-04-14 18:11:34 DEBUG juju.agent.mongo mongo.go:255 found mongod at: /usr/bin/mongod
2014-04-14 18:11:34 INFO juju.state.api apiclient.go:209 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2014-04-14 18:11:34 ERROR juju runner.go:220 worker: exited "api": timed out connecting to "wss://localhost:17070/"
2014-04-14 18:11:34 INFO juju runner.go:254 worker: restarting "api" in 3s
2014-04-14 18:11:34 DEBUG juju.agent.mongo mongo.go:262 mongod --version:
db version v2.4.6
Mon Apr 14 18:11:34.222 git version: nogitversion

2014-04-14 18:11:34 INFO juju.state open.go:81 opening state, mongo addresses: ["localhost:37017"]; entity "machine-0"
2014-04-14 18:11:34 DEBUG juju.state open.go:86 dialing mongo
2014-04-14 18:11:34 DEBUG juju.state open.go:92 connection established
2014-04-14 18:11:34 ERROR juju runner.go:220 worker: exited "state": cannot log in to admin database as "machine-0": unauthorized mongo access: auth fails
2014-04-14 18:11:34 INFO juju runner.go:254 worker: restarting "state" in 3s

I think the key point is that because this is an upgrade, we haven't given "machine-0" access to the "admin" database. Which means we can't try to log into that database.

So either we need to teach upgrade to give machine-0 admin rights (which we've determined to be hard to do), or we treat failures to connect to "admin" as a soft failure.

Revision history for this message
John A Meinel (jameinel) wrote :

Giving admin rights is bug #1306902, but the steps to do so are a bit convoluted, so for now we're going with "soft login".

Changed in juju-core:
milestone: 1.19.0 → 2.0
status: Triaged → In Progress
Go Bot (go-bot)
Changed in juju-core:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju-core:
milestone: 2.0 → 1.19.0
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.