juju-upgrade to 1.24.7 leaves juju state server unreachable

Bug #1517992 reported by Johan Ehnberg
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
juju-core
Won't Fix
High
Unassigned

Bug Description

Running upgrade (from 1.23.3) seems fine:
juju upgrade-juju --debug
2015-11-19 16:47:54 INFO juju.cmd supercommand.go:37 running juju [1.25.0-trusty-amd64 gc]
2015-11-19 16:47:54 DEBUG juju.api api.go:154 trying cached API connection settings - endpoints [192.168.0.50:17070]
2015-11-19 16:47:54 INFO juju.api api.go:266 connecting to API addresses: [192.168.0.50:17070]
2015-11-19 16:47:54 INFO juju.api apiclient.go:262 dialing "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-19 16:47:54 INFO juju.api apiclient.go:194 connection established to "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-19 16:47:54 DEBUG juju.api api.go:476 API hostnames unchanged - not resolving
2015-11-19 16:47:54 DEBUG juju.api api.go:506 cacheChangedAPIInfo: serverUUID="ab165d76-5703-4ec2-806f-fad6691ff339"
2015-11-19 16:47:59 DEBUG juju.cmd.juju upgradejuju.go:363 found a more recent stable version 1.24.7
2015-11-19 16:47:59 INFO juju.cmd cmd.go:129 available tools:
    1.15.0-precise-amd64
<snip>
    1.25.0-win81-amd64
2015-11-19 16:47:59 INFO juju.cmd cmd.go:129 best version:
    1.24.7
2015-11-19 16:47:59 INFO juju.cmd.juju upgradejuju.go:214 started upgrade to 1.24.7
2015-11-19 16:47:59 INFO juju.cmd supercommand.go:435 command finished

Soon on machine-0.log (and thousands of apiserver messages):
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [4] unit-nrpe-19 {"RequestId":81,"Type":"Uniter","Version":2,"Request":"Watch","Params":"'params redacted'"}
2015-11-19 16:59:03 INFO juju.worker runner.go:261 start "networker"
2015-11-19 16:59:03 INFO juju.worker runner.go:261 start "authenticationworker"
2015-11-19 16:59:03 INFO juju.networker networker.go:163 networker is disabled - not starting on machine "machine-0"
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [8] machine-0 {"RequestId":21,"Type":"Rsyslog","Request":"GetRsyslogConfig","Params":"'params redacted'"}
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [8] machine-0 {"RequestId":22,"Type":"KeyUpdater","Request":"AuthorisedKeys","Params":"'params redacted'"}
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [8] machine-0 {"RequestId":23,"Type":"NotifyWatcher","Id":"4","Request":"Next","Params":"'params redacted'"}
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [8] machine-0 {"RequestId":24,"Type":"Machiner","Request":"APIHostPorts","Params":"'params redacted'"}
2015-11-19 16:59:03 DEBUG juju.apiserver apiserver.go:267 <- [8] machine-0 {"RequestId":25,"Type":"NotifyWatcher","Id":"5","Request":"Next","Params":"'params redacted'"}
2015-11-19 16:59:03 INFO juju.mongo open.go:125 dialled mongo successfully on address "192.168.0.50:37017"
2015-11-19 16:59:03 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2015-11-19 16:59:09 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:09 ERROR juju.worker runner.go:208 fatal "api": must restart: an agent upgrade is available
2015-11-19 16:59:09 ERROR juju.worker.peergrouper worker.go:138 peergrouper loop terminated: tomb: dying
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 16:59:10 ERROR juju.worker runner.go:208 fatal "environ-provisioner": watcher has been stopped
2015-11-19 16:59:10 ERROR juju.worker runner.go:208 fatal "firewaller": watcher has been stopped
2015-11-19 16:59:10 ERROR juju.worker runner.go:219 exited "ab165d76-5703-4ec2-806f-fad6691ff339": watcher has been stopped
2015-11-19 16:59:11 ERROR juju.rpc server.go:573 error writing response: write tcp 192.168.0.79:49173: broken pipe

juju status becomes unresponsive. Running 'restart jujud-machine-0 results in about the the same errors as above:

On machine-0.log:
2015-11-19 17:05:05 DEBUG juju.apiserver apiserver.go:278 -> [59] machine-0 6.071358ms {"RequestId":8,"Response":"'body redacted'"} Environment[""].EnvironConfig
2015-11-19 17:05:05 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2015-11-19 17:05:05 ERROR juju.worker runner.go:208 fatal "upgrader": must restart: an agent upgrade is available
2015-11-19 17:05:09 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:09 ERROR juju.worker runner.go:208 fatal "api": must restart: an agent upgrade is available
2015-11-19 17:05:09 ERROR juju.worker.peergrouper worker.go:138 peergrouper loop terminated: tomb: dying
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2015-11-19 17:05:10 ERROR juju.worker runner.go:208 fatal "firewaller": watcher has been stopped
2015-11-19 17:05:10 ERROR juju.worker runner.go:208 fatal "environ-provisioner": watcher has been stopped
2015-11-19 17:05:10 ERROR juju.worker runner.go:219 exited "ab165d76-5703-4ec2-806f-fad6691ff339": watcher has been stopped

Between restarting jujud and the errors is a window of about 5 seconds when I can connect to the state server.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Trying to use that window to reset the upgrade does not help:

johan@maas:~$ juju upgrade-juju --reset-previous-upgrade --debug
2015-11-19 17:28:18 INFO juju.cmd supercommand.go:37 running juju [1.25.0-trusty-amd64 gc]
2015-11-19 17:28:18 DEBUG juju.api api.go:154 trying cached API connection settings - endpoints [192.168.0.50:17070]
2015-11-19 17:28:18 INFO juju.api api.go:266 connecting to API addresses: [192.168.0.50:17070]
2015-11-19 17:28:18 INFO juju.api apiclient.go:262 dialing "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-19 17:28:18 DEBUG juju.api apiclient.go:268 error dialing "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api", will retry: websocket.Dial wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api: dial tcp 192.168.0.50:17070: connection refused
2015-11-19 17:28:20 INFO juju.api apiclient.go:262 dialing "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-19 17:28:20 DEBUG juju.provider.maas environprovider.go:28 opening environment "maas".
2015-11-19 17:28:20 INFO juju.api apiclient.go:194 connection established to "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-19 17:28:20 DEBUG juju.environs utils.go:94 StateServerInstances returned: [/MAAS/api/1.0/nodes/node-b0c658c8-7f8e-11e4-9d45-0021702cb7de/]
2015-11-19 17:28:20 DEBUG juju.api api.go:476 API hostnames unchanged - not resolving
2015-11-19 17:28:20 DEBUG juju.api api.go:506 cacheChangedAPIInfo: serverUUID="ab165d76-5703-4ec2-806f-fad6691ff339"
2015-11-19 17:28:24 DEBUG juju.cmd.juju upgradejuju.go:363 found a more recent stable version 1.25.0
2015-11-19 17:28:24 INFO juju.cmd cmd.go:129 available tools:
    1.15.0-precise-amd64
<snip>
    1.25.0-win81-amd64
2015-11-19 17:28:24 INFO juju.cmd cmd.go:129 best version:
    1.25.0

WARNING! using --reset-previous-upgrade when an upgrade is in progress
will cause the upgrade to fail. Only use this option to clear an
incomplete upgrade where the root cause has been resolved.

Continue [y/N]? y
2015-11-19 17:28:25 ERROR juju.cmd supercommand.go:429 some agents have not upgraded to the current environment version 1.24.7: machine-0 <snip>

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Can you upload the full machine-0.log?

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

There's too much data to check to post publicly, but I can share it privately with devs. I sent you the link. Note that I use the following to work around mongod log spam:
:msg, contains, "authenticate db: " stop
:msg, contains, "warning: No such role" stop

Also, I previously filtered out apiserver messages due to the same reason (log was growing to gigabytes...)

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Is there any workaround/fix for this? Are there any tests you'd like me to do? We have uses for the machines concerned, so otherwise I will go ahead and restore the previous state.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Johan, I'm sorry I hadn't had a chance to look at this yet. In looking at the machine-0 log, I see:

2015-11-19 17:28:27 DEBUG juju.apiserver.upgrader upgrader.go:185 desired version is 1.24.7, but current version is 1.23.3 and agent is not a manager node

Which says to me that the state server doesn't think it's the state server.

- Are you running HA?
- If so:
    - when you are able to run juju status, do you see that any of the state servers have upgraded?
    - could you also upload the logs from the other state servers
- Can you attach the syslog from all state servers

If you need the environment back and you do a restore, I'd recommend going straight from 1.20.* to 1.24.7. 1.23 had some problems, including during upgrades.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Thanks for having a look! This is the only state server, i.e. no HA, in a small environment of 6 machines.

Earlier the same day I did an upgrade to 1.23 with --version to the same state server, since the default upgrade without specified version did not recognize anything newer. This was despite the MAAS apache logs showing the state server downloading everything up to 1.25 when running without --version or with --version 1.21. This may be related but if I stumble upon it again I will file it separately.

For the restore I have two alternatives, back to 1.23 with juju's restore .tgz (probably I have to do this on a clean instance, though, unless the 5-second window allows me) just before attempting the upgrade, or back to 1.20 the night before with BackupPC at file level, since juju backups was not implemented at that time. Any help either can have in testing?

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I spoke with some other folks about this, and it turns out that there is a script to help work around upgrade issues on 1.23:

https://github.com/juju/plugins/blob/master/juju-unstick-upgrade

You can see how to install the plugins in the readme here: https://github.com/juju/plugins

This may work around the issue you're seeing. If it doesn't, you can try to restore back to 1.23 using juju restore.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

juju unstick-upgrade was ineffective, the log shows no change afterwards.

juju restore started in the five second window but afterwards the state server api nonresponsive and it seems to be stuck in loop, with the state server looping the following:

2015-11-25 07:32:56 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"
2015-11-25 07:32:57 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"
2015-11-25 07:32:58 INFO juju.mongo open.go:125 dialled mongo successfully on address "127.0.0.1:37017"
2015-11-25 07:32:58 INFO juju.worker runner.go:261 start "api"
2015-11-25 07:32:58 INFO juju.api apiclient.go:261 dialing "wss://localhost:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-25 07:32:58 INFO juju.api apiclient.go:269 error dialing "wss://localhost:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api": websocket.Dial wss://localhost:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api: dial tcp 127.0.0.1:17070: connection refused
2015-11-25 07:32:58 ERROR juju.worker runner.go:219 exited "api": unable to connect to "wss://localhost:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-25 07:32:58 INFO juju.worker runner.go:253 restarting "api" in 3s

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

juju backups restore -b is also not working since the machine is still present (and in this case I will not remove it yet).

I will attempt a file restore to 1.20, to before the upgrade earlier to 1.23 the same day.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Restoring to 1.20 (all of /var/lib/juju on state server) failed with mongod unable to pick up the database:
Nov 25 09:50:18 pkcmr mongod.37017[7091]: Wed Nov 25 09:50:18.561 [conn406] Assertion: 10295:getFile(): bad file number value (corrupt db?): run repair
Nov 25 09:50:18 pkcmr mongod.37017[7091]: Wed Nov 25 09:50:18.565 [conn406] insert juju.system.indexes keyUpdates:0 exception: getFile(): bad file number value (corrupt db?): run repair code:10295 locks(micros) w:7436 3ms

Restoring back to the failed upgrade state between 1.23.3 and 1.24.7 "succeeded". At least mongod is running and I have the 5 second window again.

Tools show:
drwxr-xr-x 2 root root 4096 Nov 25 10:12 1.20.13-trusty-amd64
drwxr-xr-x 2 root root 4096 Nov 25 10:12 1.23.3-trusty-amd64
drwxr-xr-x 2 root root 4096 Nov 25 10:12 1.24.7-trusty-amd64
lrwxrwxrwx 1 root root 19 Nov 19 13:28 machine-0 -> 1.23.3-trusty-amd64
lrwxrwxrwx 1 root root 19 Nov 19 15:01 unit-juju-gui-0 -> 1.23.3-trusty-amd64
lrwxrwxrwx 1 root root 19 Nov 19 15:03 unit-nrpe-0 -> 1.23.3-trusty-amd64

Syslog mongod shows the same as in bug #1383260, a lot of these until jujud exits:
Nov 25 10:21:45 pkcmr mongod.37017[5547]: Wed Nov 25 10:21:45.041 [conn64] command juju.$cmd command: { getLastError: 1, j: true } ntoreturn:1 keyUpdates:0 reslen:101 232ms
Nov 25 10:22:06 pkcmr mongod.37017[5547]: Wed Nov 25 10:22:06.349 [conn22] command presence.$cmd command: { getLastError: 1, j: true } ntoreturn:1 keyUpdates:0 reslen:101 276ms

Since it seems the other directions I would take are not viable, I am open to suggestions and further testing of this. If we find a fix, we could propose that to be included in unstick-upgrade?

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Manually linking the machine-0 tool to 1.24.7-trusty-amd64 gave some partially positive results:

- jujud on the state server stays up (took a while for everything stabilized after some minutes of heavy I/O, CPU and RAM activity)
- juju status works, but juju status [servicename] does not, returns ERROR could not filter units: could not filter units: cannot get status: unit not found
- machine agents upgraded fine
- unit agents upgraded but show agent-state-info: 'hook failed: "config-changed"'
- juju-gui shows an empty environment
- juju upgrade-juju upgraded everything onward to 1.25.0, but with the same issues as above

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Due to juju backups create complaining about the symlink loop, I noticed these:

/var/lib/juju/tools/1.24.7-trusty-amd64:
lrwxrwxrwx 1 root root 19 Nov 25 10:59 1.24.7-trusty-amd64 -> 1.24.7-trusty-amd64

/var/lib/juju/tools/1.23.3-trusty-amd64
lrwxrwxrwx 1 root root 19 Nov 25 10:48 1.24.7-trusty-amd64 -> 1.24.7-trusty-amd64

That coincides with the first upgrade (from 1.20).

Revision history for this message
Cheryl Jennings (cherylj) wrote :

- The "could not filter units" error has been reported before, after an upgrade from 1.20 -> 1.24 (see bug #1516989). I haven't been able to determine the cause of that one yet.

- For the config-changed hook failures, can you attach the unit logs for those units?

- For the gui problem - can you attach the unit log for juju-gui? I can take a look, but will probably need to tap the juju-gui guys for help.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Here's a container on a development machine that I can post.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Mostly same as previous.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :
Download full text (21.9 KiB)

Possibly relating to bug #1516989:

I was able to run juju resolved on one of the units, and its status returned to normal. The next run of juju resolved on the same or any other unit returns:
ERROR cannot get status: unit not found

I wonder if this indicates, that the first run or first run within a specific time frame, of the filtering juju commands such as juju status [servicename] or juju resolved [unitname] work.

The log on that unit where resolved ran shows nothing out of the ordinary for the wordpress charm in this case (showing only starting from running resolved):
2015-11-26 07:40:23 INFO config-changed Reading package lists...
2015-11-26 07:40:23 INFO config-changed Building dependency tree...
2015-11-26 07:40:23 INFO config-changed Reading state information...
2015-11-26 07:40:23 INFO config-changed Package 'apache2.2-common' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-php4' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'apache' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'apache-ssl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-auth-pam' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-auth-sys-group' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'apache-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-ssi-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-mod-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-request-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-mod-jk' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-jk2' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libplack-middleware-apache2-modssl-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'mono-apache-server1' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'apache-openid' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-fastcgi' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-mod-php5' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-mod-php4' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-php4' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-auth-pam' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache2-mod-auth-sys-group' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-ssi-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Package 'libapache-mod-perl' is not installed, so not removed
2015-11-26 07:40:23 INFO config-changed Pa...

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

unit-juju-gui-0 log. juju-gui resides in the same (root) instance as the juju state server.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

I am trying to see if a new instance of juju-gui would work. Since it does not contain anything user-made I am trying to destroy the old one. Status shows:

  juju-gui:
    charm: cs:trusty/juju-gui-41
    exposed: true
    life: dying
    service-status:
      current: error
      message: 'hook failed: "config-changed"'
      since: 25 Nov 2015 12:59:55+02:00
    relations:
      juju-info:
      - nrpe
    units:
      juju-gui/0:
        workload-status:
          current: error
          message: 'hook failed: "config-changed"'
          since: 25 Nov 2015 12:59:55+02:00
        agent-status:
          current: idle
          since: 25 Nov 2015 12:59:55+02:00
          version: 1.25.0
        agent-state: error
        agent-state-info: 'hook failed: "config-changed"'
        agent-version: 1.25.0
        life: dying
        upgrading-from: cs:trusty/juju-gui-16
        machine: "0"
        public-address: pkcmr.maas.molnix.com
        subordinates:
          nrpe/0:
            workload-status:
              current: error
              message: 'hook failed: "config-changed"'
              since: 25 Nov 2015 13:23:34+02:00
            agent-status:
              current: idle
              since: 25 Nov 2015 13:23:34+02:00
              version: 1.25.0
            agent-state: error
            agent-state-info: 'hook failed: "config-changed"'
            agent-version: 1.25.0
            upgrading-from: cs:trusty/nrpe-4
            public-address: pkcmr.maas.molnix.com

Yet resolving it (with or without -r) does not help:
juju resolved -r juju-gui/0
ERROR cannot set resolved mode for unit "juju-gui/0": already resolved
juju resolved -r nrpe/0
ERROR cannot set resolved mode for unit "nrpe/0": already resolved

It would seem the juju state server is still in a bit quirky state. Are there any database migrations that were skipped by manually changing the symlink?

Revision history for this message
Paul Gear (paulgear) wrote :

I experienced this problem today on a small environment which had been upgraded from 1.20.14 -> 1.21.3 -> 1.22.8 -> 1.23.3 successfully. Several restarts of jujud-machine-0 did not help, nor did a reboot of the bootstrap node.

Changing the symlink in /var/lib/juju/tools for machine-0 from 1.23.3-trusty-amd64 to 1.24.7-trusty-amd64 seems to have been effective, though, because after adjusting the symlink & restarting jujud-machine-0, everything came up except for a couple of config-changed & leader-election hook errors which disappeared after a "juju resolved --retry".

Is there anything else I need to do to ensure health of the environment? I've kept all logs for review, if any devs would like to look at them.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Another symptom that lingers is that 'juju run' will not work on the state server (all other instances are fine):
juju --debug run --machine 0 'ls'
2015-11-27 09:16:38 INFO juju.cmd supercommand.go:37 running juju [1.25.0-trusty-amd64 gc]
2015-11-27 09:16:38 DEBUG juju.api api.go:154 trying cached API connection settings - endpoints [192.168.0.50:17070]
2015-11-27 09:16:38 INFO juju.api api.go:266 connecting to API addresses: [192.168.0.50:17070]
2015-11-27 09:16:38 INFO juju.api apiclient.go:262 dialing "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-27 09:16:38 INFO juju.api apiclient.go:194 connection established to "wss://192.168.0.50:17070/environment/ab165d76-5703-4ec2-806f-fad6691ff339/api"
2015-11-27 09:16:38 DEBUG juju.api api.go:476 API hostnames unchanged - not resolving
2015-11-27 09:16:38 DEBUG juju.api api.go:506 cacheChangedAPIInfo: serverUUID="ab165d76-5703-4ec2-806f-fad6691ff339"
2015-11-27 09:21:38 ERROR juju.cmd supercommand.go:429 command timed out

The same using 'juju ssh' is fine:
juju ssh 0 ls
Warning: Permanently added '192.168.0.50' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.0.50' (ECDSA) to the list of known hosts.
Connection to 192.168.0.50 closed.

Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

I confirm I can run juju resolved -r about once a day and solves the hook errors. Subsequent tries the same day return:
ERROR cannot get status: unit not found

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@johan-ehnberg: it looks like you've now run into bug 1516989. I know this is being actively fixed today. There's also a tool being developed to fix affected environments.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@paulgear: I'm happy to look at the logs of the environment in question to check if all is well. Email me with details of how to get to the logs perhaps?

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Given the problem described here is a known issue with the 1.23 release series that has been fixed in 1.24 and has a workaround via the unstick-upgrade plugin I'm going to mark this as Wont Fix. Avoiding 1.23 is highly recommended.

As mentioned before the "unit not found" issue is covered by bug 1516989.

Changed in juju-core:
status: Triaged → Won't Fix
Revision history for this message
Johan Ehnberg (johan-ehnberg) wrote :

Since 1.23 is discouraged, could that be mentioned prominently at least in the docs if not in the client itself?
https://jujucharms.com/docs/1.25/juju-upgrade
https://jujucharms.com/docs/1.25/troubleshooting-upgrade

Furthermore, as mentioned, unstick-upgrade did not work in this case but I think some minor tuning might do the trick, if what the script was supposed to do was the manual linking I did above.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Johan - yes, we are going to update the documentation to remove references to 1.23 in the examples, and mention that it should not be used. Thanks for the feedback!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.