juju thinks it is upgrading after a reboot

Bug #1583683 reported by Darryl Weaver
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
juju-core        Expired    Undecided   Unassigned
juju-core 1.25   Won't Fix  Undecided   Unassigned

Bug Description

A single state server machine was rebooted.
After rebooting, no agents could connect to the state server stating that:
  err: login for "machine-1" blocked because upgrade in progress

Juju status still works; it shows the status of the agents that are unable to connect, all tools versions at 1.25.0, and an available version of 1.25.5.

However, any other juju commands state that functionality is limited due to an upgrade being in progress.

This causes each unit to log that it cannot connect to the state server, swamping the log files with entries from every unit every few seconds; that is how the problem was first noticed. To stop the flooding, the jujud service on each unit has been stopped.
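
For reference, this is roughly how the agents were stopped. The upstart job names below are hypothetical examples; the real names (one per unit and machine) can be listed with "initctl list | grep jujud" on each machine:

sudo service jujud-unit-mysql-0 stop   # example unit agent name, one per deployed unit
sudo service jujud-machine-1 stop      # example machine agent name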

The state server was also stopped, and mongodb was stopped and restarted to see if that corrected the issue; it did not. As the mongodb was very large, it was decided that a repair might be a good idea. The database was repaired using the command:
/usr/lib/juju/bin/mongod --auth --dbpath '/var/lib/juju/db' --sslOnNormalPorts --sslPEMKeyFile '/var/lib/juju/server.pem' --sslPEMKeyPassword ignored --port 37017 --noprealloc --syslog --smallfiles --keyFile '/var/lib/juju/shared-secret' --replSet juju --ipv6 --oplogSize 1024 --repair

This is the command from the /etc/init/juju-db.conf file with the --repair option added.
Mongo did some work repairing the tables and the space used appeared to shrink, but this did not solve the issue.

The mongo database and the state server jujud were started again. Inspecting the log files, there are entries about modifying the machine-0 user password, which could be relevant.
The mongodb was then inspected to see if there was anything obvious.

The mongodb login used the user machine-0, with the password taken from the "statepassword:" entry in /var/lib/juju/agents/machine-0/agent.conf.

This was used to log in to the mongodb with:
mongo admin --port 37017 --ssl -u machine-0 -p STATEPASSWORD
replacing STATEPASSWORD with the password retrieved.
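
For example, a minimal sketch that pulls the password straight out of agent.conf (this assumes the "statepassword:" value sits on a single line of that file, as described above):

PW=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
mongo admin --port 37017 --ssl -u machine-0 -p "$PW"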

This login was used to inspect the system.users collection:
db.system.users.find()
{ "_id" : ObjectId("5669c7ddb7e2d47e2d0dda2c"), "pwd" : "953f3915d4d1664562ffb23a98781cea", "roles" : [ "dbAdminAnyDatabase", "userAdminAnyDatabase", "clusterAdmin", "readWriteAnyDatabase" ], "user" : "admin" }
{ "_id" : ObjectId("5669c7deb7e2d47e2d0dda2e"), "pwd" : "45f37171811d1e4ddb1e876fe87432a7", "roles" : [ "dbAdminAnyDatabase", "userAdminAnyDatabase", "clusterAdmin", "readWriteAnyDatabase" ], "user" : "machine-0" }

But when jujud on machine-0 tries to start up, the log shows the error:
2016-05-19 15:44:01 INFO juju.apiserver.common password.go:105 setting password for "machine-0"
2016-05-19 15:44:01 DEBUG juju.apiserver apiserver.go:275 -> [6C1] machine-0 149.634561ms {"RequestId":4,"Response":"'body redacted'"} Agent[""].SetPasswords
2016-05-19 15:44:01 DEBUG juju.apiserver apiserver.go:275 -> [6C3] user-admin@local 586.3us {"RequestId":4,"Response":"'body redacted'"} Client[""].PrivateAddress
2016-05-19 15:44:01 INFO juju.worker runner.go:275 stopped "api", err: can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 DEBUG juju.worker runner.go:203 "api" done: can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 ERROR juju.worker runner.go:223 exited "api": can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 INFO juju.worker runner.go:261 restarting "api" in 3s
2016-05-19 15:44:01 INFO juju.apiserver apiserver.go:284 [6C1] machine-0 API connection terminated after 435.924025ms
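
One way to confirm that the transaction referenced in that error really is missing is to look it up directly from the mongo shell opened earlier. This is only a sketch: the "juju" database and "txns" collection names are assumptions based on the standard mgo/txn layout used by juju 1.25.

use juju
db.txns.find({ _id: ObjectId("56cad80a0348dc044300000a") })  // an empty result means the transaction document is gone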

All files from /var/log/ are attached to the bug report, so all of the mongoDB logs and the juju logs that were not rotated out are available.

I have been unable to work out any way to recover this into a working state.

The current state is that all juju agents are stopped across the whole environment, both to prevent juju from making any changes and to prevent the log files from exhausting disk space.

We can therefore modify the state server without affecting the rest of the environment, and then add agents back one at a time once the state server is up and running again.
Unfortunately, backups failed due to the large size of the juju database, so there is no fall-back position other than re-deploying the entire environment and moving all production workloads.

Darryl Weaver (dweaver) wrote :

All logs from /var/log/ on machine-0.

Cheryl Jennings (cherylj) wrote :

When agents start, they are always in an "upgrading" state so as to not attempt operations while an upgrade is occurring. Once they check that an upgrade is not required, they will switch out of that state.

It looks like, in your case, that check isn't happening because of the mongo errors. There is a tool that you can use to purge these missing transactions: https://github.com/mjs/mgopurge

You can build and run it yourself, or let me know if you'd like me to build it for you.

Once you run mgopurge, try restarting the jujud agent on machine 0. If that was the only problem, the agent should come back up and exit the upgrading state.
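
A rough sketch of those steps (the build commands and service name below are assumptions; check the mgopurge README and its --help output for the exact build and connection options):

# on a machine with a Go toolchain
git clone https://github.com/mjs/mgopurge
cd mgopurge && go build ./...   # or follow the build steps in the project README

# copy the resulting binary to the state server, then on the state server:
sudo ./mgopurge                       # connection options (host/port/ssl/credentials) per --help
sudo service jujud-machine-0 restart  # upstart job name for the machine-0 agent on 1.25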

Cheryl Jennings (cherylj) wrote :

It would be nice to understand why mongo got into that state originally. Was there a particular reason the state server was rebooted?

Changed in juju-core:
status: New → Incomplete
Darryl Weaver (dweaver) wrote :

I downloaded, compiled, and ran mgopurge against the MongoDB, and it cleaned up a long list of transactions in the database. I then restarted the machine agent on machine-0, and it came up correctly without the errors from the previous tries. After about 2 minutes the juju logs recorded that the "upgrader" service had started and completed correctly, and juju queries no longer stated that juju was upgrading.

So, that did solve the problem.

With regard to how the mongoDB got into this state in the first place, we would have to dig deeper to find clues, as the log files covering the original error were rotated out of the current log files before capture.

The state server was rebooted originally because of slow performance of queries to the API.

Darryl Weaver (dweaver) wrote :

Unfortunately, there is not much more information to go on.
By the time we were able to look into the logs, the relevant period had been rotated out, and all we could see were the juju and mongodb logs from after the reboot.

So, no information on how it got into this state in the first place. We only know that Juju queries were extremely slow, but were completing. Operations decided to reboot the machine in an attempt to fix the slow performance.

Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired