juju thinks it is upgrading after a reboot

Bug #1583683 reported by Darryl Weaver
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
juju-core        Expired    Undecided   Unassigned
juju-core 1.25   Won't Fix  Undecided   Unassigned

Bug Description

A single state server machine was rebooted.
After rebooting, no agents could connect to the state server stating that:
  err: login for "machine-1" blocked because upgrade in progress

Juju status still works; it shows the status of the agents that are unable to connect, all tools versions at 1.25.0, and an available version of 1.25.5.

However, any other juju commands state that functionality is limited due to an upgrade being in progress.

This causes each unit to log that it cannot connect to the state server, swamping the log files with entries from every unit every few seconds; that is how the problem was first noticed. To stop the flooding, the jujud service on each unit has been stopped.
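
For reference, this is roughly how the agents were stopped. The upstart job names below are hypothetical examples; the real names (one per unit and machine) can be listed with "initctl list | grep jujud" on each machine:

sudo service jujud-unit-mysql-0 stop   # example unit agent name, one per deployed unit
sudo service jujud-machine-1 stop      # example machine agent name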

The state server was also stopped, and mongodb was stopped and restarted to see if that corrected the issue; it did not. As the mongodb was very large, it was decided that a repair might be a good idea. The database was repaired using the command:
/usr/lib/juju/bin/mongod --auth --dbpath '/var/lib/juju/db' --sslOnNormalPorts --sslPEMKeyFile '/var/lib/juju/server.pem' --sslPEMKeyPassword ignored --port 37017 --noprealloc --syslog --smallfiles --keyFile '/var/lib/juju/shared-secret' --replSet juju --ipv6 --oplogSize 1024 --repair

This is the command from the /etc/init/juju-db.conf file with the --repair option added.
Mongo did some work repairing the tables and the space used appeared to shrink, but this did not solve the issue.

The mongo database and the state server jujud were started again. Inspecting the log files, there are entries about modifying the machine-0 user password, which could be relevant.
The mongodb was then inspected to see if there was anything obvious.

The mongodb login used the user machine-0, with the password taken from the "statepassword:" entry in /var/lib/juju/agents/machine-0/agent.conf.

This was used to log in to the mongodb with:
mongo admin --port 37017 --ssl -u machine-0 -p STATEPASSWORD
replacing STATEPASSWORD with the password retrieved.
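
For example, a minimal sketch that pulls the password straight out of agent.conf (this assumes the "statepassword:" value sits on a single line of that file, as described above):

PW=$(sudo awk '/^statepassword:/ {print $2}' /var/lib/juju/agents/machine-0/agent.conf)
mongo admin --port 37017 --ssl -u machine-0 -p "$PW"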

This login was used to inspect the system.users collection:
db.system.users.find()
{ "_id" : ObjectId("5669c7ddb7e2d47e2d0dda2c"), "pwd" : "953f3915d4d1664562ffb23a98781cea", "roles" : [ "dbAdminAnyDatabase", "userAdminAnyDatabase", "clusterAdmin", "readWriteAnyDatabase" ], "user" : "admin" }
{ "_id" : ObjectId("5669c7deb7e2d47e2d0dda2e"), "pwd" : "45f37171811d1e4ddb1e876fe87432a7", "roles" : [ "dbAdminAnyDatabase", "userAdminAnyDatabase", "clusterAdmin", "readWriteAnyDatabase" ], "user" : "machine-0" }

But when jujud on machine-0 tries to start up, the log shows the error:
2016-05-19 15:44:01 INFO juju.apiserver.common password.go:105 setting password for "machine-0"
2016-05-19 15:44:01 DEBUG juju.apiserver apiserver.go:275 -> [6C1] machine-0 149.634561ms {"RequestId":4,"Response":"'body redacted'"} Agent[""].SetPasswords
2016-05-19 15:44:01 DEBUG juju.apiserver apiserver.go:275 -> [6C3] user-admin@local 586.3us {"RequestId":4,"Response":"'body redacted'"} Client[""].PrivateAddress
2016-05-19 15:44:01 INFO juju.worker runner.go:275 stopped "api", err: can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 DEBUG juju.worker runner.go:203 "api" done: can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 ERROR juju.worker runner.go:223 exited "api": can't reset agent password: cannot set password of machine 0: cannot find transaction ObjectIdHex("56cad80a0348dc044300000a")
2016-05-19 15:44:01 INFO juju.worker runner.go:261 restarting "api" in 3s
2016-05-19 15:44:01 INFO juju.apiserver apiserver.go:284 [6C1] machine-0 API connection terminated after 435.924025ms
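
One way to confirm that the transaction referenced in that error really is missing is to look it up directly from the mongo shell opened earlier. This is only a sketch: the "juju" database and "txns" collection names are assumptions based on the standard mgo/txn layout used by juju 1.25.

use juju
db.txns.find({ _id: ObjectId("56cad80a0348dc044300000a") })  // an empty result means the transaction document is gone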

All files from /var/log/ are attached to the bug report, so all of the mongoDB logs and the juju logs that were not rotated out are available.

I have been unable to work out any way to recover this into a working state.

The current state is that all juju agents are stopped across the whole environment, both to prevent juju from making any changes and to prevent the log files from exhausting disk space.

We can therefore modify the state server without affecting the rest of the environment, and then add agents back one at a time once the state server is up and running again.
Unfortunately, backups failed due to the large size of the juju database, so there is no fall-back position other than re-deploying the entire environment and moving all production workloads.

Darryl Weaver (dweaver) wrote :

All logs from /var/log/ on machine-0.

Cheryl Jennings (cherylj) wrote :

When agents start, they are always in an "upgrading" state so as to not attempt operations while an upgrade is occurring. Once they check that an upgrade is not required, they will switch out of that state.

It looks like, in your case, that check isn't happening because of the mongo errors. There is a tool that you can use to purge these missing transactions: https://github.com/mjs/mgopurge

You can build and run it yourself, or let me know if you'd like me to build it for you.

Once you run mgopurge, try restarting the jujud agent on machine 0. If that was the only problem, the agent should come back up and exit the upgrading state.
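
A rough sketch of those steps (the build commands and service name below are assumptions; check the mgopurge README and its --help output for the exact build and connection options):

# on a machine with a Go toolchain
git clone https://github.com/mjs/mgopurge
cd mgopurge && go build ./...   # or follow the build steps in the project README

# copy the resulting binary to the state server, then on the state server:
sudo ./mgopurge                       # connection options (host/port/ssl/credentials) per --help
sudo service jujud-machine-0 restart  # upstart job name for the machine-0 agent on 1.25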

Cheryl Jennings (cherylj) wrote :

It would be nice to understand why mongo got into that state originally. Was there a particular reason the state server was rebooted?

Changed in juju-core:
status: New → Incomplete
Darryl Weaver (dweaver) wrote :

I downloaded, compiled, and ran mgopurge against the MongoDB, and it cleaned up a long list of transactions in the database. I then restarted the machine agent on machine-0, and it came up correctly without the errors from the previous tries. After about 2 minutes the juju logs recorded that the "upgrader" service had started and completed correctly, and juju queries no longer stated that juju was upgrading.

So, that did solve the problem.

With regard to how the mongoDB got into this state in the first place, we would have to dig deeper to find clues, as the log files covering the original error were rotated out of the current log files before capture.

The state server was rebooted originally because of slow performance of queries to the API.

Darryl Weaver (dweaver) wrote :

Unfortunately, there is not much more information to go on.
By the time we were able to look into the logs, the relevant period had been rotated out, and all we could see were the juju and mongodb logs from after the reboot.

So, no information on how it got into this state in the first place. We only know that Juju queries were extremely slow, but were completing. Operations decided to reboot the machine in an attempt to fix the slow performance.

Launchpad Janitor (janitor) wrote :

[Expired for juju-core because there has been no activity for 60 days.]

Changed in juju-core:
status: Incomplete → Expired