juju upgrade-juju failed to configure mongodb replicasets
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Menno Finlay-Smits | |
| 1.24 | | High | Menno Finlay-Smits | |
Bug Description
We have an environment where juju upgrade-juju failed to configure
replicasets on mongodb correctly leaving the environment without a
jujud on machine 0.
This environment was previously running juju 1.18.3.1 on Ubuntu 14.04
and we were in the process of upgrading to 1.20.11. machine 0 had run out
of space (because versions of juju available in an LTS don't rotate
logs - LP #1078213) and the full root partition had left apt in a very
confused state. That in turn left the juju-upgrade in an infinite
loop, retrying apt-get (LP #1441904).
Eventually a human came along, freed up space and unconfused apt.
This unblocked upgrade-juju but it didn't proceed as expected and
mongo was left in a state where replicasets are uninitialized.
Unfortunately, I can't trivially reproduce this if I try with a fresh
environment. It could be because it's a race of some sort or it could
be because of the specific way in which apt-get was failing (which I
also can't easily reproduce). apt-get was causing juju to say this:
juju.utils.apt apt.go:166 apt-get command failed: unexpected error type *errors.errorString
Thankfully this is a staging environment so we can leave it in a
broken state for as long as you need, if that helps.
| James Troup (elmo) wrote : | #1 |
| James Troup (elmo) wrote : | #2 |
| James Troup (elmo) wrote : | #3 |
| tags: | added: mongodb upgrade-juju |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.24-alpha1 |
| tags: | added: canonical-is |
| Changed in juju-core: | |
| milestone: | 1.24-alpha1 → 1.25.0 |
| Changed in juju-core: | |
| assignee: | nobody → Menno Smits (menno.smits) |
| Menno Finlay-Smits (menno.smits) wrote : | #4 |
I know it's been a while now but is this environment still up in its broken state?
If so, can you please get to the mongo shell on each state server host (as you already did on machine-0) and run "rs.status()"?
Also, is there a way for me to get access to this environment?
I'll keep looking through the logs and the code for 1.20 now.
| Changed in juju-core: | |
| status: | Triaged → In Progress |
| Menno Finlay-Smits (menno.smits) wrote : | #5 |
Hunting through the logs and the code has revealed the likely cause.
1. When the machine agent comes up it initialises the replicaset if the initial agent version (as read from the agent's config file) is less than 1.19.0 (HA was added in 1.19). However, because of the apt-get issue, the first start of the machine agent into 1.20 never got to initialise the replicaset.
2. There are no upgrade steps defined for 1.20 which caused the agent upgrade logic to immediately write the new agent version to the agent config soon after the 1.20 agent started.
3. Once the apt-get issue was resolved the agent was restarted but because the agent config now contains the new agent version the mongodb replicaset initialisation wasn't triggered, leading to a broken mongo setup.
If I've got this right, the following should fix the broken environment:
1. SSH to the state server host
2. Stop the machine agent (sudo stop jujud-machine-0)
3. Open /var/lib/
4. Update the upgradedToVersion field to 1.18.3.1
5. Start the machine agent (sudo start jujud-machine-0)
This should force juju to initialise the replicaset and the environment should come up as long as there aren't other problems caused by the disk space issue (I'm concerned what might have happened to mongodb when the disk ran out).
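The reasoning in this comment can be sketched in Go as follows. This is only an illustration of the version check described above, not Juju's actual code: `parseVersion`, `lessThan` and `needsReplicasetInit` are hypothetical names, and the only facts taken from the report are the 1.19.0 threshold and the `upgradedToVersion` field. It shows why an agent config that already records 1.20.11 skips replicaset initialisation, and why resetting the field to 1.18.3.1 would be expected to trigger it again.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted numeric version such as "1.18.3.1" into its
// integer components. Hypothetical helper; Juju has its own version type.
func parseVersion(s string) ([]int, error) {
	parts := strings.Split(s, ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %v", s, err)
		}
		out[i] = n
	}
	return out, nil
}

// lessThan reports whether a < b, treating missing components as 0.
func lessThan(a, b []int) bool {
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x != y {
			return x < y
		}
	}
	return false
}

// needsReplicasetInit mirrors the check described in the comment: the
// replicaset is only initialised when the version recorded in the agent
// config (upgradedToVersion) is older than 1.19.0.
func needsReplicasetInit(upgradedToVersion string) (bool, error) {
	v, err := parseVersion(upgradedToVersion)
	if err != nil {
		return false, err
	}
	return lessThan(v, []int{1, 19, 0}), nil
}

func main() {
	// "1.18.3.1" is the pre-upgrade version; "1.20.11" is what the agent
	// config held once the new version had been written to it.
	for _, v := range []string{"1.18.3.1", "1.20.11"} {
		ok, err := needsReplicasetInit(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("upgradedToVersion=%s -> initialise replicaset: %v\n", v, ok)
	}
}
```

Under these assumptions the check is true for 1.18.3.1 and false for 1.20.11, which matches the failure sequence described in points 1 to 3.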
| Menno Finlay-Smits (menno.smits) wrote : | #6 |
I've discussed this issue with Ian (wallyworld).
The way to fix this properly in Juju is to move the replicaset initialisation code from where it is to a state server only upgrade step (this concept didn't exist at the time HA was introduced hence the current [more fragile] way of initialising the replicaset).
The problem can only happen if there isn't an upgrade step defined, and all recent Juju releases have many upgrade steps, so this issue isn't critical for 1.24-alpha1. It will be fixed for the 1.24 series, however. Retargeting the bug to reflect this.
I'm also going to fix this for 1.23.
| Menno Finlay-Smits (menno.smits) wrote : | #7 |
Fix for 1.23: https:/
| Menno Finlay-Smits (menno.smits) wrote : | #8 |
Fix for 1.24: https:/
| Menno Finlay-Smits (menno.smits) wrote : | #9 |
Fix for master: https:/
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Chris Stratford (chris-gondolin) wrote : | #10 |
Sadly the fix suggested in post 5 doesn't seem to have done anything (and yes, it is still both up and broken, so feel free to throw tests/fixes our way and we'll try them out).
machine-0.log shows this after manually resetting upgradedToVersion:
2015-05-11 10:33:02 INFO juju.cmd supercommand.go:37 running jujud [1.20.11-
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:158 machine agent machine-0 start (1.20.11-
2015-05-11 10:33:02 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:169 no upgrade steps required or upgrade steps for 1.20.11 have already been run.
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "statestarter"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "termination"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "state"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-05-11 10:33:02 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
2015-05-11 10:33:02 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-05-11 10:33:02 INFO juju.mongo open.go:104 dialled mongo successfully
2015-05-11 10:33:05 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:250 error dialing "wss://
| Menno Finlay-Smits (menno.smits) wrote : | #11 |
Chris: sorry to hear the fix didn't help. Is there any way I can get access to that environment?
FWIW, the changes to stop this issue from happening in the first place are landing in the next stable versions of Juju.
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |
| Barry Price (barryprice) wrote : | #12 |
Menno - we should be able to grant access to the environment, emailing you with details.
| no longer affects: | juju-core/1.23 |


root@juju-stag-ue-summit-machine-0:~# mongo --ssl -u admin -p $(grep oldpassword /var/lib/juju/agents/machine-0/agent.conf | awk -e '{print $2}') localhost:37017/admin
MongoDB shell version: 2.4.9
connecting to: localhost:37017/admin
> show collections
Thu Apr 9 00:00:02.440 error: { "$err" : "not master and slaveOk=false", "code" : 13435 } at src/mongo/shell/query.js:128
> rs.status()
{
"startupStatus" : 3,
"info" : "run rs.initiate(...) if not yet done for the set",
"ok" : 0,
"errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)"
}
>