juju upgrade-juju failed to configure mongodb replicasets

Bug #1441913 reported by James Troup
This bug affects 1 person
Affects    Status        Importance  Assigned to         Milestone
juju-core  Fix Released  High        Menno Finlay-Smits
1.24       Fix Released  High        Menno Finlay-Smits

Bug Description

We have an environment where juju upgrade-juju failed to configure
replicasets on mongodb correctly leaving the environment without a
jujud on machine 0.

This environment was previously running juju 1.18.3.1 on Ubuntu 14.04
and we were in the process of upgrading to 1.20.11. Machine 0 had run out
of space (because versions of juju available in an LTS don't rotate
logs - LP #1078213) and the full root partition had left apt in a very
confused state. That in turn left the juju-upgrade in an infinite
loop, retrying apt-get (LP #1441904).

Eventually a human came along, freed up space and unconfused apt.
This unblocked upgrade-juju but it didn't proceed as expected and
mongo was left in a state where replicasets are uninitialized.

Unfortunately, I can't trivially reproduce this if I try it with a fresh
environment. It could be because it's a race of some sort or it could
be because of the specific way in which apt-get was failing (which I
also can't easily reproduce). apt-get was causing juju to say this:

  juju.utils.apt apt.go:166 apt-get command failed: unexpected error type *errors.errorString

Thankfully this is a staging environment so we can leave it in a
broken state for as long as you need, if that helps.

Revision history for this message
James Troup (elmo) wrote :

root@juju-stag-ue-summit-machine-0:~# mongo --ssl -u admin -p $(grep oldpassword /var/lib/juju/agents/machine-0/agent.conf | awk -e '{print $2}') localhost:37017/admin
MongoDB shell version: 2.4.9
connecting to: localhost:37017/admin
> show collections
Thu Apr 9 00:00:02.440 error: { "$err" : "not master and slaveOk=false", "code" : 13435 } at src/mongo/shell/query.js:128
> rs.status()
{
        "startupStatus" : 3,
        "info" : "run rs.initiate(...) if not yet done for the set",
        "ok" : 0,
        "errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)"
}
>

Curtis Hovey (sinzui)
tags: added: mongodb upgrade-juju
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24-alpha1
tags: added: canonical-is
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I know it's been a while now, but is this environment still up in its broken state?

If so, can you please get to the mongo shell on each state server host (as you already did on machine-0) and run "rs.status()"?

Also, is there a way for me to get access to this environment?

I'll keep looking through the logs and the code for 1.20 now.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Hunting through the logs and the code has revealed the likely cause.

1. When the machine agent comes up, it initialises the replicaset if the initial agent version (as read from the agent's config file) is less than 1.19.0 (HA was added in 1.19). However, because of the apt-get issue, the first start of the machine agent into 1.20 never got to initialise the replicaset.

2. There are no upgrade steps defined for 1.20, which caused the agent upgrade logic to write the new agent version to the agent config soon after the 1.20 agent started.

3. Once the apt-get issue was resolved, the agent was restarted, but because the agent config now contained the new agent version, the mongodb replicaset initialisation wasn't triggered, leading to a broken mongo setup.
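
To make the sequence above easier to follow, here is a minimal Python sketch of it. It is a simplified model, not Juju's code: the names `start_agent`, `parse_version`, `HA_INTRODUCED`, `has_upgrade_steps` and `apt_ok` are all hypothetical, and the apt-get failure is reduced to a single boolean.

```python
# Hypothetical model of the start-up behaviour described above.
# None of these names come from the Juju source.

HA_INTRODUCED = (1, 19, 0)  # HA (and so the replicaset) arrived in 1.19


def parse_version(text):
    """Turn a dotted version such as '1.18.3.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def start_agent(config, binary_version, has_upgrade_steps, apt_ok):
    """Simulate one agent start; return True if the replicaset was initialised."""
    wants_init = parse_version(config["upgradedToVersion"]) < HA_INTRODUCED
    # In this model, initialisation only completes if apt-get works.
    initialised = wants_init and apt_ok
    if not has_upgrade_steps:
        # With no upgrade steps to run, the new version is recorded straight
        # away, whether or not the replicaset was set up.
        config["upgradedToVersion"] = binary_version
    return initialised


config = {"upgradedToVersion": "1.18.3.1"}
# First start: apt-get is broken, so the replicaset is not initialised.
first = start_agent(config, "1.20.11", has_upgrade_steps=False, apt_ok=False)
# Second start: apt-get works, but the recorded version is already 1.20.11.
second = start_agent(config, "1.20.11", has_upgrade_steps=False, apt_ok=True)
print(first, second, config["upgradedToVersion"])
```

In this model both starts report `False`: the first because apt-get fails, the second because the recorded version is no longer below 1.19.0.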

If I've got this right, the following should fix the broken environment:

1. SSH to the state server host
2. Stop the machine agent (sudo stop jujud-machine-0)
3. Open /var/lib/juju/agents/machine-0/agent.conf in an editor
4. Update the upgradedToVersion field to 1.18.3.1
5. Start the machine agent (sudo start jujud-machine-0)

This should force juju to initialise the replicaset, and the environment should come up as long as there aren't other problems caused by the disk space issue (I'm concerned about what might have happened to mongodb when the disk ran out).
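
For anyone who would rather script the edit in step 4 than use an editor, the following is a rough Python sketch. It assumes agent.conf stores the field on a single `upgradedToVersion: <value>` line, which has not been checked against the real file; `set_upgraded_to_version` is a hypothetical helper, and the sample text is invented for illustration. It only transforms text, so the agent still has to be stopped and started by hand as in steps 2 and 5.

```python
import re


def set_upgraded_to_version(conf_text, version):
    """Return conf_text with the upgradedToVersion line set to version.

    Assumes exactly one 'upgradedToVersion: <value>' line exists and
    raises ValueError otherwise, rather than guessing.
    """
    pattern = re.compile(r"^([ \t]*upgradedToVersion:[ \t]*).*$", re.MULTILINE)
    # Keep the original key and spacing; replace only the value.
    new_text, count = pattern.subn(lambda m: m.group(1) + version, conf_text)
    if count != 1:
        raise ValueError(
            "expected exactly one upgradedToVersion line, found %d" % count
        )
    return new_text


# Invented sample content, not a real agent.conf.
sample = "tag: machine-0\nupgradedToVersion: 1.20.11\nloglevel: DEBUG\n"
print(set_upgraded_to_version(sample, "1.18.3.1"))
```

Working on a copy of the file and comparing the result with the original before putting it in place would be a sensible precaution.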

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've discussed this issue with Ian (wallyworld).

The way to fix this properly in Juju is to move the replicaset initialisation code from where it is to a state server only upgrade step (this concept didn't exist at the time HA was introduced hence the current [more fragile] way of initialising the replicaset).

The problem can only happen if there isn't an upgrade step defined, and all recent Juju releases have many upgrade steps, so this issue isn't critical for 1.24-alpha1. It will be fixed for the 1.24 series, however. Retargeting the bug to reflect this.

I'm also going to fix this for 1.23.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Chris Stratford (chris-gondolin) wrote :

Sadly, the fix suggested in post 5 doesn't seem to have done anything (and yes, it is still both up and broken, so feel free to throw tests/fixes our way and we'll try them out).

machine-0.log shows this after manually resetting upgradedToVersion:

2015-05-11 10:33:02 INFO juju.cmd supercommand.go:37 running jujud [1.20.11-trusty-amd64 gc]
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:158 machine agent machine-0 start (1.20.11-trusty-amd64 [gc])
2015-05-11 10:33:02 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:169 no upgrade steps required or upgrade steps for 1.20.11 have already been run.
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "statestarter"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "termination"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "state"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-05-11 10:33:02 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://localhost:17070/"
2015-05-11 10:33:02 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-05-11 10:33:02 INFO juju.mongo open.go:104 dialled mongo successfully
2015-05-11 10:33:05 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Chris: sorry to hear the fix didn't help. Is there any way I can get access to that environment?

FWIW, the changes to stop this issue from happening in the first place are landing in the next stable versions of Juju.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Barry Price (barryprice) wrote :

Menno - we should be able to grant access to the environment, emailing you with details.

no longer affects: juju-core/1.23