juju-core

juju upgrade-juju failed to configure mongodb replicasets

Bug #1441913 reported by James Troup on 2015-04-08

This bug affects 1 person

Affects     Status         Importance   Assigned to          Milestone
juju-core   Fix Released   High         Menno Finlay-Smits   juju-core 1.25-alpha1
1.24        Fix Released   High         Menno Finlay-Smits   juju-core 1.24-beta2

Bug Description

We have an environment where juju upgrade-juju failed to configure
replicasets on mongodb correctly, leaving the environment without a
jujud on machine 0.

This environment was previously running juju 1.18.3.1 on Ubuntu 14.04
and we were in the process of upgrading to 1.20.11. Machine 0 had run out
of space (because versions of juju available in an LTS don't rotate
logs - LP #1078213) and the full root partition had left apt in a very
confused state. That in turn left the juju-upgrade in an infinite
loop, retrying apt-get (LP #1441904).

Eventually a human came along, freed up space and unconfused apt.
This unblocked upgrade-juju but it didn't proceed as expected and
mongo was left in a state where replicasets are uninitialized.

Unfortunately, I can't trivially reproduce this if I try with a fresh
environment. It could be because it's a race of some sort or it could
be because of the specific way in which apt-get was failing (which I
also can't easily reproduce). apt-get was causing juju to say this:

juju.utils.apt apt.go:166 apt-get command failed: unexpected error type *errors.errorString

Thankfully this is a staging environment so we can leave it in a
broken state for as long as you need, if that helps.



James Troup (elmo) wrote on 2015-04-08:

/var/log/juju/machine-0.log (497.4 KiB, application/octet-stream)


James Troup (elmo) wrote on 2015-04-08:

mongod references from syslog (2.3 MiB, application/octet-stream)


James Troup (elmo) wrote on 2015-04-09:

root@juju-stag-ue-summit-machine-0:~# mongo --ssl -u admin -p $(grep oldpassword /var/lib/juju/agents/machine-0/agent.conf | awk -e '{print $2}') localhost:37017/admin
MongoDB shell version: 2.4.9
connecting to: localhost:37017/admin
> show collections
Thu Apr 9 00:00:02.440 error: { "$err" : "not master and slaveOk=false", "code" : 13435 } at src/mongo/shell/query.js:128
> rs.status()
{
        "startupStatus" : 3,
        "info" : "run rs.initiate(...) if not yet done for the set",
        "ok" : 0,
        "errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)"
}
>
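
For anyone repeating this check on other state servers, the reply above can be classified mechanically. This is only a sketch: `replset_uninitialised` is a hypothetical helper (not part of juju or MongoDB), and it assumes the rs.status() reply has already been obtained as a Python dict by whatever means. It tests only the fields visible in the output above.

```python
def replset_uninitialised(status):
    """Return True if an rs.status() reply looks like the EMPTYCONFIG case above."""
    # A member of a working replica set replies with ok == 1.
    if status.get("ok") == 1:
        return False
    errmsg = status.get("errmsg", "")
    # Either marker seen on machine 0 is treated as "never initialised".
    return status.get("startupStatus") == 3 or "EMPTYCONFIG" in errmsg


# The reply pasted above, transcribed as a dict.
reply = {
    "startupStatus": 3,
    "info": "run rs.initiate(...) if not yet done for the set",
    "ok": 0,
    "errmsg": "can't get local.system.replset config from self or any seed (EMPTYCONFIG)",
}
print(replset_uninitialised(reply))
```

A reply with "ok" : 1 would be reported as initialised; anything else that lacks both markers is also reported as False, so this is a narrow check for this one failure, not a general health test.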

Curtis Hovey (sinzui) on 2015-04-09

tags:	added: mongodb upgrade-juju
Changed in juju-core:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 1.24-alpha1
tags:	added: canonical-is

Curtis Hovey (sinzui) on 2015-04-27

Changed in juju-core:
milestone:	1.24-alpha1 → 1.25.0

Menno Finlay-Smits (menno.smits) on 2015-04-30

Changed in juju-core:
assignee:	nobody → Menno Smits (menno.smits)


Menno Finlay-Smits (menno.smits) wrote on 2015-04-30:

I know it's been a while now, but is this environment still up in its broken state?

If so, can you please get to the mongo shell on each state server host (as you already did on machine-0) and run "rs.status()"?

Also, is there a way for me to get access to this environment?

I'll keep looking through the logs and the code for 1.20 now.

Menno Finlay-Smits (menno.smits) on 2015-05-01

Changed in juju-core:
status:	Triaged → In Progress


Menno Finlay-Smits (menno.smits) wrote on 2015-05-01:

Hunting through the logs and the code has revealed the likely cause.

1. When the machine agent comes up, it initialises the replicaset if the initial agent version (as read from the agent's config file) is less than 1.19.0 (HA was added in 1.19). However, because of the apt-get issue, the first start of the machine agent into 1.20 never got to initialise the replicaset.

2. There are no upgrade steps defined for 1.20, which caused the agent upgrade logic to write the new agent version to the agent config soon after the 1.20 agent started.

3. Once the apt-get issue was resolved, the agent was restarted, but because the agent config now contained the new agent version, the mongodb replicaset initialisation wasn't triggered, leading to a broken mongo setup.
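
The three steps above can be modelled in a few lines. This is a toy simulation, not juju's actual code (which is Go): `agent_start`, `should_init_replicaset` and the `apt_ok` flag are hypothetical names, and the config is reduced to a dict holding only `upgradedToVersion`.

```python
def parse_version(text):
    """Turn '1.18.3.1' into (1, 18, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


HA_VERSION = parse_version("1.19.0")


def should_init_replicaset(upgraded_to_version):
    # Step 1: initialise only if the recorded version predates HA (1.19).
    return parse_version(upgraded_to_version) < HA_VERSION


def agent_start(config, new_version, upgrade_steps, apt_ok):
    """Simulate one agent start; return True if the replicaset got initialised."""
    initialised = should_init_replicaset(config["upgradedToVersion"]) and apt_ok
    if not upgrade_steps:
        # Step 2: with no upgrade steps, the new version is recorded regardless.
        config["upgradedToVersion"] = new_version
    return initialised


config = {"upgradedToVersion": "1.18.3.1"}
first = agent_start(config, "1.20.11", upgrade_steps=[], apt_ok=False)
# Step 3: apt is fixed, but the recorded version no longer triggers initialisation.
second = agent_start(config, "1.20.11", upgrade_steps=[], apt_ok=True)
print(first, second, config["upgradedToVersion"])  # False False 1.20.11
```

Neither start initialises the replicaset: the first because apt-get is failing, the second because the version check now sees 1.20.11.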

If I've got this right, the following should fix the broken environment:

1. SSH to the state server host
2. Stop the machine agent (sudo stop jujud-machine-0)
3. Open /var/lib/juju/agents/machine-0/agent.conf in an editor
4. Update the upgradedToVersion field to 1.18.3.1
5. Start the machine agent (sudo start jujud-machine-0)

This should force juju to initialise the replicaset, and the environment should come up as long as there aren't other problems caused by the disk space issue (I'm concerned about what might have happened to mongodb when the disk ran out).
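
If the edit in steps 3 and 4 needs to be scripted rather than done by hand, something like the following might do. It is a sketch under assumptions: it treats agent.conf as plain text with a single `upgradedToVersion: <value>` line, `set_upgraded_to_version` is a hypothetical helper, and the sample text is made up for illustration rather than copied from a real agent.conf. Stop the agent first and keep a backup of the file.

```python
import re


def set_upgraded_to_version(conf_text, version):
    """Return conf_text with its upgradedToVersion line set to version."""
    pattern = re.compile(r"^(upgradedToVersion:[ \t]*).*$", re.MULTILINE)
    # A function replacement avoids backreference clashes with digits in version.
    new_text, count = pattern.subn(lambda m: m.group(1) + version, conf_text)
    if count != 1:
        raise ValueError("expected exactly one upgradedToVersion line, found %d" % count)
    return new_text


# Made-up sample, only to show the substitution.
sample = "tag: machine-0\nupgradedToVersion: 1.20.11\n"
print(set_upgraded_to_version(sample, "1.18.3.1"))
```

The helper refuses to guess if the field is missing or duplicated, so a file in an unexpected format is left alone.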


Menno Finlay-Smits (menno.smits) wrote on 2015-05-01:

I've discussed this issue with Ian (wallyworld).

The way to fix this properly in Juju is to move the replicaset initialisation code from where it is to a state server only upgrade step (this concept didn't exist at the time HA was introduced, hence the current [more fragile] way of initialising the replicaset).
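
To illustrate why an upgrade step is less fragile, here is a rough model. It assumes the new version is recorded only after every applicable step has succeeded; `UpgradeStep` and `run_upgrade` are hypothetical names, not juju's upgrade API, and the failing step merely stands in for the apt-get problem.

```python
class UpgradeStep:
    """One upgrade operation; run() raises an exception on failure."""

    def __init__(self, description, state_server_only, run):
        self.description = description
        self.state_server_only = state_server_only
        self.run = run


def run_upgrade(config, new_version, steps, is_state_server):
    """Run the applicable steps, then record new_version in config."""
    for step in steps:
        if step.state_server_only and not is_state_server:
            continue  # skip state-server-only steps on other machines
        step.run()  # an exception here leaves upgradedToVersion unchanged
    config["upgradedToVersion"] = new_version


calls = []


def init_replicaset():
    # Stand-in for the real initialisation: fail once, as if apt-get were broken.
    calls.append("init")
    if len(calls) == 1:
        raise RuntimeError("apt-get command failed")


steps = [UpgradeStep("initialise mongodb replicaset", True, init_replicaset)]
config = {"upgradedToVersion": "1.18.3.1"}

try:
    run_upgrade(config, "1.20.11", steps, is_state_server=True)
except RuntimeError:
    pass
version_after_failure = config["upgradedToVersion"]

run_upgrade(config, "1.20.11", steps, is_state_server=True)
print(version_after_failure, config["upgradedToVersion"])  # 1.18.3.1 1.20.11
```

Because the failed attempt leaves the old version in place, the next start retries the initialisation instead of skipping it.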

The problem can only happen if there isn't an upgrade step defined. All recent Juju releases have many upgrade steps, so this issue isn't critical for 1.24-alpha1. It will be fixed for the 1.24 series, however. Retargeting the bug to reflect this.

I'm also going to fix this for 1.23.


Menno Finlay-Smits (menno.smits) wrote on 2015-05-07:

Fix for 1.23: https://github.com/juju/juju/pull/2239


Menno Finlay-Smits (menno.smits) wrote on 2015-05-07:

Fix for 1.24: https://github.com/juju/juju/pull/2241


Menno Finlay-Smits (menno.smits) wrote on 2015-05-07:

Fix for master: https://github.com/juju/juju/pull/2242

Menno Finlay-Smits (menno.smits) on 2015-05-11

Changed in juju-core:
status:	In Progress → Fix Committed


Chris Stratford (chris-gondolin) wrote on 2015-05-11:

#10

Sadly, the fix suggested in post 5 doesn't seem to have done anything (and yes, it is still both up and broken, so feel free to throw tests/fixes our way and we'll try them out).

machine-0.log shows this after manually resetting upgradedToVersion:

2015-05-11 10:33:02 INFO juju.cmd supercommand.go:37 running jujud [1.20.11-trusty-amd64 gc]
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:158 machine agent machine-0 start (1.20.11-trusty-amd64 [gc])
2015-05-11 10:33:02 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2015-05-11 10:33:02 INFO juju.cmd.jujud machine.go:169 no upgrade steps required or upgrade steps for 1.20.11 have already been run.
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "statestarter"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "termination"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-05-11 10:33:02 INFO juju.worker runner.go:260 start "state"
2015-05-11 10:33:02 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-05-11 10:33:02 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://localhost:17070/"
2015-05-11 10:33:02 INFO juju.worker runner.go:252 restarting "api" in 3s
2015-05-11 10:33:02 INFO juju.mongo open.go:104 dialled mongo successfully
2015-05-11 10:33:05 INFO juju.worker runner.go:260 start "api"
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-05-11 10:33:05 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused


Menno Finlay-Smits (menno.smits) wrote on 2015-05-11:

#11

Chris: sorry to hear the fix didn't help. Is there any way I can get access to that environment?

FWIW, the changes to stop this issue from happening in the first place are landing in the next stable versions of Juju.

Curtis Hovey (sinzui) on 2015-05-20

Changed in juju-core:
status:	Fix Committed → Fix Released


Barry Price (barryprice) wrote on 2015-05-27:

#12

Menno - we should be able to grant access to the environment, emailing you with details.

Menno Finlay-Smits (menno.smits) on 2015-10-07

no longer affects:

juju-core/1.23
