Cannot achieve HA

Bug #1355320 reported by Curtis Hovey
This bug affects 1 person
Affects: juju-core
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins

Bug Description

Both functional-ha-recovery and functional-ha-backup-restore fail as of commit f1511b48. The last passing commit is 4ed820f2, but that might have been luck, because the last time HA was reliable was commit e3dfcc01.

In both tests, ensure-ha never reaches HA. This might be mongo related because the backup-restore test also failed, though the errors look different. The specific error is that status fails; juju never gets to HA. I suspect the juju client is raising a real error because the state server has become unavailable in a way that status doesn't recognise as a case where the user should try again. In general, status should not raise an error while the state server transitions to HA, because we know transitioning to HA is normal.

I have retested on AWS and HP. The error is the same. I am attaching the latest log. There isn't really anything to learn from it because juju is not providing details of why status failed. I will try to capture a log from machine 0.

Tags: ci ha regression
Curtis Hovey (sinzui) wrote :
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
status: Triaged → In Progress
Menno Finlay-Smits (menno.smits) wrote :

I can reproduce this easily on EC2. You just need to do:

  juju bootstrap
  juju deploy ubuntu # this might not be necessary but it's what the test does
  juju ensure-availability

At some point "juju status" stops working with "WARNING discarding API open error: auth fails". If you look at the machine agent logs for the new state servers created by ensure-availability, they fail right after startup like this:

2014-08-11 23:17:24 INFO juju.cmd.jujud machine.go:164 machine agent machine-2 start (1.21-alpha1.1-trusty-amd64 [gc])
2014-08-11 23:17:24 INFO juju.network network.go:97 setting prefer-ipv6 to false
2014-08-11 23:17:24 INFO juju.worker runner.go:261 start "api"
2014-08-11 23:17:24 INFO juju.worker runner.go:261 start "statestarter"
2014-08-11 23:17:24 INFO juju.worker runner.go:261 start "termination"
2014-08-11 23:17:24 INFO juju.state.api apiclient.go:252 dialing "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:25 INFO juju.state.api apiclient.go:175 connection established to "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:25 INFO juju.state.api apiclient.go:252 dialing "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:26 INFO juju.state.api apiclient.go:175 connection established to "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:26 ERROR juju.worker runner.go:219 exited "api": cannot set password of machine 2: auth fails
2014-08-11 23:17:26 INFO juju.worker runner.go:253 restarting "api" in 3s
2014-08-11 23:17:29 INFO juju.worker runner.go:261 start "api"
2014-08-11 23:17:29 INFO juju.state.api apiclient.go:252 dialing "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:29 INFO juju.state.api apiclient.go:175 connection established to "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:29 ERROR juju.worker runner.go:219 exited "api": cannot get machine 2: auth fails
2014-08-11 23:17:29 INFO juju.worker runner.go:253 restarting "api" in 3s
(continues on and on)
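To pull just the failures out of a long agent log like the one above, a small Go filter along these lines may help. It is a sketch that assumes every entry has the layout shown in the excerpt (date, time, level, module, file:line, message); `errorLines` is a hypothetical helper, not part of juju.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// errorLines returns the message part of every ERROR entry in a
// machine agent log. Lines that do not match the assumed layout
// are skipped.
func errorLines(log string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		// Fields: date, time, level, module, file:line, message.
		f := strings.SplitN(sc.Text(), " ", 6)
		if len(f) == 6 && f[2] == "ERROR" {
			out = append(out, f[5])
		}
	}
	return out
}

func main() {
	// Two entries copied from the excerpt above.
	log := `2014-08-11 23:17:25 INFO juju.state.api apiclient.go:252 dialing "wss://ip-10-197-159-62.us-west-2.compute.internal:17070/"
2014-08-11 23:17:26 ERROR juju.worker runner.go:219 exited "api": cannot set password of machine 2: auth fails`
	for _, msg := range errorLines(log) {
		fmt.Println(msg)
	}
}
```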

Investigating further...

Menno Finlay-Smits (menno.smits) wrote :

This is the problem commit:

* commit 3b6da1d429bff627a636ca4512c1ae8230f26539
|\ Merge: 508b64f 3ea8fce
| | Author: Juju bot <email address hidden>
| | Date: Mon Aug 11 04:34:18 2014 +0100
| |
| | Merge pull request #414 from axw/state-remove-setmongopassword
| |
| | State remove setmongopassword
| |
| | We create a user in the admin database
| | with read/write privileges to all databases,
| | so there's no need to add users to each
| | other database. Consequently, we can remove
| | the mongo.SetMongoPassword function and
| | change state.Machine.SetMongoPassword to
| | just call mongo.SetAdminMongoPassword.
| |
| | Also removed unit.SetMongoPassword, which
| | isn't required anymore.
| |
| | Tested live:
| | - juju bootstrap && juju ensure-availability
| | - juju upgrade-juju

Menno Finlay-Smits (menno.smits) wrote :

The most efficient steps I have to reproduce:

git checkout 3b6da1d
go install ./...
juju bootstrap --upload-tools
# wait until bootstrap agent is started and stable
juju ensure-availability

Now wait. At some point "juju status" will stop working and the machine-1 and machine-2 agents are never able to connect to the API.

It's helpful to SSH to machine 1 or 2 as they come up (but before the API stops working) so that you can get to the logs on those machines.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → Andrew Wilkins (axwalk)
Menno Finlay-Smits (menno.smits) wrote :

@axw now taking over as the problem rev appears to be his.

Andrew Wilkins (axwalk) wrote :

I've reverted that change. Will close this and continue trying to see what went wrong.

Changed in juju-core:
status: In Progress → Fix Committed
Andrew Wilkins (axwalk) wrote :

I'm still getting the error without my change.

Andrew Wilkins (axwalk) wrote :

I think I hadn't uploaded the correct binary before, so ignore my last comment.

I've identified the problem in the old PR. I guess either I didn't test properly, or mucked something up when merging/rebasing and didn't test again. The problem was that we would do a Login just after setting the password (for another agent). That agent would come up and change its password, causing the other agent's password to become invalid.

I've got a new branch in the works which I've just tested, and have confirmed that ensure-availability works. Three state servers all "started".

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released