ensure-availability fails on GCE

Bug #1493058 reported by Nick Veitch
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Canonical Juju  Invalid    Critical    Tim Penhey
juju-core       Won't Fix  Low         Unassigned
1.24            Won't Fix  High        Unassigned
1.25            Won't Fix  Undecided   Unassigned

Bug Description

I seem to consistently get an error when bootstrapping an environment on GCE and then running:

    juju ensure-availability -n 3

The instances are created but report an error in juju status:

$ juju status -e gce2
environment: gce2
machines:
  "0":
    agent-state: started
    agent-version: 1.24.5
    dns-name: 104.197.32.115
    instance-id: juju-065a5323-a699-46c3-891b-3d0acb5ac2be-machine-0
    instance-state: RUNNING
    series: trusty
    hardware: arch=amd64 cpu-cores=1 cpu-power=138 mem=1700M root-disk=10240M availability-zone=us-central1-a
    state-server-member-status: has-vote
  "1":
    agent-state-info: 'sending new instance request: GCE operation "operation-1441633339906-51f286b1f17d3-e1d1d6cd-2dc343e7"
      failed'
    instance-id: pending
    series: trusty
    state-server-member-status: adding-vote
  "2":
    agent-state-info: 'sending new instance request: GCE operation "operation-1441633349468-51f286bb0ff60-73ba2dbc-9c78182b"
      failed'
    instance-id: pending
    series: trusty
    state-server-member-status: adding-vote
services: {}
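
For anyone scripting a check for this failure, here is a minimal Python sketch. It assumes the status output above has already been parsed into a dict with the same keys; the loading step is not shown, and the helper name `failed_machines` is hypothetical, not part of Juju.

```python
def failed_machines(status):
    """Return {machine_id: agent-state-info} for machines that did not start."""
    failures = {}
    for machine_id, info in status.get("machines", {}).items():
        # In the status output above, the failed machines carry an
        # "agent-state-info" message and have no "agent-state" of "started".
        message = info.get("agent-state-info")
        if message and info.get("agent-state") != "started":
            failures[machine_id] = message
    return failures


# Abbreviated copy of the status output above, written as a plain dict.
status = {
    "machines": {
        "0": {"agent-state": "started", "instance-state": "RUNNING"},
        "1": {
            "agent-state-info": 'sending new instance request: GCE operation '
            '"operation-1441633339906-51f286b1f17d3-e1d1d6cd-2dc343e7" failed',
            "instance-id": "pending",
        },
        "2": {
            "agent-state-info": 'sending new instance request: GCE operation '
            '"operation-1441633349468-51f286bb0ff60-73ba2dbc-9c78182b" failed',
            "instance-id": "pending",
        },
    },
    "services": {},
}

failures = failed_machines(status)
for machine_id, message in sorted(failures.items()):
    print(machine_id, message)
```

Machine 0 is skipped because it reports `agent-state: started`; only the two pending state servers are returned.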

Connecting to one of the failed instances and retrieving the log, I get:

2015-09-07 13:45:57 INFO juju.cmd supercommand.go:37 running jujud [1.24.5-trusty-amd64 gc]
2015-09-07 13:45:57 DEBUG juju.agent agent.go:432 read agent config, format "1.18"
2015-09-07 13:45:57 INFO juju.cmd.jujud machine.go:419 machine agent machine-1 start (1.24.5-trusty-amd64 [gc])
2015-09-07 13:45:57 DEBUG juju.wrench wrench.go:112 couldn't read wrench directory: stat /var/lib/juju/wrench: no such file or directory
2015-09-07 13:45:57 INFO juju.cmd.jujud upgrade.go:87 no upgrade steps required or upgrade steps for 1.24.5 have already been run.
2015-09-07 13:45:57 INFO juju.network network.go:194 setting prefer-ipv6 to false
2015-09-07 13:45:57 INFO juju.worker runner.go:269 start "api"
2015-09-07 13:45:57 INFO juju.worker runner.go:269 start "statestarter"
2015-09-07 13:45:57 INFO juju.worker runner.go:269 start "termination"
2015-09-07 13:45:57 INFO juju.api apiclient.go:331 dialing "wss://10.240.221.164:17070/environment/065a5323-a699-46c3-891b-3d0acb5ac2be/api"
2015-09-07 13:45:57 DEBUG juju.worker runner.go:196 "statestarter" started
2015-09-07 13:45:57 DEBUG juju.worker runner.go:196 "termination" started
2015-09-07 13:45:57 DEBUG juju.worker runner.go:191 stop "state"
2015-09-07 13:45:57 INFO juju.api apiclient.go:263 connection established to "wss://10.240.221.164:17070/environment/065a5323-a699-46c3-891b-3d0acb5ac2be/api"
2015-09-07 13:45:57 INFO juju.api apiclient.go:331 dialing "wss://10.240.221.164:17070/environment/065a5323-a699-46c3-891b-3d0acb5ac2be/api"
2015-09-07 13:45:57 INFO juju.api apiclient.go:263 connection established to "wss://10.240.221.164:17070/environment/065a5323-a699-46c3-891b-3d0acb5ac2be/api"
2015-09-07 13:45:57 ERROR juju.cmd.jujud agent.go:298 agent terminating due to error returned during API open: invalid entity name or password
2015-09-07 13:45:57 INFO juju.worker runner.go:275 stopped "api", err: agent should be terminated
2015-09-07 13:45:57 DEBUG juju.worker runner.go:203 "api" done: agent should be terminated
2015-09-07 13:45:57 ERROR juju.worker runner.go:212 fatal "api": agent should be terminated
2015-09-07 13:45:57 DEBUG juju.worker runner.go:248 killing "statestarter"
2015-09-07 13:45:57 DEBUG juju.worker runner.go:248 killing "termination"
2015-09-07 13:45:57 INFO juju.worker runner.go:275 stopped "statestarter", err: <nil>
2015-09-07 13:45:57 INFO juju.worker runner.go:275 stopped "termination", err: <nil>
2015-09-07 13:45:57 DEBUG juju.worker runner.go:203 "statestarter" done: <nil>
2015-09-07 13:45:57 DEBUG juju.worker runner.go:227 no restart, removing "statestarter" from known workers
2015-09-07 13:45:57 DEBUG juju.worker runner.go:203 "termination" done: <nil>
2015-09-07 13:45:57 DEBUG juju.worker runner.go:227 no restart, removing "termination" from known workers
2015-09-07 13:45:57 DEBUG juju.service discovery.go:115 discovered init system "upstart" from local host
2015-09-07 13:45:57 DEBUG juju.service discovery.go:115 discovered init system "upstart" from local host
2015-09-07 13:45:57 INFO juju.cmd supercommand.go:436 command

Martin Packman (gz) wrote :

Is this just on HA, or does deploying any further machines fail? See bug 1438200 for a similar error without HA involved.

Changed in juju-core:
status: New → Incomplete
tags: added: gce-provider ha
Nick Veitch (evilnick) wrote :

Hmm, the error in #1438200 looks very similar. However, I managed to successfully deploy services after the HA instances failed, so I guess this is different.

Nick Veitch (evilnick) wrote :

If it helps any, I have retrieved the following from the GCE console log:

 Create an instance
juju-3606c42b-85f7-4da8-8e43-68ec3316ff0f-machine-2
<email address hidden>
Start: Sep 7, 2015, 3:51:12 PM
End: Sep 7, 2015, 3:51:12 PM
RESOURCE_ALREADY_EXISTS

(and similar)

tags: added: docteam
Curtis Hovey (sinzui)
tags: added: jujuqa
Curtis Hovey (sinzui) wrote :

Juju 2.0-beta3 cannot bring up machines in HA.

Changed in juju-core:
status: Incomplete → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Undecided → High
affects: juju-core → juju
Changed in juju-core:
status: New → Incomplete
Changed in juju:
milestone: none → 2.1.0
Changed in juju-core:
status: Incomplete → Invalid
Changed in juju:
importance: High → Critical
milestone: 2.1.0 → 2.0-rc1
Richard Harding (rharding) wrote :

Tested this today:

bootstrap to gce
enable-ha
wait for it to settle
manually turn off machine 0
juju status hangs and doesn't function

Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper) wrote :

Tested with tip of master, right now, and it is working.

test-gce-ha:
  details:
    uuid: dfacd75e-bb5d-445b-8726-8424d57ee386
    api-endpoints: ['104.196.179.76:17070', '104.196.194.140:17070', '104.196.205.112:17070',
      '10.240.0.3:17070', '10.240.0.4:17070', '10.240.0.6:17070']
    ca-cert: <snippity>
    cloud: google
    region: us-east1
    agent-version: 2.0-rc1.1
  controller-machines:
    "0":
      instance-id: juju-4e4ff0-0
      ha-status: down, lost connection
    "1":
      instance-id: juju-4e4ff0-1
      ha-status: ha-enabled
    "2":
      instance-id: juju-4e4ff0-2
      ha-status: ha-enabled
  models:
    controller:
      uuid: fc76c84a-3164-4c60-8572-083f4c4e4ff0
      machine-count: 3
      core-count: 3
    default:
      uuid: b151154d-cbfb-4a3c-86c1-3c94b2ff7216
  current-model: admin@local/controller
  account:
    user: admin@local
    access: superuser

Tim Penhey (thumper) wrote :

Ran juju status with --debug.

Watched it try to connect to all six published endpoints in parallel. It successfully connected to machine-2 first and returned status.

$ juju status --debug
14:16:51 INFO juju.cmd supercommand.go:63 running juju [2.0-rc1 gc go1.6.2]
14:16:51 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "status", "--debug"}
14:16:51 INFO juju.juju api.go:72 connecting to API addresses: [104.196.179.76:17070 104.196.194.140:17070 104.196.205.112:17070 10.240.0.3:17070 10.240.0.4:17070 10.240.0.6:17070]
14:16:51 INFO juju.api apiclient.go:507 dialing "wss://104.196.179.76:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:51 INFO juju.api apiclient.go:507 dialing "wss://104.196.194.140:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:51 INFO juju.api apiclient.go:507 dialing "wss://104.196.205.112:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:52 INFO juju.api apiclient.go:507 dialing "wss://10.240.0.3:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:52 INFO juju.api apiclient.go:507 dialing "wss://10.240.0.4:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:52 INFO juju.api apiclient.go:507 dialing "wss://10.240.0.6:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:53 INFO juju.api apiclient.go:302 connection established to "wss://104.196.179.76:17070/model/fc76c84a-3164-4c60-8572-083f4c4e4ff0/api"
14:16:53 DEBUG juju.juju api.go:263 API hostnames unchanged - not resolving
MODEL       CONTROLLER   CLOUD/REGION     VERSION
controller  test-gce-ha  google/us-east1  2.0-rc1.1

APP  VERSION  STATUS  SCALE  CHARM  STORE  REV  OS  NOTES

UNIT  WORKLOAD  AGENT  MACHINE  PUBLIC-ADDRESS  PORTS  MESSAGE

MACHINE  STATE    DNS              INS-ID         SERIES  AZ
0        down     104.196.205.112  juju-4e4ff0-0  xenial  us-east1-b
1        started  104.196.194.140  juju-4e4ff0-1  xenial  us-east1-c
2        started  104.196.179.76   juju-4e4ff0-2  xenial  us-east1-d

14:16:53 DEBUG juju.api apiclient.go:558 health ping failed: connection is shut down
14:16:53 INFO cmd supercommand.go:465 command finished
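
The behaviour in the debug log above (dial every published endpoint at once and use whichever answers first) can be sketched roughly as follows. This is an illustrative Python sketch of the pattern only, not Juju's actual Go implementation; `dial_first` and `fake_dial` are hypothetical names, and the fake dialer simply pretends that machine 0's public address is unreachable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def dial_first(endpoints, dial):
    """Dial all endpoints in parallel; return (endpoint, connection) for the first success."""
    if not endpoints:
        raise ValueError("no endpoints to dial")
    pool = ThreadPoolExecutor(max_workers=len(endpoints))
    futures = {pool.submit(dial, ep): ep for ep in endpoints}
    errors = {}
    try:
        # as_completed yields futures in the order they finish, so the
        # first one that finishes without an exception wins.
        for future in as_completed(futures):
            endpoint = futures[future]
            try:
                return endpoint, future.result()
            except Exception as exc:
                errors[endpoint] = exc
        raise ConnectionError(f"all endpoints failed: {errors}")
    finally:
        # Do not wait for slower dials once a winner has been found.
        pool.shutdown(wait=False, cancel_futures=True)


# Hypothetical stand-in for a real dialer: response delays are made up.
DELAYS = {"104.196.179.76:17070": 0.01, "104.196.194.140:17070": 0.2}


def fake_dial(endpoint):
    if endpoint not in DELAYS:
        raise ConnectionError(f"{endpoint} is down")
    time.sleep(DELAYS[endpoint])
    return f"connected to {endpoint}"


ENDPOINTS = ["104.196.205.112:17070", "104.196.194.140:17070", "104.196.179.76:17070"]
winner, connection = dial_first(ENDPOINTS, fake_dial)
print(winner, connection)
```

With one controller down, the client still gets a connection as long as any other endpoint answers, which matches the status output returned above.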

Changed in juju:
status: In Progress → Invalid
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-rc1 → none
Curtis Hovey (sinzui) wrote :

Using juju 1.25 I consistently see an error bringing up the 3rd machine:

ending new instance request: sending new instance request:
      googleapi: Error 409: The resource ''projects/gothic-list-89514/zones/us-central1-b/instances/juju-a334ad9b-103c-48ff-8321-15bd272db86b-machine-3
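
The `Error 409` in this message is an HTTP status code (409 Conflict), which looks consistent with the RESOURCE_ALREADY_EXISTS entry from the GCE console log earlier in this report. A minimal Python sketch for pulling the code out of such a message, assuming the `googleapi: Error NNN:` prefix shown above; the regex and the helper name `googleapi_error_code` are illustrative and not part of Juju or the Google API client.

```python
import re
from http import HTTPStatus


def googleapi_error_code(message):
    """Return the HTTP status code embedded in a googleapi error string, or None."""
    match = re.search(r"googleapi: Error (\d{3}):", message)
    return int(match.group(1)) if match else None


# Only the prefix of the message above is used; the rest is truncated in the report.
message = "sending new instance request: googleapi: Error 409: The resource"
code = googleapi_error_code(message)
phrase = HTTPStatus(code).phrase
print(code, phrase)
```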

Changed in juju-core:
status: Invalid → Triaged
importance: Undecided → Low
tags: added: enable-ha
Anastasia (anastasia-macmood) wrote :

Marking this bug as a 'won't fix' for Juju 1.x as it is not a critical bug.

Changed in juju-core:
status: Triaged → Won't Fix