manual provider add-machine fails immediately after bootstrap

Bug #1538583 reported by Curtis Hovey on 2016-01-27
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
juju
Medium
Unassigned
juju-core
High
Unassigned
1.25
High
Unassigned

Bug Description

As seen in
    http://reports.vapour.ws/releases/issue/56a8d7ec749a564e013cb5df

CI saw errors like this in several manual provider tests. The wily, trusty and precise jobs failed when provisioning in ec2.

2016-01-26 20:31:22 ERROR cmd supercommand.go:448 connection is shut down
2016-01-26 20:31:22 ERROR Command '('juju', '--show-log', 'add-machine', '-m', 'manual-deploy-wily-amd64', 'ssh:ec2-54-236-226-51.compute-1.amazonaws.com')' returned non-zero exit status 1

The trusty ppc4el job passed though.

The offending behaviour is now in master and 1.25. 1.25.4 cannot be released.

Curtis Hovey (sinzui) on 2016-01-27
Changed in juju-core:
status: New → Incomplete
summary: - manual proovider add-machine failed in api-command-rename
+ manual provider add-machine failed in api-command-rename
tags: added: regression

I see that the precise jobs all failed because of this error: /juju-qa.jujuci: No such file or directory

The wily job failed due to a "connection is shutdown" error , which makes me wonder if it's related to bug 1517632 (the outdated distro info bug)

Curtis Hovey (sinzui) wrote :

The outcome analyser is wrongly selecting a recoverable issue from the actual error in the job. I have fixed the job so future runs wont report that issue. The data we have do show the same failure
    http://reports.vapour.ws/releases/3544/job/manual-deploy-precise-amd64/attempt/3024

Changed in juju-core:
status: Incomplete → Triaged
no longer affects: juju-core/api-command-rename
summary: - manual provider add-machine failed in api-command-rename
+ manual provider add-machine fails immediately after bootstrap
Changed in juju-core:
importance: Undecided → High
importance: High → Critical
milestone: none → 2.0-alpha2
Cheryl Jennings (cherylj) wrote :

This has been seen on more than api-controller-rename.

If you try to add-machine immediately after the bootstrap of the manual provider, you will get this connection is shut down error. I image it would happen with any command.

Running the command again will succeed.

Initial investigation shows that the mongo connection drops right after bootstrap completes. I see these messages in /var/log/syslog at the time that the connection drops: http://paste.ubuntu.com/14698112/

Seems to happen at the same time as this line in machine-0.log:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set

Which is followed by:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:179 got EOF while running Set(), calling session.Refresh()

I see the same messages in machine-0.log and /var/log/syslog when I bootstrap my MAAS env, but the timing is different such that running `juju bootstrap && juju add-machine` will succeed.

Cheryl Jennings (cherylj) wrote :

The replicaset is changed as a result of adding a new IP to the API host ports:

2016-01-29 19:33:24 DEBUG juju.state address.go:137 setting API hostPorts: [[52.35.31.97:17070 172.31.22.1:17070 127.0.0.1:17070 [::1]:17070]]
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:39 members: map[*peergrouper.machine]*replicaset.Member{&peergrouper.machine{id: "0", wantsVote: true, hostPort: "172.31.22.1:37017"}
:(*replicaset.Member)(0xc820625040)}
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:40 extra: []replicaset.Member(nil)
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:41 maxId: 1
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:110 assessing possible peer group changes:
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:116 machine "0" is already voting
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:134 assessed
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper worker.go:282 desired peer group members: []replicaset.Member{replicaset.Member{Id:1, Address:"172.31.22.1:37017", Arbiter:(*bool)(nil), BuildIn
dexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string{"juju-machine-id":"0"}, SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set
from {Name: juju, Version: 1, Members: {Member{1 "52.35.31.97:37017" map[juju-machine-id:0]}}}
  to {Name: juju, Version: 2, Members: {Member{1 "172.31.22.1:37017" map[juju-machine-id:0]}}}

Cheryl Jennings (cherylj) wrote :
Cheryl Jennings (cherylj) wrote :
Cheryl Jennings (cherylj) wrote :

I see this on master after the merge of api-command-rename.

Steps for a manual recreate:

1 - Provision two machines in AWS. Make sure that the security group opens ports 17017 and 37017.
2 - Edit config for manual provider to point to one of the machines
3 - Bootstrap and add-machine with the same command (using a juju built from master)

juju bootstrap --upload-tools --debug && juju add-machine ssh:<IP of second machine>

You will see the "connection is shut down" error.

Curtis Hovey (sinzui) wrote :

This bug is now in 1.25. Since we know the last good commit The suspect commits are:
    https://github.com/juju/juju/commit/e0fc2ca179881c658dc931e15e848707901bd97f
    https://github.com/juju/juju/commit/9f5ca7ef8d0bc60f438e9f44813e1b9701a8a654
    https://github.com/juju/juju/commit/76ca2e226388be4afbd5f40f66ab86e7fafc21de

As none of the commts specifically changes behaviour, we believe one of them changes timing and causes the behaviour change.

description: updated
Dave Cheney (dave-cheney) wrote :

I was able to reproduce a slightly different version of the error with the manual provider

2016-02-04 05:30:47 DEBUG juju.api api.go:505 API addresses unchanged
2016-02-04 05:30:47 DEBUG juju.api api.go:514 cacheChangedAPIInfo: serverUUID="b2cce9ba-0cf1-4c32-8e7c-41195b0ece57"
2016-02-04 05:30:47 DEBUG juju.api.block client.go:25
STORAGE FRONT-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api.block client.go:26
STORAGE BACK-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api apiclient.go:510 health ping failed: connection is shut down
2016-02-04 05:30:47 INFO cmd cmd.go:129 Bootstrap complete
2016-02-04 05:30:47 INFO cmd supercommand.go:454 command finished

I'm trying to apply the hammer of enlightenment to get more details about what happened on line 510

Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
Dave Cheney (dave-cheney) wrote :

I'm having a lot of problems reproducing the problem. The line quoted above, juju.cmd supercommand.go +429 makes no sense, that file is only 39 lines long -- i've checked 1.25 and master.

Dave Cheney (dave-cheney) wrote :

OMG. This is supercommand.go from github.com/juju/cmd/supercommand.go, NOT, github.com/juju/juju/cmd/supercommand.go -- seriously, we have to stop doing this shit.

Curtis Hovey (sinzui) on 2016-02-11
Changed in juju-core:
milestone: 2.0-alpha2 → 2.0-beta1
Curtis Hovey (sinzui) on 2016-02-19
Changed in juju-core:
milestone: 2.0-beta1 → 2.0-beta2
Curtis Hovey (sinzui) on 2016-02-29
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui) on 2016-03-10
Changed in juju-core:
milestone: 2.0-beta2 → 2.0-beta3
Curtis Hovey (sinzui) on 2016-03-25
Changed in juju-core:
milestone: 2.0-beta3 → 2.0-beta4
Changed in juju-core:
milestone: 2.0-beta4 → 2.0.1
Curtis Hovey (sinzui) on 2016-07-22
Changed in juju-core:
assignee: Dave Cheney (dave-cheney) → nobody
affects: juju-core → juju
Changed in juju:
milestone: 2.0.1 → none
milestone: none → 2.0.1
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Curtis Hovey (sinzui) on 2016-08-23
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui) on 2016-10-28
Changed in juju:
milestone: 2.0.1 → none
Curtis Hovey (sinzui) on 2017-02-14
Changed in juju:
importance: High → Medium
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers