manual provider add-machine fails immediately after bootstrap

Bug #1538583 reported by Curtis Hovey
This bug affects 3 people
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Expired  Medium      Unassigned
juju-core       Invalid  High        Unassigned
1.25            Invalid  High        Unassigned

Bug Description

As seen in
    http://reports.vapour.ws/releases/issue/56a8d7ec749a564e013cb5df

CI saw errors like this in several manual provider tests. The wily, trusty, and precise jobs failed when provisioning in EC2.

2016-01-26 20:31:22 ERROR cmd supercommand.go:448 connection is shut down
2016-01-26 20:31:22 ERROR Command '('juju', '--show-log', 'add-machine', '-m', 'manual-deploy-wily-amd64', 'ssh:ec2-54-236-226-51.compute-1.amazonaws.com')' returned non-zero exit status 1

The trusty ppc64el job passed, though.

The offending behaviour is now in master and 1.25. 1.25.4 cannot be released.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Incomplete
summary: - manual proovider add-machine failed in api-command-rename
+ manual provider add-machine failed in api-command-rename
tags: added: regression
Revision history for this message
Cheryl Jennings (cherylj) wrote : Re: manual provider add-machine failed in api-command-rename

I see that the precise jobs all failed because of this error: /juju-qa.jujuci: No such file or directory

The wily job failed due to a "connection is shut down" error, which makes me wonder if it's related to bug 1517632 (the outdated distro info bug).

Revision history for this message
Curtis Hovey (sinzui) wrote :

The outcome analyser is wrongly selecting a recoverable issue instead of the actual error in the job. I have fixed the job so future runs won't report that issue. The data we have does show the same failure:
    http://reports.vapour.ws/releases/3544/job/manual-deploy-precise-amd64/attempt/3024

Changed in juju-core:
status: Incomplete → Triaged
no longer affects: juju-core/api-command-rename
summary: - manual provider add-machine failed in api-command-rename
+ manual provider add-machine fails immediately after bootstrap
Changed in juju-core:
importance: Undecided → High
importance: High → Critical
milestone: none → 2.0-alpha2
Revision history for this message
Cheryl Jennings (cherylj) wrote :

This has been seen on more than api-command-rename.

If you try to add-machine immediately after bootstrapping the manual provider, you will get this "connection is shut down" error. I imagine it would happen with any command.

Running the command again will succeed.

Initial investigation shows that the mongo connection drops right after bootstrap completes. I see these messages in /var/log/syslog at the time that the connection drops: http://paste.ubuntu.com/14698112/

Seems to happen at the same time as this line in machine-0.log:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set

Which is followed by:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:179 got EOF while running Set(), calling session.Refresh()

I see the same messages in machine-0.log and /var/log/syslog when I bootstrap my MAAS env, but the timing is different such that running `juju bootstrap && juju add-machine` will succeed.
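For anyone re-investigating, a quick way to line up the two logs is to pull the replicaset messages out of machine-0.log and compare their timestamps with the client-side "connection is shut down" error. This is only a sketch: the helper name is made up, and the default log path is an assumption about a typical controller.

```shell
# show_replicaset_events: list replica set reconfiguration events from a
# machine log so their timestamps can be compared against the client-side
# failure. Hypothetical helper; the default log path is an assumption.
show_replicaset_events() {
    grep 'juju.replicaset' "${1:-/var/log/juju/machine-0.log}"
}

# Usage: show_replicaset_events ./machine-0.log
```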

Revision history for this message
Cheryl Jennings (cherylj) wrote :

The replicaset is changed as a result of adding a new IP to the API host ports:

2016-01-29 19:33:24 DEBUG juju.state address.go:137 setting API hostPorts: [[52.35.31.97:17070 172.31.22.1:17070 127.0.0.1:17070 [::1]:17070]]
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:39 members: map[*peergrouper.machine]*replicaset.Member{&peergrouper.machine{id: "0", wantsVote: true, hostPort: "172.31.22.1:37017"}:(*replicaset.Member)(0xc820625040)}
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:40 extra: []replicaset.Member(nil)
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:41 maxId: 1
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:110 assessing possible peer group changes:
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:116 machine "0" is already voting
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:134 assessed
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper worker.go:282 desired peer group members: []replicaset.Member{replicaset.Member{Id:1, Address:"172.31.22.1:37017", Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string{"juju-machine-id":"0"}, SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set
from {Name: juju, Version: 1, Members: {Member{1 "52.35.31.97:37017" map[juju-machine-id:0]}}}
  to {Name: juju, Version: 2, Members: {Member{1 "172.31.22.1:37017" map[juju-machine-id:0]}}}

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I see this on master after the merge of api-command-rename.

Steps for a manual recreate:

1 - Provision two machines in AWS. Make sure that the security group opens ports 17070 and 37017.
2 - Edit the config for the manual provider to point to one of the machines.
3 - Bootstrap and add-machine with the same command (using a juju built from master):

juju bootstrap --upload-tools --debug && juju add-machine ssh:<IP of second machine>

You will see the "connection is shut down" error.
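Given the observation above that a second attempt succeeds, a minimal workaround while the race is being chased down is to retry the first post-bootstrap command. This is only a sketch, not anything the juju client does itself; the attempt count and sleep interval are arbitrary assumptions.

```shell
# retry: run a command, retrying up to 3 times with a short pause.
# Workaround sketch for the post-bootstrap "connection is shut down"
# race; the attempt count and sleep interval are assumptions.
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge 3 ] && return 1
        sleep 1
    done
}

# Usage (second machine's address as in the CI failure above):
#   juju bootstrap --upload-tools --debug
#   retry juju add-machine ssh:ec2-54-236-226-51.compute-1.amazonaws.com
```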

Revision history for this message
Curtis Hovey (sinzui) wrote :

This bug is now in 1.25. Since we know the last good commit, the suspect commits are:
    https://github.com/juju/juju/commit/e0fc2ca179881c658dc931e15e848707901bd97f
    https://github.com/juju/juju/commit/9f5ca7ef8d0bc60f438e9f44813e1b9701a8a654
    https://github.com/juju/juju/commit/76ca2e226388be4afbd5f40f66ab86e7fafc21de

As none of the commits specifically changes behaviour, we believe one of them changes timing and so causes the behaviour change.

description: updated
Revision history for this message
Dave Cheney (dave-cheney) wrote :

I was able to reproduce a slightly different version of the error with the manual provider:

2016-02-04 05:30:47 DEBUG juju.api api.go:505 API addresses unchanged
2016-02-04 05:30:47 DEBUG juju.api api.go:514 cacheChangedAPIInfo: serverUUID="b2cce9ba-0cf1-4c32-8e7c-41195b0ece57"
2016-02-04 05:30:47 DEBUG juju.api.block client.go:25
STORAGE FRONT-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api.block client.go:26
STORAGE BACK-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api apiclient.go:510 health ping failed: connection is shut down
2016-02-04 05:30:47 INFO cmd cmd.go:129 Bootstrap complete
2016-02-04 05:30:47 INFO cmd supercommand.go:454 command finished

I'm trying to apply the hammer of enlightenment to get more details about what happened on line 510.

Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
Revision history for this message
Dave Cheney (dave-cheney) wrote :

I'm having a lot of trouble reproducing the problem. The line quoted above, juju.cmd supercommand.go +429, makes no sense; that file is only 39 lines long -- I've checked 1.25 and master.

Revision history for this message
Dave Cheney (dave-cheney) wrote :

OMG. This is supercommand.go from github.com/juju/cmd/supercommand.go, NOT github.com/juju/juju/cmd/supercommand.go -- seriously, we have to stop doing this shit.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha2 → 2.0-beta1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta1 → 2.0-beta2
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta2 → 2.0-beta3
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta3 → 2.0-beta4
Changed in juju-core:
milestone: 2.0-beta4 → 2.0.1
Curtis Hovey (sinzui)
Changed in juju-core:
assignee: Dave Cheney (dave-cheney) → nobody
affects: juju-core → juju
Changed in juju:
milestone: 2.0.1 → none
milestone: none → 2.0.1
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Curtis Hovey (sinzui)
Changed in juju:
importance: High → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 5 years, so we're marking it Expired. If you believe this is incorrect, please update the status.

Changed in juju:
status: Triaged → Expired
tags: added: expirebugs-bot