manual provider add-machine fails immediately after bootstrap

Bug #1538583 reported by Curtis Hovey
This bug affects 3 people
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Expired  Medium      Unassigned
juju-core       Invalid  High        Unassigned
1.25            Invalid  High        Unassigned

Bug Description

As seen in
    http://reports.vapour.ws/releases/issue/56a8d7ec749a564e013cb5df

CI saw errors like this in several manual provider tests. The wily, trusty, and precise jobs failed when provisioning in EC2.

2016-01-26 20:31:22 ERROR cmd supercommand.go:448 connection is shut down
2016-01-26 20:31:22 ERROR Command '('juju', '--show-log', 'add-machine', '-m', 'manual-deploy-wily-amd64', 'ssh:ec2-54-236-226-51.compute-1.amazonaws.com')' returned non-zero exit status 1

The trusty ppc64el job passed, though.

The offending behaviour is now in master and 1.25. 1.25.4 cannot be released.

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Incomplete
summary: - manual proovider add-machine failed in api-command-rename
+ manual provider add-machine failed in api-command-rename
tags: added: regression
Revision history for this message
Cheryl Jennings (cherylj) wrote : Re: manual provider add-machine failed in api-command-rename

I see that the precise jobs all failed because of this error: /juju-qa.jujuci: No such file or directory

The wily job failed due to a "connection is shut down" error, which makes me wonder if it's related to bug 1517632 (the outdated distro info bug).

Revision history for this message
Curtis Hovey (sinzui) wrote :

The outcome analyser is wrongly selecting a recoverable issue instead of the actual error in the job. I have fixed the job so future runs won't report that issue. The data we have does show the same failure:
    http://reports.vapour.ws/releases/3544/job/manual-deploy-precise-amd64/attempt/3024

Changed in juju-core:
status: Incomplete → Triaged
no longer affects: juju-core/api-command-rename
summary: - manual provider add-machine failed in api-command-rename
+ manual provider add-machine fails immediately after bootstrap
Changed in juju-core:
importance: Undecided → High
importance: High → Critical
milestone: none → 2.0-alpha2
Revision history for this message
Cheryl Jennings (cherylj) wrote :

This has been seen on more than api-command-rename.

If you try to add-machine immediately after bootstrapping the manual provider, you will get this "connection is shut down" error. I imagine it would happen with any command.

Running the command again will succeed.

Initial investigation shows that the mongo connection drops right after bootstrap completes. I see these messages in /var/log/syslog at the time that the connection drops: http://paste.ubuntu.com/14698112/

Seems to happen at the same time as this line in machine-0.log:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set

Which is followed by:
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:179 got EOF while running Set(), calling session.Refresh()

I see the same messages in machine-0.log and /var/log/syslog when I bootstrap my MAAS env, but the timing is different such that running `juju bootstrap && juju add-machine` will succeed.
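For anyone re-investigating, a quick way to line up the two logs is to pull the replicaset messages out of machine-0.log and compare their timestamps with the client-side "connection is shut down" error. This is only a sketch: the helper name is made up, and the default log path is an assumption about a typical controller.

```shell
# show_replicaset_events: list replica set reconfiguration events from a
# machine log so their timestamps can be compared against the client-side
# failure. Hypothetical helper; the default log path is an assumption.
show_replicaset_events() {
    grep 'juju.replicaset' "${1:-/var/log/juju/machine-0.log}"
}

# Usage: show_replicaset_events ./machine-0.log
```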

Revision history for this message
Cheryl Jennings (cherylj) wrote :

The replicaset is changed as a result of adding a new IP to the API host ports:

2016-01-29 19:33:24 DEBUG juju.state address.go:137 setting API hostPorts: [[52.35.31.97:17070 172.31.22.1:17070 127.0.0.1:17070 [::1]:17070]]
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:38 calculating desired peer group
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:39 members: map[*peergrouper.machine]*replicaset.Member{&peergrouper.machine{id: "0", wantsVote: true, hostPort: "172.31.22.1:37017"}:(*replicaset.Member)(0xc820625040)}
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:40 extra: []replicaset.Member(nil)
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:41 maxId: 1
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:110 assessing possible peer group changes:
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:116 machine "0" is already voting
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper desired.go:134 assessed
2016-01-29 19:33:24 DEBUG juju.worker.peergrouper worker.go:282 desired peer group members: []replicaset.Member{replicaset.Member{Id:1, Address:"172.31.22.1:37017", Arbiter:(*bool)(nil), BuildIndexes:(*bool)(nil), Hidden:(*bool)(nil), Priority:(*float64)(nil), Tags:map[string]string{"juju-machine-id":"0"}, SlaveDelay:(*time.Duration)(nil), Votes:(*int)(nil)}}
2016-01-29 19:33:24 DEBUG juju.replicaset replicaset.go:168 Set() changing replica set
from {Name: juju, Version: 1, Members: {Member{1 "52.35.31.97:37017" map[juju-machine-id:0]}}}
  to {Name: juju, Version: 2, Members: {Member{1 "172.31.22.1:37017" map[juju-machine-id:0]}}}

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I see this on master after the merge of api-command-rename.

Steps for a manual recreate:

1 - Provision two machines in AWS. Make sure that the security group opens ports 17070 and 37017.
2 - Edit the config for the manual provider to point to one of the machines.
3 - Bootstrap and add-machine with the same command (using a juju built from master):

juju bootstrap --upload-tools --debug && juju add-machine ssh:<IP of second machine>

You will see the "connection is shut down" error.
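Given the observation above that a second attempt succeeds, a minimal workaround while the race is being chased down is to retry the first post-bootstrap command. This is only a sketch, not anything the juju client does itself; the attempt count and sleep interval are arbitrary assumptions.

```shell
# retry: run a command, retrying up to 3 times with a short pause.
# Workaround sketch for the post-bootstrap "connection is shut down"
# race; the attempt count and sleep interval are assumptions.
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge 3 ] && return 1
        sleep 1
    done
}

# Usage (second machine's address as in the CI failure above):
#   juju bootstrap --upload-tools --debug
#   retry juju add-machine ssh:ec2-54-236-226-51.compute-1.amazonaws.com
```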

Revision history for this message
Curtis Hovey (sinzui) wrote :

This bug is now in 1.25. Since we know the last good commit, the suspect commits are:
    https://github.com/juju/juju/commit/e0fc2ca179881c658dc931e15e848707901bd97f
    https://github.com/juju/juju/commit/9f5ca7ef8d0bc60f438e9f44813e1b9701a8a654
    https://github.com/juju/juju/commit/76ca2e226388be4afbd5f40f66ab86e7fafc21de

As none of the commits specifically changes behaviour, we believe one of them changes timing and so causes the behaviour change.

description: updated
Revision history for this message
Dave Cheney (dave-cheney) wrote :

I was able to reproduce a slightly different version of the error with the manual provider:

2016-02-04 05:30:47 DEBUG juju.api api.go:505 API addresses unchanged
2016-02-04 05:30:47 DEBUG juju.api api.go:514 cacheChangedAPIInfo: serverUUID="b2cce9ba-0cf1-4c32-8e7c-41195b0ece57"
2016-02-04 05:30:47 DEBUG juju.api.block client.go:25
STORAGE FRONT-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api.block client.go:26
STORAGE BACK-END: base.clientFacade{facadeCaller:base.facadeCaller{facadeName:"Block", bestVersion:2, caller:(*api.state)(0xc820174500)}, closer:(*api.state)(0xc820174500)}
2016-02-04 05:30:47 DEBUG juju.api apiclient.go:510 health ping failed: connection is shut down
2016-02-04 05:30:47 INFO cmd cmd.go:129 Bootstrap complete
2016-02-04 05:30:47 INFO cmd supercommand.go:454 command finished

I'm trying to apply the hammer of enlightenment to get more details about what happened on line 510.

Changed in juju-core:
assignee: nobody → Dave Cheney (dave-cheney)
Revision history for this message
Dave Cheney (dave-cheney) wrote :

I'm having a lot of trouble reproducing the problem. The line quoted above, juju.cmd supercommand.go +429, makes no sense; that file is only 39 lines long -- I've checked 1.25 and master.

Revision history for this message
Dave Cheney (dave-cheney) wrote :

OMG. This is supercommand.go from github.com/juju/cmd/supercommand.go, NOT github.com/juju/juju/cmd/supercommand.go -- seriously, we have to stop doing this shit.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha2 → 2.0-beta1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta1 → 2.0-beta2
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta2 → 2.0-beta3
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta3 → 2.0-beta4
Changed in juju-core:
milestone: 2.0-beta4 → 2.0.1
Curtis Hovey (sinzui)
Changed in juju-core:
assignee: Dave Cheney (dave-cheney) → nobody
affects: juju-core → juju
Changed in juju:
milestone: 2.0.1 → none
milestone: none → 2.0.1
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
Curtis Hovey (sinzui)
Changed in juju:
importance: High → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 5 years, so we're marking it Expired. If you believe this is incorrect, please update the status.

Changed in juju:
status: Triaged → Expired
tags: added: expirebugs-bot