Bug #1347715 “Manual provider does not respond after bootstrap” : Bugs : juju-core

Revision history for this message

Curtis Hovey (sinzui) wrote on 2014-07-23:

#1

all-machines..log from manual-deploy-precise-amd64 Edit (6.3 KiB, application/octet-stream)

Revision history for this message

Curtis Hovey (sinzui) wrote on 2014-07-25:

#2

all-machines..log from manual-deploy-precise-amd64 Edit (5.0 KiB, application/octet-stream)

Both the ec2 and ppc machines have network to retrieve tools. The tests pass when we test 1.20.2
machine-0: 2014-07-25 20:05:08 ERROR juju.worker runner.go:219 exited "api": unable to connect to "wss://localhost:17070/"

The master branch versions have not passed in 4 days since commit 2285c3e3. With commit e2e54ff5 we see failures.

Revision history for this message

Curtis Hovey (sinzui) wrote on 2014-07-25:

#3

The regression was introduced in one of these commits.
    f7ab0772a7bb499ec1978d0a61510113ac0d07fe Merge pull request #348 from klyachin/130-cloudinit-networker
    ba01d526fe92ff67d3d3839af691d0c464b83d87 Merge pull request #342 from axw/ensurenotbootstrapped-stateserverinstances
    e2e54ff5d9a2574c1c8106dafd1740825a616c0a
Merge pull request #328 from gabriel-samfira/add-os-field

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-07-28:

#4

Looking into it. I can manually bootstrap an LXC container locally without any issues.

Changed in juju-core:
status:	Triaged → In Progress

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-07-30:

#5

Still trying to verify, but it looks a bit like the client is not setting use-sshstorage=true in the .jenv file. This would mean that the client tries to use HTTP storage. It appears that the security group does not permit port 8040, the HTTP storage port, which would explain why the client cannot get through that way.

Andrew Wilkins (axwalk) on 2014-07-30

Changed in juju-core:
assignee:	nobody → Andrew Wilkins (axwalk)

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-07-30:

#6

So I made the change, and it's working its way through the queue... but now, after further investigation, I'm fairly convinced that it will not fix the problem. I'm waiting for the test to start again so I can get on the machine and gather some more data.

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-07-30:

#7

Okay, I've found the culprit. ssh known_hosts warnings. sigh

Andrew Wilkins (axwalk) on 2014-07-31

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2014-07-31

Changed in juju-core:
status:	Fix Committed → In Progress

Revision history for this message

Curtis Hovey (sinzui) wrote on 2014-07-31:

#8

The tests still fail as of commit 62e172632c3e9d8496805ed5223f9f4acc28986a
http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/944/console
http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/449/console

The commit is after the commits that tried to fix the manual provider:
    ab7affd5c1fe62729bad99109f1a6b08b528f82e
    dff967dd7713aa99afc04c2387cfe0c73fabba0d
    a03922a72b4e4daa8cadb8cd01c6cfd906fa6ff5

I see this for the ec2 instances that were created up for the test
2014-07-31 16:44:03 ERROR juju.cmd supercommand.go:323 Warning: Permanently added 'ec2-54-210-191-68.compute-1.amazonaws.com,172.31.37.64' (ECDSA) to the list of known hosts.
Permission denied (publickey).: subprocess encountered error code 255

I see this for the ppc machines that were selects for the test
2014-07-31 17:21:40 INFO juju.environs.manual init.go:121 series: trusty, characteristics: arch=ppc64 cpu-cores=1 mem=8166M
2014-07-31 17:21:41 ERROR juju.cmd supercommand.go:323 Permission denied (publickey).: subprocess encountered error code 255

Andrew Wilkins (axwalk) on 2014-08-04

Changed in juju-core:
status:	In Progress → Triaged
assignee:	Andrew Wilkins (axwalk) → nobody

Wayne Witzel III (wwitzel3) on 2014-08-04

Changed in juju-core:
assignee:	nobody → Wayne Witzel III (wwitzel3)

Wayne Witzel III (wwitzel3) on 2014-08-05

Changed in juju-core:
assignee:	Wayne Witzel III (wwitzel3) → nobody

Nate Finch (natefinch) on 2014-08-06

Changed in juju-core:
assignee:	nobody → Wayne Witzel III (wwitzel3)

Revision history for this message

Wayne Witzel III (wwitzel3) wrote on 2014-08-06:

#9

I've made many attempts are reproducing this failure locally, even using the euca commands provided from Curtis on IRC, but I am unable to get this auth failure. Even with setting my authorized-keys and/or authorized-keys-path.

I will continue to investigate the differences between the code changes to see if that sparks and ideas about what the cause of this problem is.

Wayne Witzel III (wwitzel3) on 2014-08-07

Changed in juju-core:
assignee:	Wayne Witzel III (wwitzel3) → nobody

Andrew Wilkins (axwalk) on 2014-08-08

Changed in juju-core:
assignee:	nobody → Andrew Wilkins (axwalk)

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-08:

#10

Looks related to recent changes to validate/open/prepare etc.; the state server is attempting to use ssh storage when it shouldn't.

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-08:

#11

Fix proposed: https://github.com/juju/juju/pull/483

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-08:

#12

Should be fixed now.

Changed in juju-core:
status:	Triaged → Fix Committed

Curtis Hovey (sinzui) on 2014-08-11

Changed in juju-core:
status:	Fix Committed → Triaged

Revision history for this message

Curtis Hovey (sinzui) wrote on 2014-08-11:

#13

While the ssh error is gone, there is still an error adding machine. Neither the amd64 test using ec2 instances or the ppc64 test using metal can add-machine. As of Commit c436dc40 these tests are still broken in master.

From http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/1011/console

2014-08-11 11:02:33 INFO euca-describe-instances --filter tag:job_name=manual-deploy-precise-amd64 --filter instance-state-name=running
INSTANCE i-6ded5d41
INSTANCE i-6ced5d40
INSTANCE i-6eed5d42
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-precise-amd64', 'ssh:ec2-54-210-230-190.compute-1.amazonaws.com')' returned non-zero exit status 1

From http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/515/console

Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
juju --show-log destroy-environment manual-deploy-trusty-ppc64 --force -y
2014-08-11 10:43:01 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-trusty-ppc64 gccgo]
2014-08-11 10:43:02 INFO juju.cmd supercommand.go:329 command finished
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-trusty-ppc64', 'ssh:10.245.67.136')' returned non-zero exit status 1

While the ssh error is gone, there is still an error adding machine. Neither the amd64 test using ec2 instances or the ppc64 test using metal can add-machine. As of Commit c436dc40 these tests are still broken in master.

From http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/1011/console

2014-08-11 11:02:33 INFO euca-describe-instances --filter tag:job_name=manual-deploy-precise-amd64 --filter instance-state-name=running
INSTANCE	i-6ded5d41
INSTANCE	i-6ced5d40
INSTANCE	i-6eed5d42
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-precise-amd64', 'ssh:ec2-54-210-230-190.compute-1.amazonaws.com')' returned non-zero exit status 1

From http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/515/console

Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
juju --show-log destroy-environment manual-deploy-trusty-ppc64 --force -y
2014-08-11 10:43:01 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-trusty-ppc64 gccgo]
2014-08-11 10:43:02 INFO juju.cmd supercommand.go:329 command finished
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-trusty-ppc64', 'ssh:10.245.67.136')' returned non-zero exit status 1

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-12:

#14

2014-08-11 23:21:42 ERROR juju.cmd supercommand.go:323 upgrade in progress - Juju functionality is limited

FUUUUUUU

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-12:

#15

That problem should be fixed by a PR that Menno has ready for landing, but was waiting for trunk to be unblocked. I've just pointed the lander at it, and said that it fixes this bug.

Changed in juju-core:
status:	Triaged → In Progress

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-12:

#16

Marking fix committed, please reopen if it persists... I'll keep an eye on CI and do so myself if it's still broken.

Changed in juju-core:
status:	In Progress → Fix Committed

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-12:

#17

Looks like the problem is still happening even with that PR.

Changed in juju-core:
status:	Fix Committed → In Progress

Menno Finlay-Smits (menno.smits) on 2014-08-12

Changed in juju-core:
assignee:	Andrew Wilkins (axwalk) → Menno Smits (menno.smits)

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-12:

#18

To reproduce, with current trunk:

1. Have an instance ready somewhere (e.g. EC2) to use with the manual provider.

2. include "logging-config: <root>=DEBUG" in the environment section in environments.yaml. Mine looked like:

  manual-ec2:
    type: manual
    bootstrap-host: blah-blah.us-west-2.compute.amazonaws.com
    bootstrap-user: ubuntu
    logging-config: <root>=DEBUG

- juju bootstrap --upload-tools

SSH to the bootstrap node and you'll see a panic along the lines of:

panic: interface conversion: interface is float64, not int

goroutine 160 [running]:
runtime.panic(0xdfeb40, 0xc21033e100)
        /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/juju/juju/provider/manual.(*environConfig).storageAddr(0xc2101baef0, 0xc20ce0, 0xc2101910f0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:73 +0xe3
github.com/juju/juju/provider/manual.(*manualEnviron).SetConfig(0xc210351540, 0xc2101b1770, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/environ.go:190 +0x627
github.com/juju/juju/provider/manual.manualProvider.open(0xc2101b8b10, 0xc2101b8b10, 0xc2101b8b00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:84 +0x63
github.com/juju/juju/provider/manual.manualProvider.Open(0xc2101b1770, 0xf57b60, 0x4, 0xc21018f3d0, 0x6)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:78 +0xc1
github.com/juju/juju/provider/manual.manualProvider.Open·i(0x10, 0xc2101b1770, 0x7f9f1d44cf48, 0x10, 0x0, ...)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:1 +0x34
github.com/juju/juju/environs.New(0xc2101b1770, 0xc2101b1770, 0x0, 0x0, 0xc200000001)
        /home/menno/go/src/github.com/juju/juju/environs/open.go:165 +0x8e
github.com/juju/juju/worker.WaitForEnviron(0x7f9f1d45f860, 0xc2102e7f50, 0x7f9f1d46b4f0, 0xc210151180, 0xc2102f5900, ...)
        /home/menno/go/src/github.com/juju/juju/worker/environ.go:52 +0x1cb
github.com/juju/juju/worker/provisioner.(*environProvisioner).loop(0xc21023ab00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:158 +0x191
github.com/juju/juju/worker/provisioner.func·002()
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:144 +0x59
created by github.com/juju/juju/worker/provisioner.NewEnvironProvisioner
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:145 +0x191

Somehow the storage port is ending up a float where an int is expected. There are tests around this but they obviously don't cover this case. I haven't figured out how the float value makes it in.

I will attach the full machine-0.log.

To reproduce, with current trunk:

1. Have an instance ready somewhere (e.g. EC2) to use with the manual provider.

2. include "logging-config: <root>=DEBUG" in the environment section in environments.yaml. Mine looked like:

manual-ec2:
    type: manual
    bootstrap-host: blah-blah.us-west-2.compute.amazonaws.com
    bootstrap-user: ubuntu
    logging-config: <root>=DEBUG

- juju bootstrap --upload-tools

SSH to the bootstrap node and you'll see a panic along the lines of:

panic: interface conversion: interface is float64, not int

goroutine 160 [running]:
runtime.panic(0xdfeb40, 0xc21033e100)
        /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/juju/juju/provider/manual.(*environConfig).storageAddr(0xc2101baef0, 0xc20ce0, 0xc2101910f0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:73 +0xe3
github.com/juju/juju/provider/manual.(*manualEnviron).SetConfig(0xc210351540, 0xc2101b1770, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/environ.go:190 +0x627
github.com/juju/juju/provider/manual.manualProvider.open(0xc2101b8b10, 0xc2101b8b10, 0xc2101b8b00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:84 +0x63
github.com/juju/juju/provider/manual.manualProvider.Open(0xc2101b1770, 0xf57b60, 0x4, 0xc21018f3d0, 0x6)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:78 +0xc1
github.com/juju/juju/provider/manual.manualProvider.Open·i(0x10, 0xc2101b1770, 0x7f9f1d44cf48, 0x10, 0x0, ...)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:1 +0x34
github.com/juju/juju/environs.New(0xc2101b1770, 0xc2101b1770, 0x0, 0x0, 0xc200000001)
        /home/menno/go/src/github.com/juju/juju/environs/open.go:165 +0x8e
github.com/juju/juju/worker.WaitForEnviron(0x7f9f1d45f860, 0xc2102e7f50, 0x7f9f1d46b4f0, 0xc210151180, 0xc2102f5900, ...)
        /home/menno/go/src/github.com/juju/juju/worker/environ.go:52 +0x1cb
github.com/juju/juju/worker/provisioner.(*environProvisioner).loop(0xc21023ab00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:158 +0x191
github.com/juju/juju/worker/provisioner.func·002()
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:144 +0x59
created by github.com/juju/juju/worker/provisioner.NewEnvironProvisioner
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:145 +0x191

Somehow the storage port is ending up a float where an int is expected. There are tests around this but they obviously don't cover this case. I haven't figured out how the float value makes it in.

I will attach the full machine-0.log.

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-12:

#19

machine-0.log Edit (192.0 KiB, text/plain)

machine-0 log from jujud start until the problem happens (and a little further). The panics and jujud restarts repeated continually after this.

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-12:

#20

The problem exists in master but doesn't exist in 1.20.2 so the problem was introduced some time after 1.20.2.

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-12:

#21

Time for someone else to have a go.

Changed in juju-core:
assignee:	Menno Smits (menno.smits) → nobody

Revision history for this message

Nate Finch (natefinch) wrote on 2014-08-12:

#22

The problem appears to be that we deserialize from json, which gives us a float64 in the map, and then just directly try to cast it to an int. We should instead do a type switch for int or float64 and convert.

Ideally we wouldn't have map[string]interface{} in any of our code.... and this is exactly why. But we can at least fix this problem for now.

What's weird is that there's a test that tries to ensure we handle this problem, but it evidently is not catching the error. I'll figure out why and make sure the test fails before I fix the code.

Changed in juju-core:
assignee:	nobody → Nate Finch (natefinch)

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#23

The problem is that our config is ridiculously convoluted, and I've broken it while fixing a different config-related issue. SetConfig is being called with unvalidated config, and I recently changed SetConfig to *not* validate it where it was being before. My thinking was that we should not be adding defaults to an already prepared config. Unfortunately this means the code that coerces float64s to ints doesn't get executed.

I'll take this over, will see what I can come up with today.

Changed in juju-core:
assignee:	Nate Finch (natefinch) → Andrew Wilkins (axwalk)

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#24

And now I see that there's a PR that has fixed this (in a different way). Closing...

Changed in juju-core:
assignee:	Andrew Wilkins (axwalk) → Nate Finch (natefinch)
status:	In Progress → Fix Committed

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#25

Still broken, in yet another way.

Changed in juju-core:
status:	Fix Committed → Triaged
assignee:	Nate Finch (natefinch) → nobody

Andrew Wilkins (axwalk) on 2014-08-13

Changed in juju-core:
assignee:	nobody → Andrew Wilkins (axwalk)
status:	Triaged → In Progress

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#26

OK, I can reproduce this in my environment. Hope to have it fixed soon.

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#27

Found the culprit.

Revision history for this message

Andrew Wilkins (axwalk) wrote on 2014-08-13:

#28

See https://github.com/juju/juju/pull/504 for details on what was broken.

Andrew Wilkins (axwalk) on 2014-08-13

Changed in juju-core:
status:	In Progress → Fix Committed

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2014-08-13:

#29

Just following up... the manual provider CI jobs are now passing reliably. This ticket is finally *done*.

Curtis Hovey (sinzui) on 2014-08-21

Changed in juju-core:
status:	Fix Committed → Fix Released

juju-core

Manual provider does not respond after bootstrap

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches