Manual provider does not respond after bootstrap

Bug #1347715 reported by Curtis Hovey
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
Critical
Andrew Wilkins

Bug Description

Both manual-provider tests show that it cannot bootstrap instances in aws or metal in the lab
    http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/
    http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/

Exception: Timed out waiting for juju status to succeed: Command '('juju', '--show-log', 'status', '-e', 'manual-deploy-trusty-ppc64')'

the all-machines.log reports
    machine-0: 2014-07-23 13:08:32 ERROR juju.worker runner.go:218 exited "upgrade-steps": unexpected quit

The regression is introduced by commit d57b76cd in master (the previous commit passed)
    Merge pull request #358 from tasdomas/factory-improvements

One option to fix this issue is to revert the commit.

I am attaching the logs from the latest test, which is from the next commit in master.

Revision history for this message
Curtis Hovey (sinzui) wrote :
Revision history for this message
Curtis Hovey (sinzui) wrote :

Both the ec2 and ppc machines have network to retrieve tools. The tests pass when we test 1.20.2
machine-0: 2014-07-25 20:05:08 ERROR juju.worker runner.go:219 exited "api": unable to connect to "wss://localhost:17070/"

The master branch versions have not passed in 4 days since commit 2285c3e3. With commit e2e54ff5 we see failures.

Revision history for this message
Curtis Hovey (sinzui) wrote :

The regression was introduced in one of these commits.
    f7ab0772a7bb499ec1978d0a61510113ac0d07fe Merge pull request #348 from klyachin/130-cloudinit-networker
    ba01d526fe92ff67d3d3839af691d0c464b83d87 Merge pull request #342 from axw/ensurenotbootstrapped-stateserverinstances
    e2e54ff5d9a2574c1c8106dafd1740825a616c0a
Merge pull request #328 from gabriel-samfira/add-os-field

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Looking into it. I can manually bootstrap an LXC container locally without any issues.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Still trying to verify, but it looks a bit like the client is not setting use-sshstorage=true in the .jenv file. This would mean that the client tries to use HTTP storage. It appears that the security group does not permit port 8040, the HTTP storage port, which would explain why the client cannot get through that way.

Andrew Wilkins (axwalk)
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

So I made the change, and it's working its way through the queue... but now, after further investigation, I'm fairly convinced that it will not fix the problem. I'm waiting for the test to start again so I can get on the machine and gather some more data.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Okay, I've found the culprit. ssh known_hosts warnings. sigh

Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → In Progress
Revision history for this message
Curtis Hovey (sinzui) wrote :

The tests still fail as of commit 62e172632c3e9d8496805ed5223f9f4acc28986a
    http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/944/console
    http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/449/console

The commit is after the commits that tried to fix the manual provider:
    ab7affd5c1fe62729bad99109f1a6b08b528f82e
    dff967dd7713aa99afc04c2387cfe0c73fabba0d
    a03922a72b4e4daa8cadb8cd01c6cfd906fa6ff5

I see this for the ec2 instances that were created up for the test
    2014-07-31 16:44:03 ERROR juju.cmd supercommand.go:323 Warning: Permanently added 'ec2-54-210-191-68.compute-1.amazonaws.com,172.31.37.64' (ECDSA) to the list of known hosts.
    Permission denied (publickey).: subprocess encountered error code 255

I see this for the ppc machines that were selects for the test
    2014-07-31 17:21:40 INFO juju.environs.manual init.go:121 series: trusty, characteristics: arch=ppc64 cpu-cores=1 mem=8166M
    2014-07-31 17:21:41 ERROR juju.cmd supercommand.go:323 Permission denied (publickey).: subprocess encountered error code 255

Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Triaged
assignee: Andrew Wilkins (axwalk) → nobody
Changed in juju-core:
assignee: nobody → Wayne Witzel III (wwitzel3)
Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → nobody
Nate Finch (natefinch)
Changed in juju-core:
assignee: nobody → Wayne Witzel III (wwitzel3)
Revision history for this message
Wayne Witzel III (wwitzel3) wrote :

I've made many attempts are reproducing this failure locally, even using the euca commands provided from Curtis on IRC, but I am unable to get this auth failure. Even with setting my authorized-keys and/or authorized-keys-path.

I will continue to investigate the differences between the code changes to see if that sparks and ideas about what the cause of this problem is.

Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → nobody
Andrew Wilkins (axwalk)
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Looks related to recent changes to validate/open/prepare etc.; the state server is attempting to use ssh storage when it shouldn't.

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Should be fixed now.

Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Triaged
Revision history for this message
Curtis Hovey (sinzui) wrote :

While the ssh error is gone, there is still an error adding machine. Neither the amd64 test using ec2 instances or the ppc64 test using metal can add-machine. As of Commit c436dc40 these tests are still broken in master.

From http://juju-ci.vapour.ws:8080/job/manual-deploy-precise-amd64/1011/console

2014-08-11 11:02:33 INFO euca-describe-instances --filter tag:job_name=manual-deploy-precise-amd64 --filter instance-state-name=running
INSTANCE i-6ded5d41
INSTANCE i-6ced5d40
INSTANCE i-6eed5d42
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/mnt/jenkinshome/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/mnt/jenkinshome/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-precise-amd64', 'ssh:ec2-54-210-230-190.compute-1.amazonaws.com')' returned non-zero exit status 1

From http://juju-ci.vapour.ws:8080/job/manual-deploy-trusty-ppc64/515/console

Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
Warning: Permanently added '10.245.67.135' (ECDSA) to the list of known hosts.
juju --show-log destroy-environment manual-deploy-trusty-ppc64 --force -y
2014-08-11 10:43:01 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-trusty-ppc64 gccgo]
2014-08-11 10:43:02 INFO juju.cmd supercommand.go:329 command finished
Traceback (most recent call last):
  File "/var/lib/jenkins/juju-ci-tools/deploy_job.py", line 4, in <module>
    deploy_job()
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 320, in deploy_job
    bootstrap_host, machines, series, log_dir, debug)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 377, in _deploy_job
    env, already_bootstrapped=True, machines=machines)
  File "/var/lib/jenkins/juju-ci-tools/deploy_stack.py", line 60, in prepare_environment
    env.juju('add-machine', 'ssh:' + machine)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 254, in juju
    return self.client.juju(self, command, args)
  File "/var/lib/jenkins/juju-ci-tools/jujupy.py", line 160, in juju
    return subprocess.check_call(args)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('juju', '--show-log', 'add-machine', '-e', 'manual-deploy-trusty-ppc64', 'ssh:10.245.67.136')' returned non-zero exit status 1

Revision history for this message
Andrew Wilkins (axwalk) wrote :

2014-08-11 23:21:42 ERROR juju.cmd supercommand.go:323 upgrade in progress - Juju functionality is limited

FUUUUUUU

Revision history for this message
Andrew Wilkins (axwalk) wrote :

That problem should be fixed by a PR that Menno has ready for landing, but was waiting for trunk to be unblocked. I've just pointed the lander at it, and said that it fixes this bug.

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Marking fix committed, please reopen if it persists... I'll keep an eye on CI and do so myself if it's still broken.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Looks like the problem is still happening even with that PR.

Changed in juju-core:
status: Fix Committed → In Progress
Changed in juju-core:
assignee: Andrew Wilkins (axwalk) → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

To reproduce, with current trunk:

1. Have an instance ready somewhere (e.g. EC2) to use with the manual provider.

2. include "logging-config: <root>=DEBUG" in the environment section in environments.yaml. Mine looked like:

  manual-ec2:
    type: manual
    bootstrap-host: blah-blah.us-west-2.compute.amazonaws.com
    bootstrap-user: ubuntu
    logging-config: <root>=DEBUG

- juju bootstrap --upload-tools

SSH to the bootstrap node and you'll see a panic along the lines of:

panic: interface conversion: interface is float64, not int

goroutine 160 [running]:
runtime.panic(0xdfeb40, 0xc21033e100)
        /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/juju/juju/provider/manual.(*environConfig).storageAddr(0xc2101baef0, 0xc20ce0, 0xc2101910f0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:73 +0xe3
github.com/juju/juju/provider/manual.(*manualEnviron).SetConfig(0xc210351540, 0xc2101b1770, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/environ.go:190 +0x627
github.com/juju/juju/provider/manual.manualProvider.open(0xc2101b8b10, 0xc2101b8b10, 0xc2101b8b00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:84 +0x63
github.com/juju/juju/provider/manual.manualProvider.Open(0xc2101b1770, 0xf57b60, 0x4, 0xc21018f3d0, 0x6)
        /home/menno/go/src/github.com/juju/juju/provider/manual/provider.go:78 +0xc1
github.com/juju/juju/provider/manual.manualProvider.Open·i(0x10, 0xc2101b1770, 0x7f9f1d44cf48, 0x10, 0x0, ...)
        /home/menno/go/src/github.com/juju/juju/provider/manual/config.go:1 +0x34
github.com/juju/juju/environs.New(0xc2101b1770, 0xc2101b1770, 0x0, 0x0, 0xc200000001)
        /home/menno/go/src/github.com/juju/juju/environs/open.go:165 +0x8e
github.com/juju/juju/worker.WaitForEnviron(0x7f9f1d45f860, 0xc2102e7f50, 0x7f9f1d46b4f0, 0xc210151180, 0xc2102f5900, ...)
        /home/menno/go/src/github.com/juju/juju/worker/environ.go:52 +0x1cb
github.com/juju/juju/worker/provisioner.(*environProvisioner).loop(0xc21023ab00, 0x0, 0x0)
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:158 +0x191
github.com/juju/juju/worker/provisioner.func·002()
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:144 +0x59
created by github.com/juju/juju/worker/provisioner.NewEnvironProvisioner
        /home/menno/go/src/github.com/juju/juju/worker/provisioner/provisioner.go:145 +0x191

Somehow the storage port is ending up a float where an int is expected. There are tests around this but they obviously don't cover this case. I haven't figured out how the float value makes it in.

I will attach the full machine-0.log.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

machine-0 log from jujud start until the problem happens (and a little further). The panics and jujud restarts repeated continually after this.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The problem exists in master but doesn't exist in 1.20.2 so the problem was introduced some time after 1.20.2.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Time for someone else to have a go.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Nate Finch (natefinch) wrote :

The problem appears to be that we deserialize from json, which gives us a float64 in the map, and then just directly try to cast it to an int. We should instead do a type switch for int or float64 and convert.

Ideally we wouldn't have map[string]interface{} in any of our code.... and this is exactly why. But we can at least fix this problem for now.

What's weird is that there's a test that tries to ensure we handle this problem, but it evidently is not catching the error. I'll figure out why and make sure the test fails before I fix the code.

Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

The problem is that our config is ridiculously convoluted, and I've broken it while fixing a different config-related issue. SetConfig is being called with unvalidated config, and I recently changed SetConfig to *not* validate it where it was being before. My thinking was that we should not be adding defaults to an already prepared config. Unfortunately this means the code that coerces float64s to ints doesn't get executed.

I'll take this over, will see what I can come up with today.

Changed in juju-core:
assignee: Nate Finch (natefinch) → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

And now I see that there's a PR that has fixed this (in a different way). Closing...

Changed in juju-core:
assignee: Andrew Wilkins (axwalk) → Nate Finch (natefinch)
status: In Progress → Fix Committed
Revision history for this message
Andrew Wilkins (axwalk) wrote :

Still broken, in yet another way.

Changed in juju-core:
status: Fix Committed → Triaged
assignee: Nate Finch (natefinch) → nobody
Andrew Wilkins (axwalk)
Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

OK, I can reproduce this in my environment. Hope to have it fixed soon.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Found the culprit.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

See https://github.com/juju/juju/pull/504 for details on what was broken.

Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Just following up... the manual provider CI jobs are now passing reliably. This ticket is finally *done*.

Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.