Can't reprovision a machine with manual provider

Bug #1418139 reported by Michael Foord
This bug affects 3 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Nate Finch
juju-core       Fix Released  High        Nate Finch
1.25            Fix Released  High        Nate Finch

Bug Description

If I bootstrap to a machine using the manual provider, then run destroy-environment and attempt to re-bootstrap, it fails. This is because of "turds" left in place that "destroy-environment" doesn't clean up.

$ juju bootstrap --upload-tools
Bootstrapping environment "manual"
Starting new instance for initial state server
ERROR failed to bootstrap environment: machine is already provisioned

$ ssh michael@192.168.178.180
michael@192.168.178.180's password:
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-45-generic x86_64)

 * Documentation: https://help.ubuntu.com/

  System information as of Wed Feb 4 17:28:09 GMT 2015

  System load: 0.04 Processes: 94
  Usage of /: 25.0% of 6.76GB Users logged in: 1
  Memory usage: 14% IP address for eth0: 192.168.178.180
  Swap usage: 0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

Last login: Wed Feb 4 17:20:46 2015 from ubuntubox.fritz.box
michael@ubuntu:~$ sudo find / -name "*juju*"
[sudo] password for michael:
/etc/apt/apt.conf.d/42-juju-proxy-settings
/etc/init/jujud-unit-wordpress-0.conf
/etc/sudoers.d/90-juju-ubuntu
/usr/lib/juju
/usr/lib/python2.7/dist-packages/landscape/lib/juju.pyc
/usr/lib/python2.7/dist-packages/landscape/lib/juju.py
/usr/share/doc/juju-mongodb
/home/ubuntu/.juju-proxy
/var/cache/apt/archives/juju-mongodb_2.4.9-0ubuntu3_amd64.deb
/var/log/juju
/var/log/upstart/juju-db.log
/var/lib/dpkg/info/juju-mongodb.list
/var/lib/dpkg/info/juju-mongodb.md5sums
/var/lib/juju
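
A quick way to check a host for the same kind of leftovers before re-bootstrapping is to test a handful of paths from the listing above. This is only a sketch: the path list is copied from the find output, and which of these paths the manual provider actually inspects when it reports "machine is already provisioned" is not established in this report. The leftovers helper and its exists parameter are hypothetical names, not part of juju.

```go
package main

import (
	"fmt"
	"os"
)

// candidatePaths is copied from the find output above. It is an assumption
// for illustration, not an authoritative list of what juju checks.
var candidatePaths = []string{
	"/var/lib/juju",
	"/var/log/juju",
	"/etc/init/jujud-unit-wordpress-0.conf",
	"/etc/sudoers.d/90-juju-ubuntu",
	"/etc/apt/apt.conf.d/42-juju-proxy-settings",
	"/home/ubuntu/.juju-proxy",
}

// leftovers returns the subset of paths for which exists reports true.
// exists is injected so the logic can be exercised without touching disk.
func leftovers(paths []string, exists func(string) bool) []string {
	var found []string
	for _, p := range paths {
		if exists(p) {
			found = append(found, p)
		}
	}
	return found
}

// onDisk reports whether path can be stat'ed on the local filesystem.
// Any error (including a permission error) is treated as "absent".
func onDisk(path string) bool {
	_, err := os.Lstat(path)
	return err == nil
}

func main() {
	for _, p := range leftovers(candidatePaths, onDisk) {
		fmt.Println("leftover:", p)
	}
}
```

Run it on the target machine (with sudo if the paths are not world-readable); any line it prints is a path that survived destroy-environment.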

Revision history for this message
Curtis Hovey (sinzui) wrote :

This has happened twice on the machines we use for manual provider testing. This is also the same problem as a failed destroy-environment with the local provider. When destroy-environment fails to clean up resources, a subsequent bootstrap will also fail.

I believe destroy-environment --force does not guarantee cleanup of resources, and several people have used --force in error.

Was this failure seen with a released juju version? Does it happen all the time? Was --force used?

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
tags: added: bootstrap destroy-environment manual-provider
Revision history for this message
Michael Foord (mfoord) wrote :

When using kvm images and the manual provider I see the issue reproducibly even when *not* using --force. This is using trunk; I can test with 1.21 and 1.22 as well to see if it's a regression.

Revision history for this message
Michael Foord (mfoord) wrote :

I've just tried again with 1.21, 1.22 and 1.23 (current branches in all cases rather than a released version) and it worked fine. I suspect the problem is therefore that my memory is faulty and I used "--force" to destroy-environment. I'll retry to confirm.

Revision history for this message
Michael Foord (mfoord) wrote :

Ok, so it worked when I used "--force" as well. When I was seeing this reliably I had the kvm instance firewalled off to use a proxy. It seems unlikely that this is the difference, but I'll try it and see.

Revision history for this message
Michael Foord (mfoord) wrote : Re: Can't reprovision a machine with manual provider and machine behind a proxy

Hah, so with trunk and a firewalled kvm instance (only access to the network via a proxy except for the state server port and ssh) I see the problem.

$ juju bootstrap --upload-tools
Bootstrapping environment "manual"
Starting new instance for initial state server
Building tools to upload (1.23-alpha1.1-trusty-amd64)
Installing Juju agent on bootstrap instance
Logging to /var/log/cloud-init-output.log on remote host
Running apt-get update
Installing package: curl
Installing package: cpu-checker
Installing package: bridge-utils
Installing package: rsyslog-gnutls
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
Bootstrap complete
$ juju destroy-environment
error: no environment specified
$ juju destroy-environment manual
WARNING! this command will destroy the "manual" environment (type: manual)
This includes all machines, services, data and other resources.

Continue [y/N]? y
$ juju bootstrap --upload-tools
WARNING This juju environment is already bootstrapped. If you want to start a new Juju
environment, first run juju destroy-environment to clean up, or switch to an
alternative environment.
ERROR environment is already bootstrapped

summary: - Can't reprovision a machine with manual provider
+ Can't reprovision a machine with manual provider and machine behind a
+ proxy
Revision history for this message
Chris Gregan (cgregan) wrote :

I have also reproduced this issue using the following steps:

1) Bootstrap a manual environment
2) Deploy a charm that fails to complete deployment
3) Reboot local machine to ensure all connections to env are closed
4) Open terminal
5) juju switch
6) juju destroy-environment -e manual

The system requests the password for the local ssh keyfile and attempts to destroy the environment, but fails.
After that, destroy-environment reports that the environment is not bootstrapped, but bootstrap reports that the machine is already provisioned.

Changed in juju-core:
assignee: nobody → Eric Snow (ericsnowcurrently)
status: Triaged → In Progress
Revision history for this message
Eric Snow (ericsnowcurrently) wrote :

I poked at this but was not able to get things set up to reproduce. I also tried to take the alternate approach of modifying the "uninstall juju" script that the manual provider runs when destroying the environment:

  https://github.com/juju/juju/blob/juju-1.25.3/provider/manual/environ.go#L283

I deleted the lines that remove the files and directories. However, though the script ran, the directories and files were removed anyway. So not only does that script run but the agent also cleans up after itself directly (e.g. the data dir is deleted when the agent gets ErrTerminateAgent).

Anyway, apparently the proxy-dependent scenario is causing both the manual provider's cleanup script *and* the machine agent's cleanup operations to be circumvented. That said, it looks like *some* files were cleaned up (e.g. /etc/rsyslog.d/25-juju.conf), based on Michael's original post-destroy file listing.

So hopefully that combination of clues gets us closer to the source of the problem.

Changed in juju-core:
assignee: Eric Snow (ericsnowcurrently) → nobody
status: In Progress → Triaged
Nate Finch (natefinch)
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
Revision history for this message
Nate Finch (natefinch) wrote :

I don't seem to need to do anything special for this to break. It just breaks randomly some of the time, as far as I can tell.

Revision history for this message
Nate Finch (natefinch) wrote :

Here's the end of the log... the problem is a panic during shutdown:

....
2016-05-06 02:52:25 ERROR juju.worker runner.go:212 fatal "instancepoller": watcher has been stopped
2016-05-06 02:52:25 ERROR juju.worker runner.go:212 fatal "environ-provisioner": watcher has been stopped
2016-05-06 02:52:25 ERROR juju.worker runner.go:212 fatal "firewaller": watcher has been stopped
2016-05-06 02:52:25 ERROR juju.worker runner.go:223 exited "ceffd324-fb1b-488c-894d-3ddc5e0c5216": watcher has been stopped
2016-05-06 02:52:25 ERROR juju.worker runner.go:212 fatal "api-post-upgrade": watcher has been stopped
2016-05-06 02:52:25 ERROR juju.worker runner.go:223 exited "api": watcher has been stopped
panic: send on closed channel

goroutine 776 [running]:
panic(0x1937800, 0xc820aeda00)
 /home/nate/go/src/runtime/panic.go:464 +0x3e6
github.com/godbus/dbus.(*Conn).inWorker(0xc820620480)
 /home/nate/src/github.com/godbus/dbus/conn.go:327 +0x1048
created by github.com/godbus/dbus.(*Conn).Auth
 /home/nate/src/github.com/godbus/dbus/auth.go:118 +0xf3d
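
For context, "send on closed channel" is a Go runtime panic: once a channel has been closed, any later send on it panics, and an unrecovered panic in any goroutine terminates the whole process. The trace places the send in godbus's inWorker goroutine. The sketch below does not reproduce godbus's code; it only demonstrates the class of failure, using a hypothetical sendAfterClose helper that recovers the panic so the message can be inspected.

```go
package main

import "fmt"

// sendAfterClose closes a channel and then sends on it, recovering the
// resulting runtime panic and returning its message.
func sendAfterClose() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	ch := make(chan int, 1)
	close(ch)
	ch <- 1 // panics: the channel is already closed
	return "no panic"
}

func main() {
	fmt.Println(sendAfterClose()) // prints "send on closed channel"
}
```

Without the recover, the process would exit at the send, which is consistent with the agent dying during shutdown before its cleanup finished.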

Revision history for this message
Nate Finch (natefinch) wrote :

There's a bug filed (by the intrepid davecheney, no less) back in November:
https://github.com/godbus/dbus/issues/45

The response? PRs welcome, we don't care, we never close the connections anyway.

Revision history for this message
Nate Finch (natefinch) wrote :

I have a fix... it's super simple. Just needs a test, but it's late, so I'm punting on that.

For our purposes, we can patch it as needed for demos.

Here's my PR to the original repo: https://github.com/godbus/dbus/pull/64
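
One common way to avoid this class of panic is to guard both the send and the close with a mutex and a closed flag, so that a late send becomes a no-op. This is a generic sketch of that pattern, not the change made in the PR linked above; safeChan and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// safeChan wraps a channel so that Send after Close is a no-op
// instead of a panic.
type safeChan struct {
	mu     sync.Mutex
	closed bool
	ch     chan string
}

func newSafeChan(size int) *safeChan {
	return &safeChan{ch: make(chan string, size)}
}

// Send queues v unless the channel is closed or its buffer is full.
// It reports whether v was queued.
func (s *safeChan) Send(v string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.ch <- v:
		return true
	default:
		return false // buffer full: drop rather than block while holding the lock
	}
}

// Close closes the channel once; later calls are no-ops.
func (s *safeChan) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.ch)
	}
}

func main() {
	c := newSafeChan(1)
	fmt.Println(c.Send("signal")) // true: queued before Close
	c.Close()
	fmt.Println(c.Send("late signal")) // false: dropped, no panic
}
```

The trade-off in this sketch is that a message arriving after shutdown is silently dropped, which is usually acceptable when the receiver is going away anyway.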

Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Nate Finch (natefinch) wrote :

PR now has a test that will show the panic if you remove the fix: https://github.com/godbus/dbus/pull/64

Revision history for this message
Nate Finch (natefinch) wrote :

PR for fix in our repo here: https://github.com/juju/juju/pull/5363

Nate Finch (natefinch)
summary: - Can't reprovision a machine with manual provider and machine behind a
- proxy
+ Can't reprovision a machine with manual provider
Changed in juju-core:
importance: Medium → High
milestone: none → 2.0-beta7
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Curtis Hovey (sinzui)
tags: added: manual-story
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Nate - was the update to godbus the only change needed to fix this bug?

Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Nate

Could you please update this report?

I saw that your fix to 1.25 has landed. Has the fix landed in master too?

Is there anything else needed to consider this bug fixed?

Thank you :)

Revision history for this message
Nate Finch (natefinch) wrote :

The fix has landed in both 1.25 and master. It's been in master since May 9th.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta15 → none
milestone: none → 2.0-beta15
Changed in juju-core:
assignee: nobody → Nate Finch (natefinch)
importance: Undecided → High
status: New → Fix Released