Canonical Juju

manual-provider: systemd services left behind

Bug #1611453 reported by Curtis Hovey on 2016-08-09

This bug affects 1 person

Affects          Status         Importance   Assigned to      Milestone
Canonical Juju   Fix Released   Critical     Andrew Wilkins

Bug Description

As seen at
http://reports.vapour.ws/releases/issue/57680d5a749a560ba9ca107f

manual-provider: systemd services left behind. One of these:
juju-db.service
jujud-machine-X.service
jujud-unit-dummy-XXXXXX-0.service

CI has a new cleanup script that removes the leftover services and marks the Juju run as failed. Before this script existed, every xenial manual-provider test failed on its second try.

Curtis Hovey (sinzui) wrote on 2016-08-09:

This summary of just the controller host shows that only the systemd service symlinks are left behind. jujud and mongod are not running. The /var/lib/juju dir does not exist.

+ DIRTY=0
+ JUJU_DIR=/var/lib/juju
+ DUMMY_DIR=/var/run/dummy-sink
Cleaning manual machine
+ echo 'Cleaning manual machine'
+ [[ -d /var/run/dummy-sink ]]
+ ps -f -C jujud
UID PID PPID C STIME TTY TIME CMD
+ ps -f -C mongod
UID PID PPID C STIME TTY TIME CMD
+ [[ -d /etc/systemd/system ]]
++ ls /etc/systemd/system/juju-db.service /etc/systemd/system/jujud-machine-0.service
+ found='/etc/systemd/system/juju-db.service
/etc/systemd/system/jujud-machine-0.service'
ERROR manual-provider: systemd services left behind.
+ [[ -n /etc/systemd/system/juju-db.service
/etc/systemd/system/jujud-machine-0.service ]]
+ DIRTY=1
+ echo 'ERROR manual-provider: systemd services left behind.'
+ for service_path in '$found'
++ basename /etc/systemd/system/juju-db.service
+ service=juju-db.service
+ sudo systemctl stop --force juju-db.service
Failed to stop juju-db.service: Unit juju-db.service not loaded.
+ true
+ sudo systemctl disable juju-db.service
Failed to execute operation: No such file or directory
+ true
+ sudo rm /etc/systemd/system/juju-db.service
+ for service_path in '$found'
++ basename /etc/systemd/system/jujud-machine-0.service
+ service=jujud-machine-0.service
+ sudo systemctl stop --force jujud-machine-0.service
Failed to stop jujud-machine-0.service: Unit jujud-machine-0.service not loaded.
+ true
+ sudo systemctl disable jujud-machine-0.service
Failed to execute operation: No such file or directory
+ true
+ sudo rm /etc/systemd/system/jujud-machine-0.service
+ [[ -d /etc/init ]]
++ find /etc/init -name 'juju*' -print
+ found=
+ [[ -n '' ]]
+ [[ -d /var/lib/juju ]]
Cleaning completed
+ echo 'Cleaning completed'
+ exit 1
++ '[' 1 = 1 ']'
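
The systemd portion of the check traced above can be sketched roughly as follows. This is a reconstruction from the trace, not the actual CI script: the glob patterns and the SYSTEMD_DIR, SYSTEMCTL and RM override variables are assumptions, added so the logic can be tried against a scratch directory instead of a real host.

```sh
#!/bin/sh
# Sketch only: reconstructed from the trace above, not the real CI cleanup script.
# SYSTEMD_DIR, SYSTEMCTL and RM are hypothetical overrides; the defaults mirror
# the paths and commands visible in the trace.

clean_juju_units() {
    systemd_dir="${SYSTEMD_DIR:-/etc/systemd/system}"
    systemctl_cmd="${SYSTEMCTL:-sudo systemctl}"
    rm_cmd="${RM:-sudo rm}"
    dirty=0

    # Assumed patterns: juju-db.service plus any jujud-*.service unit.
    for service_path in "$systemd_dir"/juju-db.service "$systemd_dir"/jujud-*.service; do
        # An unmatched glob stays literal, and -e is false for a dangling
        # symlink, so check -L as well before skipping.
        if [ ! -e "$service_path" ] && [ ! -L "$service_path" ]; then
            continue
        fi
        dirty=1
        service=$(basename "$service_path")
        echo "ERROR manual-provider: systemd service left behind: $service"
        # As in the trace, stop and disable may fail when the unit is not
        # loaded; those failures are ignored.
        $systemctl_cmd stop "$service" || true
        $systemctl_cmd disable "$service" || true
        $rm_cmd -f "$service_path"
    done

    # Non-zero means leftovers were found, so the caller can fail the run.
    return "$dirty"
}
```

Calling clean_juju_units with the defaults acts on /etc/systemd/system through sudo. Setting SYSTEMD_DIR to a temporary directory, SYSTEMCTL=true and RM=rm exercises the same logic without touching the host.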

Alexis Bruemmer (alexis-bruemmer) on 2016-08-09

Changed in juju-core:
assignee:	nobody → Alexis Bruemmer (alexis-bruemmer)

Anastasia (anastasia-macmood) on 2016-08-10

Changed in juju-core:
milestone:	2.0-beta15 → 2.0-beta16

Andrew Wilkins (axwalk) on 2016-08-11

Changed in juju-core:
status:	Triaged → In Progress
assignee:	Alexis Bruemmer (alexis-bruemmer) → Andrew Wilkins (axwalk)

Andrew Wilkins (axwalk) wrote on 2016-08-11:

machine-1.log (80.1 KiB, text/plain)

I've fixed kill-controller so the bootstrap machine's systemd bits are cleaned up, but I've found an issue when removing manual machines. Something is wedging the machine agent so it never gets to the uninstall logic. Restarting the agent causes it to uninstall itself as expected.

Attaching a log from a machine agent which I sent SIGQUIT to, dumping the goroutine status.

Andrew Wilkins (axwalk) wrote on 2016-08-18:

I can no longer reproduce this on master. Maybe fixed by rogpeppe's recent changes to do with killing RPC requests.

Changed in juju-core:
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju
Changed in juju:
milestone:	2.0-beta16 → none
milestone:	none → 2.0-beta16

Curtis Hovey (sinzui) on 2016-08-25

Changed in juju:
status:	Fix Committed → Fix Released

Aaron Bentley (abentley) wrote on 2017-01-20:

It appears Andrew's fix did not completely fix the issue. We've continued to observe it every month since then, though it is far less common than in August.

Changed in juju:
status:	Fix Released → Triaged

Anastasia (anastasia-macmood) on 2017-01-20

Changed in juju:
milestone:	2.0-beta16 → 2.1.0

Andrew Wilkins (axwalk) wrote on 2017-01-22:

functional-assess-cloud-manual is using kill-controller (at the end), which is not guaranteed to clean up manual machines. It will work *if* it is able to talk to the API, but in this case the test is explicitly stopping jujud-machine-0 to exercise the can't-talk-to-API branch.

Aaron, IIRC we discussed this in Barcelona, and you were going to disable that bit for manual? It might have been a different test, but it's the same sort of scenario I think.

Anastasia (anastasia-macmood) on 2017-01-23

Changed in juju:
status:	Triaged → Incomplete
milestone:	2.1.0 → none

Nicholas Skaggs (nskaggs) wrote on 2017-01-24:

@Aaron, can you speak to this? See Andrew's comment above about not using kill-controller. Is there a reason kill-controller is special in the manual case?

Aaron Bentley (abentley) wrote on 2017-01-24:

After discussion with Curtis, we believe the test is correct. If we could install juju over SSH, we should be able to remove it over SSH.

Andrew Wilkins (axwalk) wrote on 2017-01-24:

> After discussion with Curtis, we believe the test is correct. If we could install juju over SSH, we should be able to remove it over SSH.

I'll repeat what we discussed in Barcelona, hopefully it will ring a bell.

kill-controller is effectively the same as the old "destroy-environment --force", with a (IMO misguided) attempt to do the destruction "safely" via the API first. So you can think of kill-controller as "destroy-environment || destroy-environment --force".

Now, *IF* Juju can connect to the API then all will be well. The knowledge of manual machines is defined in the controller -- *only* in the controller -- so talking to the controller to clean up means that the knowledge of those manual machines is accessible.

In the fallback case, Juju is doing everything from the client side. The client is not necessarily the same one that added the manual machines. There is no knowledge of them at the client; it is all within the controller. This is unlike cloud providers, where the client can communicate directly with the cloud to identify resources related to the controller. There is nowhere to go, except the controller, which is inaccessible (or the code path wouldn't be taken).

I hope that clarifies things.
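
The two code paths described in this comment can be illustrated with a small sketch. It only models the control flow as described here; it is not Juju's implementation. The names controller_reachable, API_UP, CONTROLLER_MACHINES and kill_controller_sketch are hypothetical stand-ins, and the echoed lines stand in for the real cleanup work.

```sh
#!/bin/sh
# Illustration only. None of these names are Juju commands or Juju internals.
# CONTROLLER_MACHINES stands in for the list of manual machines, which per the
# comment above is known only to the controller.

controller_reachable() {
    # Hypothetical switch: API_UP=1 simulates a controller that answers API calls.
    [ "${API_UP:-0}" = 1 ]
}

kill_controller_sketch() {
    if controller_reachable; then
        # API path: the controller can enumerate its manual machines,
        # so each one can be cleaned up.
        for machine in ${CONTROLLER_MACHINES:-}; do
            echo "clean $machine via controller"
        done
        echo "controller destroyed via API"
    else
        # Fallback path: everything runs from the client, which holds no
        # record of the manual machines, so they cannot be enumerated.
        echo "controller destroyed from client; manual machines not cleaned"
    fi
}
```

With API_UP=1 and CONTROLLER_MACHINES="machine-1 machine-2", the sketch reports a cleanup step for each machine. With API_UP=0 it reports only the client-side destruction, which matches the situation the test creates by stopping jujud-machine-0.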

Aaron Bentley (abentley) wrote on 2017-01-25:

On 2017-01-24 06:55 PM, Andrew Wilkins wrote:
> There is nowhere to go, except the controller, which is inaccessible (or the code path wouldn't be taken).

It depends what you mean by "controller". In the functional-assess-cloud-manual test, the jujud-machine-0 service is not running, but the machine is running and accessible by ssh.

If you look at
http://reports.vapour.ws/releases/4761/job/functional-assess-cloud-manual/attempt/152
you'll see there are errors about machine 0 being dirty.

There are also errors about machine 1, and I agree that we can't plausibly clean up machine 1 (unless we were to store information about added machines somewhere other than mongodb, like machine 0's filesystem).

Aaron Bentley (abentley) wrote on 2017-01-31:

#10

It appears that the errors I saw about machine 0 were about *default-model* machine 0, not controller machine 0, so my comments have a faulty basis.

Aaron Bentley (abentley) on 2017-01-31