manual-provider: systemd services left behind

Bug #1611453 reported by Curtis Hovey
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins
Milestone: (none)

Bug Description

As seen at
    http://reports.vapour.ws/releases/issue/57680d5a749a560ba9ca107f

manual-provider: systemd services left behind. One of these is left on the host:
juju-db.service
jujud-machine-X.service
jujud-unit-dummy-XXXXXX-0.service

CI has a new cleanup script that will clean up the services and fail juju. Prior to this script, all xenial manual provider tests failed on try 2.

Curtis Hovey (sinzui) wrote :

This summary of just the controller host shows that only the systemd service symlinks are left behind. jujud and mongod are not running. The /var/lib/juju dir does not exist.

+ DIRTY=0
+ JUJU_DIR=/var/lib/juju
+ DUMMY_DIR=/var/run/dummy-sink
Cleaning manual machine
+ echo 'Cleaning manual machine'
+ [[ -d /var/run/dummy-sink ]]
+ ps -f -C jujud
UID PID PPID C STIME TTY TIME CMD
+ ps -f -C mongod
UID PID PPID C STIME TTY TIME CMD
+ [[ -d /etc/systemd/system ]]
++ ls /etc/systemd/system/juju-db.service /etc/systemd/system/jujud-machine-0.service
+ found='/etc/systemd/system/juju-db.service
/etc/systemd/system/jujud-machine-0.service'
ERROR manual-provider: systemd services left behind.
+ [[ -n /etc/systemd/system/juju-db.service
/etc/systemd/system/jujud-machine-0.service ]]
+ DIRTY=1
+ echo 'ERROR manual-provider: systemd services left behind.'
+ for service_path in '$found'
++ basename /etc/systemd/system/juju-db.service
+ service=juju-db.service
+ sudo systemctl stop --force juju-db.service
Failed to stop juju-db.service: Unit juju-db.service not loaded.
+ true
+ sudo systemctl disable juju-db.service
Failed to execute operation: No such file or directory
+ true
+ sudo rm /etc/systemd/system/juju-db.service
+ for service_path in '$found'
++ basename /etc/systemd/system/jujud-machine-0.service
+ service=jujud-machine-0.service
+ sudo systemctl stop --force jujud-machine-0.service
Failed to stop jujud-machine-0.service: Unit jujud-machine-0.service not loaded.
+ true
+ sudo systemctl disable jujud-machine-0.service
Failed to execute operation: No such file or directory
+ true
+ sudo rm /etc/systemd/system/jujud-machine-0.service
+ [[ -d /etc/init ]]
++ find /etc/init -name 'juju*' -print
+ found=
+ [[ -n '' ]]
+ [[ -d /var/lib/juju ]]
Cleaning completed
+ echo 'Cleaning completed'
+ exit 1
++ '[' 1 = 1 ']'
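
For readers who want to reproduce the check, the systemd part of the trace above can be approximated as below. This is a sketch reconstructed from the xtrace, not the actual CI script. The function names, the SUDO override, and the glob patterns (taken from the unit names listed in the bug description) are assumptions made for illustration.

```shell
# Sketch only: reconstructed from the xtrace, not the real CI cleanup script.
# Function names, the SUDO override and the glob patterns are assumptions.

# Run a command with sudo by default; set SUDO="" to run it directly.
run_privileged() {
    ${SUDO-sudo} "$@"
}

# Print leftover Juju unit files (or dangling symlinks) in directory $1.
find_leftover_units() {
    for path in "$1"/juju-db.service "$1"/jujud-*.service; do
        # An unmatched glob stays literal, so check that the path exists.
        [ -e "$path" ] || [ -L "$path" ] || continue
        printf '%s\n' "$path"
    done
}

# Stop, disable and remove every leftover unit in directory $1.
# Returns 1 if anything had to be cleaned (the host was dirty), else 0.
clean_leftover_units() {
    found=$(find_leftover_units "$1")
    [ -n "$found" ] || return 0
    echo 'ERROR manual-provider: systemd services left behind.'
    # Unit paths are assumed to contain no whitespace.
    for service_path in $found; do
        service=$(basename "$service_path")
        # stop/disable fail when the unit is not loaded, as in the trace.
        run_privileged systemctl stop --force "$service" || true
        run_privileged systemctl disable "$service" || true
        run_privileged rm "$service_path"
    done
    return 1
}
```

Calling `clean_leftover_units /etc/systemd/system` mirrors the traced run: it returns non-zero when units had to be removed, which is what makes the CI job fail.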

Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
assignee: Alexis Bruemmer (alexis-bruemmer) → Andrew Wilkins (axwalk)
Andrew Wilkins (axwalk) wrote :

I've fixed kill-controller so the bootstrap machine's systemd bits are cleaned up, but I've found an issue when removing manual machines. Something is wedging the machine agent so it never gets to the uninstall logic. Restarting the agent causes it to uninstall itself as expected.

Attaching a log from a machine agent which I sent SIGQUIT to, dumping the goroutine status.

Andrew Wilkins (axwalk) wrote :

I can no longer reproduce this on master. Maybe fixed by rogpeppe's recent changes to do with killing RPC requests.

Changed in juju-core:
status: In Progress → Fix Committed
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Aaron Bentley (abentley) wrote :

It appears Andrew's fix did not completely fix the issue. We've continued to observe it every month since then, though it is far less common than in August.

Changed in juju:
status: Fix Released → Triaged
Changed in juju:
milestone: 2.0-beta16 → 2.1.0
Andrew Wilkins (axwalk) wrote :

functional-assess-cloud-manual is using kill-controller (at the end), which is not guaranteed to clean up manual machines. It will work *if* it is able to talk to the API, but in this case the test is explicitly stopping jujud-machine-0 to exercise the can't-talk-to-API branch.

Aaron, IIRC we discussed this in Barcelona, and you were going to disable that bit for manual? It might have been a different test, but it's the same sort of scenario I think.

Changed in juju:
status: Triaged → Incomplete
milestone: 2.1.0 → none
Nicholas Skaggs (nskaggs) wrote :

@Aaron, can you speak to this? See Andrew's comment above about not using kill-controller. Is there a reason kill-controller is special in the manual case?

Aaron Bentley (abentley) wrote :

After discussion with Curtis, we believe the test is correct. If we could install juju over SSH, we should be able to remove it over SSH.

Andrew Wilkins (axwalk) wrote :

> After discussion with Curtis, we believe the test is correct. If we could install juju over SSH, we should be able to remove it over SSH.

I'll repeat what we discussed in Barcelona, hopefully it will ring a bell.

kill-controller is effectively the same as the old "destroy-environment --force", with a (IMO misguided) attempt to do the destruction "safely" via the API first. So you can think of kill-controller as "destroy-environment || destroy-environment --force".

Now, *IF* Juju can connect to the API then all will be well. The knowledge of manual machines is defined in the controller -- *only* in the controller -- so talking to the controller to clean up means that the knowledge of those manual machines is accessible.

In the fallback case, Juju is doing everything from the client side. The client is not necessarily the same one that added the manual machines. There is no knowledge of them at the client; it is all within the controller. This is unlike cloud providers, where the client can communicate directly with the cloud to identify resources related to the controller. There is nowhere to go, except the controller, which is inaccessible (or the code path wouldn't be taken).
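
The two code paths described above can be pictured with a toy shell model. It is purely illustrative: every function and variable name is hypothetical, none of it is Juju code, and it assumes the client knows only the bootstrap host while the controller knows every manual machine.

```shell
# Toy model only: all names are hypothetical and merely stand in for the
# behaviour described in this comment. No real Juju command is invoked.

# Machines known to the controller (bootstrap host plus manual machines).
CONTROLLER_MACHINES="bootstrap-host manual-1 manual-2"
# Set to 0 to model a stopped jujud-machine-0 (API unreachable).
API_REACHABLE=1

# "Safe" path: ask the controller, which knows about every manual machine.
destroy_via_api() {
    [ "$API_REACHABLE" -eq 1 ] || return 1
    for machine in $CONTROLLER_MACHINES; do
        echo "cleaned $machine"
    done
}

# Fallback path: the client has no record of manually added machines,
# so it can only act on the bootstrap host passed as $1.
destroy_from_client() {
    echo "cleaned $1"
}

# kill-controller modelled as "destroy-environment || destroy-environment --force".
kill_controller() {
    destroy_via_api || destroy_from_client "$1"
}
```

With `API_REACHABLE=1`, `kill_controller bootstrap-host` reports all three machines as cleaned. With `API_REACHABLE=0` it reports only `bootstrap-host`, which is the situation the test creates by stopping jujud-machine-0.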

I hope that clarifies things.

Aaron Bentley (abentley) wrote : Re: [Bug 1611453] Re: manual-provider: systemd services left behind

On 2017-01-24 06:55 PM, Andrew Wilkins wrote:
> There is nowhere to go, except the controller, which is
> inaccessible (or the code path wouldn't be taken).

It depends what you mean by "controller". In the
functional-assess-cloud-manual test, the jujud-machine-0 service is not
running, but the machine is running and accessible by ssh.

If you look at
http://reports.vapour.ws/releases/4761/job/functional-assess-cloud-manual/attempt/152
you'll see there are errors about machine 0 being dirty.

There are also errors about machine 1, and I agree that we can't
plausibly clean up machine 1 (unless we were to store information about
added machines somewhere other than mongodb, like machine 0's filesystem).

Aaron Bentley (abentley) wrote :

It appears that the errors I saw about machine 0 were about *default-model* machine 0, not controller machine 0, so my comments have a faulty basis.

Aaron Bentley (abentley)
Changed in juju:
status: Incomplete → Fix Released