remove-juju-services sometimes fails to make a machine provisionable in juju 2.7

Bug #1884331 reported by Vern Hart
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Medium
Assigned to: Achilleas Anagnostopoulos
Milestone: 2.7.8

Bug Description

I've been playing around with manually added machines (so that we can manage the maas nodes with juju for monitoring purposes) and I've gotten into a situation where remove-juju-services doesn't work.

On one machine, after removing the machine from the model, I run remove-juju-services and see some failures:

    $ sudo /sbin/remove-juju-services
    removing juju service: /etc/systemd/system/jujud-machine-1-exec-start.sh
    Failed to stop jujud-machine-1-exec-start.sh.service: Unit jujud-machine-1-exec-start.sh.service not loaded.
    Failed to disable unit: Unit file jujud-machine-1-exec-start.sh.service does not exist.
    removing juju service: /etc/systemd/system/jujud-machine-1.service
    Removed /etc/systemd/system/multi-user.target.wants/jujud-machine-1.service.
    removing /var/lib/juju/db/*
    removing /var/lib/juju/raft/*

If I re-add the machine at this point I get the "machine is already provisioned" error.

If I then do a little extra cleanup, it works:

    sudo rm -rf /etc/systemd/system/jujud-machine*

While not ideal, that's a workable solution. However, on another system I can't seem to do anything to make juju want to be friends with it.

    $ sudo /sbin/remove-juju-services
    ls: cannot access '/etc/systemd/system/juju*': No such file or directory
    removing /var/lib/juju/db/*
    removing /var/lib/juju/raft/*

When I try add-machine, I get: ERROR machine is already provisioned. I've double- and triple-checked that I'm adding the same machine I'm trying to clean up.

Here are all the files and directories that start with "juju":

    /snap/bin/juju
    /snap/juju
    /snap/juju/12370/bash_completions/juju
    /snap/juju/12370/bash_completions/juju-version
    /snap/juju/12370/bin/juju
    /snap/juju/12370/bin/juju-metadata
    /snap/juju/12370/bin/jujuc
    /snap/juju/12370/bin/jujud
    /snap/juju/12370/bin/jujud-versions.yaml
    /var/snap/juju
    /var/lib/snapd/sequence/juju.json
    /var/lib/snapd/snaps/juju_12370.snap
    /var/lib/lxcfs/cgroup/memory/system.slice/jujud-machine-5.service
    /var/lib/lxcfs/cgroup/cpu,cpuacct/system.slice/jujud-machine-5.service
    /var/lib/lxcfs/cgroup/devices/system.slice/jujud-machine-5.service
    /var/lib/lxcfs/cgroup/blkio/system.slice/jujud-machine-5.service
    /var/log/maas/rsyslog/juju-3
    /var/log/maas/rsyslog/juju-1
    /var/log/maas/rsyslog/juju-2
    /var/log/libvirt/qemu/juju-2-serial0.log.1
    /var/log/libvirt/qemu/juju-2.log.2.gz
    /var/log/libvirt/qemu/juju-2.log.1
    /var/log/libvirt/qemu/juju-2.log.3.gz
    /var/log/libvirt/qemu/juju-2-serial0.log
    /var/log/libvirt/qemu/juju-2-serial0.log.2.gz
    /var/log/libvirt/qemu/juju-2.log
    /sys/fs/cgroup/memory/system.slice/jujud-machine-5.service
    /sys/fs/cgroup/cpu,cpuacct/system.slice/jujud-machine-5.service
    /sys/fs/cgroup/devices/system.slice/jujud-machine-5.service
    /sys/fs/cgroup/blkio/system.slice/jujud-machine-5.service
    /run/snapd/lock/juju.lock
    /run/libvirt/qemu/juju-2.xml
    /run/libvirt/qemu/juju-2.pid
    /root/snap/juju
    /etc/profile.d/juju-introspection.sh
    /etc/profile.d/juju-proxy.sh
    /etc/juju-proxy.conf
    /etc/juju-proxy-systemd.conf
    /etc/libvirt/qemu/juju-2.xml
    /etc/libvirt/qemu/autostart/juju-2.xml
    /etc/sudoers.d/jujumanage
    /usr/lib/python3/dist-packages/landscape/lib/juju.py
    /usr/lib/python3/dist-packages/landscape/lib/__pycache__/juju.cpython-36.pyc
    /usr/share/bash-completion/completions/juju-version
    /usr/share/bash-completion/completions/juju
    /usr/share/sosreport/sos/plugins/juju.py
    /usr/share/sosreport/sos/plugins/__pycache__/juju.cpython-36.pyc
    /usr/bin/juju-updateseries
    /usr/bin/juju-introspect
    /usr/bin/juju-dumplogs
    /usr/bin/juju-run

Nothing seems too suspicious in there except for the jujud-machine-5.service directories, but those are all on virtual file systems (lxcfs and cgroup) and I'm not sure how to remove them. This system has a libvirt juju-2 VM, so there's some of that in there.

I'm not sure what else to look for. If I add --debug to the add-machine command, it doesn't give much more output:

    $ juju add-machine --debug ssh:root@10.93.0.12 -m maas-infra
    23:16:00 INFO juju.cmd supercommand.go:83 running juju [2.7.6 gc go1.10.4]
    23:16:00 DEBUG juju.cmd supercommand.go:84 args: []string{"/snap/juju/11454/bin/juju", "add-machine", "--debug", "ssh:root@10.93.0.12", "-m", "maas-infra"}
    23:16:00 INFO juju.juju api.go:67 connecting to API addresses: [10.93.0.62:17070 10.93.0.80:17070 10.93.0.46:17070]
    23:16:00 DEBUG juju.api apiclient.go:1092 successfully dialed "wss://10.93.0.46:17070/model/13c2de68-d9a1-4aab-8ef6-4858a71a1c19/api"
    23:16:00 INFO juju.api apiclient.go:624 connection established to "wss://10.93.0.46:17070/model/13c2de68-d9a1-4aab-8ef6-4858a71a1c19/api"
    23:16:00 INFO juju.cmd.juju.machine add.go:246 load config
    23:16:00 INFO juju.juju api.go:67 connecting to API addresses: [10.93.0.46:17070 10.93.0.80:17070 10.93.0.62:17070]
    23:16:00 DEBUG juju.api apiclient.go:1092 successfully dialed "wss://10.93.0.80:17070/model/13c2de68-d9a1-4aab-8ef6-4858a71a1c19/api"
    23:16:00 INFO juju.api apiclient.go:624 connection established to "wss://10.93.0.80:17070/model/13c2de68-d9a1-4aab-8ef6-4858a71a1c19/api"
    23:16:00 INFO juju.juju api.go:302 API endpoints changed from [10.93.0.80:17070 10.93.0.46:17070 10.93.0.62:17070] to [10.93.0.80:17070 10.93.0.62:17070 10.93.0.46:17070]
    23:16:00 INFO cmd authkeys.go:114 Adding contents of "/home/ubuntu/.local/share/juju/ssh/juju_id_rsa.pub" to authorized-keys
    23:16:00 INFO cmd authkeys.go:114 Adding contents of "/home/ubuntu/.ssh/id_rsa.pub" to authorized-keys
    23:16:00 INFO juju.environs.manual.sshprovisioner sshprovisioner.go:43 initialising "10.93.0.12", user "root"
    23:16:00 DEBUG juju.utils.ssh ssh.go:305 using OpenSSH ssh client
    23:16:00 INFO juju.environs.manual.sshprovisioner sshprovisioner.go:54 ubuntu user is already initialised
    23:16:00 INFO juju.environs.manual.sshprovisioner sshprovisioner.go:167 Checking if 10.93.0.12 is already provisioned
    23:16:00 DEBUG juju.utils.ssh ssh.go:305 using OpenSSH ssh client
    23:16:01 DEBUG juju.api monitor.go:35 RPC connection died
    23:16:01 DEBUG juju.api monitor.go:35 RPC connection died
    ERROR machine is already provisioned
    23:16:01 DEBUG cmd supercommand.go:519 error stack:
    machine is already provisioned
    /build/juju/parts/juju/go/src/github.com/juju/juju/cmd/juju/machine/add.go:384:

(And: no, that's not truncated. It seems like it is but that's all I get.)

What else does juju check for to see if the machine is already provisioned?

Pen Gale (pengale) wrote :

The script that checks whether a machine is already provisioned looks for files that start with the juju machine name, and also checks running services. Is there a ghost service running?

The other possibility is that the check script is exiting with an error for some reason, and Juju is backing out due to the error.

Pen Gale (pengale)
Changed in juju:
status: New → Incomplete
Drew Freiberger (afreiberger) wrote :

@petevg:

I'm picking up this environment from Vern, and it appears that in juju 2.7.6 sshprovisioner runs 'systemctl -a' and looks for any instance of the word "juju":

https://github.com/juju/juju/blob/juju-2.7.6/environs/manual/sshprovisioner/sshprovisioner.go#L184

However, this has been patched to look for "jujud-machine" in later revisions.

The check run on the remote, ssh-provisioned machine is finding snap.*juju.*.mount units in the systemctl output, and those contain "juju".

This is resolved in 2.8 and later.

https://github.com/juju/juju/blob/2.8/environs/manual/sshprovisioner/sshprovisioner.go#L184

Can we get a backport of that fix to the 2.7.x line for 2.7.8?

For now, we'll have to either remove the juju* snaps from the machines we want to manually provision, or determine whether it's safe for the environment to upgrade to juju 2.8.
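
To illustrate why the broader match trips on a machine that only has the juju client snap, here is a minimal shell sketch. It is not the sshprovisioner script itself: the function name is_provisioned and the sample unit names are made up for illustration, and only the two search strings ("juju" and "jujud-machine") come from the comment above.

```shell
# Hypothetical helper: reads unit names on stdin (one per line) and reports
# whether any name contains the given search string.
is_provisioned() {
    if grep -q -e "$1"; then
        echo "provisioned"
    else
        echo "not provisioned"
    fi
}

# Illustrative unit names for a machine with the juju client snap installed
# but no machine agent.
units="ssh.service
snapd.service
snap-juju-12370.mount"

echo "$units" | is_provisioned "juju"            # prints "provisioned"
echo "$units" | is_provisioned "jujud-machine"   # prints "not provisioned"
```

With the narrower string, only an actual machine agent unit such as jujud-machine-5.service would count as evidence of a provisioned machine.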

Changed in juju:
status: Incomplete → Confirmed
summary: - remove-juju-services sometimes fails to make a machine provisionable
+ remove-juju-services sometimes fails to make a machine provisionable in
+ juju 2.7
Drew Freiberger (afreiberger) wrote :

FYI, this is what I did as a workaround to ensure my model was still 2.7.x:

    sudo snap switch juju --channel 2.7/stable
    juju add-model <model-name> # if not already existing
    sudo snap switch juju --channel 2.8/stable
    juju add-machine ssh:user@ip -m <model-name>
    sudo snap switch juju --channel 2.7/stable

To explain the user story for this issue: the infrastructure nodes of our cloud product must be monitored with juju charms such as nrpe/filebeat/telegraf/hw-health/infra-node. These infrastructure nodes are manually deployed hardware hosting MAAS, KVM pods, and a Postgres/Pacemaker cluster, along with the snapped juju client and snap-delivered tools like juju-wait, juju-lint, and juju-crashdump, hence the snap.*juju.*.mount services in systemd.

Vern Hart (vern) wrote :

I realize this is fixed in 2.8 but I'm running 2.7 so I dug in a little bit.

Upon closer inspection, this is the effective command that is being executed:

    systemctl list-unit-files --no-legend --no-page -l -t service |
        grep -o -P '^\w[\S]*(?=\.service)'

The output of that is searched for the simple string "juju" and it finds:

    snap.juju.fetch-oci

We don't have microk8s and that seems to be the only purpose of fetch-oci so it is safe to remove:

    rm /etc/systemd/system/snap.juju.fetch-oci.service
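
To find every service name that would trip the 2.7 check before deciding what to remove, the command above can be wrapped in a small filter. This is only a sketch: the function name juju_named_services is hypothetical, and the printf input stands in for real systemctl output.

```shell
# Hypothetical helper: reads `systemctl list-unit-files ... -t service`
# output on stdin and prints the service names that contain "juju".
# Requires GNU grep built with PCRE support (-P).
juju_named_services() {
    grep -o -P '^\w[\S]*(?=\.service)' | grep 'juju'
}

# Illustrative input in the "UNIT-FILE STATE" layout:
printf '%s\n' \
    'ssh.service enabled' \
    'snap.juju.fetch-oci.service enabled' |
    juju_named_services                      # prints "snap.juju.fetch-oci"

# On the target machine, feed it the real unit list instead:
#   systemctl list-unit-files --no-legend --no-page -l -t service | juju_named_services
```

An empty result suggests the service-name part of the check should no longer report the machine as provisioned.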

Pen Gale (pengale)
Changed in juju:
milestone: none → 2.7.8
Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
Changed in juju:
importance: Undecided → Medium
status: Confirmed → In Progress
Achilleas Anagnostopoulos (achilleasa) wrote :

PR https://github.com/juju/juju/pull/11827 back-ports some of the 2.8 changes to make the cleanup script more robust.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Ian Booth (wallyworld)
Changed in juju:
status: Fix Released → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released