Sending a SIGABRT to jujud process causes jujud to uninstall (wiping /var/lib/juju)
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Andrew Wilkins | |
| 1.25 | | High | Andrew Wilkins | |
Bug Description
[Environment]
This has been observed in 2 different environments:
Both Trusty 14.04.2
Juju-core 1.23.3
Juju-core 1.20.9
[Description]
We first hit this issue by running the following sequence on the bootstrap node with 1.23.3.
This is not a normal operation on any juju installation, but it led us to discover the issue.
0) unlink /var/lib/
1) ln -s /var/lib/
2) Edit the agent.conf on the machine, pointing the upgradedTo: 1.23.2
3) $ restart jujud-machine-0
Then the following log entries were printed:
2015-06-10 10:03:45 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-10 10:03:45 ERROR juju.worker runner.go:207 fatal "state": agent should be terminated
2015-06-10 10:03:45 DEBUG juju.worker runner.go:241 killing "statestarter"
2015-06-10 10:03:45 DEBUG juju.worker runner.go:241 killing "termination"
2015-06-10 10:03:47 INFO juju.worker runner.go:260 start "api"
2015-06-10 10:03:47 INFO juju.state.api apiclient.go:242 dialing "wss://
2015-06-10 10:03:47 INFO juju.state.api apiclient.go:250 error dialing "wss://
2015-06-10 10:03:47 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://
2015-06-10 10:03:50 ERROR juju.cmd supercommand.go:323 uninstall failed: [remove /var/lib/juju: directory not empty]
/bin/sh: 1: exec: /var/lib/
From this point juju was uninstalled. We then discovered that running 'killall -SIGABRT jujud' causes juju to uninstall itself.
machine-2[20270]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:219 exited "api": watcher has been stopped
unit-percona-
unit-percona-
unit-percona-
unit-percona-
unit-percona-
unit-percona-
unit-percona-
unit-percona-
unit-percona-
At this point /var/lib/juju has been removed from the system.
[ Suggestion ]
Currently the provisioner has a 'provisioner-
juju to take over an environment in case of any failure.
I suggest adding something similar for the machine agent workers, perhaps 'workers-safe-mode', that prevents
jujud from uninstalling itself when a worker returns an error.
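A minimal sketch of how such a guard might behave, assuming a boolean safe-mode setting. The names `errTerminateAgent` and `shouldUninstall` are hypothetical and are not taken from the Juju code base; the error text simply mirrors the "agent should be terminated" message in the logs above.

```go
package main

import (
	"errors"
	"fmt"
)

// errTerminateAgent is a hypothetical stand-in for the worker error that
// is logged as "agent should be terminated".
var errTerminateAgent = errors.New("agent should be terminated")

// shouldUninstall reports whether a worker error may trigger an uninstall.
// With safeMode enabled the answer is always false, so the data directory
// is left in place no matter what the worker returned.
func shouldUninstall(workerErr error, safeMode bool) bool {
	if safeMode {
		return false
	}
	return errors.Is(workerErr, errTerminateAgent)
}

func main() {
	fmt.Println("safe mode off:", shouldUninstall(errTerminateAgent, false))
	fmt.Println("safe mode on: ", shouldUninstall(errTerminateAgent, true))
}
```

With safe mode on, the agent would still stop, but removing its data would need a separate, explicit step by an operator.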
| tags: | added: cts |
| tags: | added: sts |
| Nate Finch (natefinch) wrote : | #1 |
| Curtis Hovey (sinzui) wrote : | #2 |
Wow. Thank you for this bug report. CI has lost jujud and we suspected SIGABRT. These reproducible steps are a fine outline of how to shoot yourself in the foot.
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → Medium |
| Curtis Hovey (sinzui) wrote : | #3 |
The use of SIGABRT is by design. I think the issue here is that there was no intent to send SIGABRT. This issue has come up before, and one suggestion was to use SIGUSR1 or SIGUSR2, because uninstalling is clearly a surprising behaviour for aborting an operation.
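As a rough illustration of the dedicated-signal idea, the sketch below maps signals to agent actions so that SIGABRT only stops the process and SIGUSR1 is the one that requests an uninstall. The `action` type and `actionFor` function are hypothetical, this is not how jujud is implemented, and wiring the mapping to real signals (for example with `signal.Notify` from `os/signal`) is left out. It assumes a Unix-like system where `syscall.SIGUSR1` is defined.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// action is a hypothetical label for what the agent does on a signal.
type action string

const (
	actionNone      action = "none"
	actionStop      action = "stop"
	actionUninstall action = "uninstall"
)

// actionFor maps a received signal to an action. Only SIGUSR1 requests an
// uninstall; SIGABRT and SIGTERM stop the agent but leave its data alone.
func actionFor(sig os.Signal) action {
	switch sig {
	case syscall.SIGUSR1:
		return actionUninstall
	case syscall.SIGABRT, syscall.SIGTERM:
		return actionStop
	default:
		return actionNone
	}
}

// describe returns the actions chosen for SIGABRT, SIGUSR1 and SIGHUP,
// in that order.
func describe() []action {
	return []action{
		actionFor(syscall.SIGABRT),
		actionFor(syscall.SIGUSR1),
		actionFor(syscall.SIGHUP),
	}
}

func main() {
	fmt.Println(describe())
}
```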
| Gema Gomez (gema) wrote : | #4 |
Curtis, will this design decision be rectified, then?
| Jorge Niedbalski (niedbalski) wrote : | #5 |
I really think that uninstalling the agent on SIGABRT, or on any signal, without a safe mode or
other safeguard is still very dangerous for production environments.
| Curtis Hovey (sinzui) wrote : | #6 |
@gema, which decision?
That ABRT is the signal to uninstall, or that Juju uninstalled without human permission? The latter is certainly a bug. As for the former, while ABRT is not my preferred solution (I like USR1), Juju can still choose any signal to trigger uninstalling itself. I think Juju should ask permission to commit seppuku.
| Nate Finch (natefinch) wrote : | #7 |
I agree with Jorge... I think there are better ways to tell juju to uninstall, that won't be hit by accident. Unless we can name a signal SIGPLEASEUNINST
| Nate Finch (natefinch) wrote : | #8 |
Note, we just had another customer who seems to have run into this same thing, in his production environment.
| tags: | removed: cts |
| Andrew Wilkins (axwalk) wrote : | #9 |
We're probably better off not using a signal at all; just touch a file in the data-dir and the agent can watch for it and uninstall upon finding it. The main difficulty now is that we have to continue supporting destruction of old environments with SIGABRT -- but that's restricted to the local and manual providers at least.
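The marker-file approach described above could be sketched roughly as follows. The file name `uninstall-agent` and the helper names are assumptions made for illustration, not the names used in the actual fix, and the demo runs against a temporary directory rather than the real data-dir.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// uninstallMarker is a hypothetical file name, not one taken from Juju.
const uninstallMarker = "uninstall-agent"

// uninstallRequested reports whether the marker file exists in dataDir.
// A missing file means "do not uninstall"; any other stat error is returned.
func uninstallRequested(dataDir string) (bool, error) {
	_, err := os.Stat(filepath.Join(dataDir, uninstallMarker))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}

// runDemo checks a temporary directory before and after the marker file
// is created, and returns both results.
func runDemo() (before, after bool, err error) {
	dir, err := os.MkdirTemp("", "agent-data")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	if before, err = uninstallRequested(dir); err != nil {
		return false, false, err
	}
	marker := filepath.Join(dir, uninstallMarker)
	if err = os.WriteFile(marker, nil, 0o600); err != nil {
		return false, false, err
	}
	after, err = uninstallRequested(dir)
	return before, after, err
}

func main() {
	before, after, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("before marker:", before, "after marker:", after)
}
```

An agent would typically check for the marker at shutdown or poll for it periodically. Either way, creating the file is a deliberate act, which a stray signal is not.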
| Changed in juju-core: | |
| assignee: | nobody → Andrew Wilkins (axwalk) |
| Changed in juju-core: | |
| importance: | Medium → High |
| Changed in juju-core: | |
| milestone: | none → 1.26-alpha1 |
| Changed in juju-core: | |
| status: | Triaged → In Progress |
| Andrew Wilkins (axwalk) wrote : | #10 |
Proposed fix: https:/
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |


I was able to repro this on 1.24 using a local environment, and running killall -SIGABRT jujud in one of the containers brought up by add-machine.
This is the key line in the log:
2015-06-11 16:19:57 ERROR juju.worker runner.go:208 fatal "termination": agent should be terminated