Sending a SIGABRT to jujud process causes jujud to uninstall (wiping /var/lib/juju)

Bug #1464304 reported by Jorge Niedbalski
This bug affects 3 people
Affects    Status        Importance  Assigned to     Milestone
juju-core  Fix Released  High        Andrew Wilkins
1.25       Fix Released  High        Andrew Wilkins

Bug Description

[Environment]

This has been observed in 2 different environments:

Both Trusty 14.04.2

Juju-core 1.23.3
Juju-core 1.20.9

[Description]

We initially hit this issue by running the following sequence on the bootstrap node with 1.23.3.
This is not a normal operation on any Juju installation, but it led us to discover the issue.

0) unlink /var/lib/juju/tools/machine-0
1) ln -s /var/lib/juju/tools/1.23.2-trusty-amd64/ /var/lib/juju/tools/machine-0
2) Edit agent.conf on the machine, setting upgradedTo: 1.23.2
3) $ restart jujud-machine-0

Then the following log entries were printed:

2015-06-10 10:03:45 INFO juju.mongo open.go:104 dialled mongo successfully
2015-06-10 10:03:45 ERROR juju.worker runner.go:207 fatal "state": agent should be terminated
2015-06-10 10:03:45 DEBUG juju.worker runner.go:241 killing "statestarter"
2015-06-10 10:03:45 DEBUG juju.worker runner.go:241 killing "termination"
2015-06-10 10:03:47 INFO juju.worker runner.go:260 start "api"
2015-06-10 10:03:47 INFO juju.state.api apiclient.go:242 dialing "wss://localhost:17070/"
2015-06-10 10:03:47 INFO juju.state.api apiclient.go:250 error dialing "wss://localhost:17070/": websocket.Dial wss://localhost:17070/: dial tcp 127.0.0.1:17070: connection refused
2015-06-10 10:03:47 ERROR juju.worker runner.go:218 exited "api": unable to connect to "wss://localhost:17070/"
2015-06-10 10:03:50 ERROR juju.cmd supercommand.go:323 uninstall failed: [remove /var/lib/juju: directory not empty]
/bin/sh: 1: exec: /var/lib/juju/tools/machine-0/jujud: not found

At this point Juju had been uninstalled. We then discovered that running 'killall -SIGABRT jujud' causes Juju to uninstall itself.

machine-2[20270]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:219 exited "api": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker.uniter.filter filter.go:137 watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker.uniter modes.go:222 error while stopping hooks: hook source stopped providing updates
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "upgrader": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "rsyslog": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "proxyupdater": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "logger": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "apiaddressupdater": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:208 fatal "uniter": watcher has been stopped
unit-percona-cluster-0[23184]: 2015-06-11 15:18:22 ERROR juju.worker runner.go:219 exited "api": watcher has been stopped

At this point /var/lib/juju has been removed from the system.

[ Suggestion ]

Currently the provisioner has a 'provisioner-safe-mode', which by default prevents
Juju from taking over an environment in case of any failure.

I would like to suggest something similar for the machine agent workers, for example a
'workers-safe-mode' that prevents jujud from uninstalling itself in case of any worker error.

Tags: sts
tags: added: cts
Felipe Reyes (freyes)
tags: added: sts
Nate Finch (natefinch) wrote :

I was able to repro this on 1.24 using a local environment, and running killall -SIGABRT jujud in one of the containers brought up by add-machine.

This is the key line in the log:

2015-06-11 16:19:57 ERROR juju.worker runner.go:208 fatal "termination": agent should be terminated

Curtis Hovey (sinzui) wrote :

Wow. Thank you for this bug report. CI has lost jujud and we suspected SIGABRT. These reproducible steps are a fine outline of how to shoot yourself in the foot.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
Curtis Hovey (sinzui) wrote :

The use of SIGABRT is by design. I think the issue here is that there was no intent to send SIGABRT. This issue has come up before, and one suggestion was to use SIGUSR1 or SIGUSR2, because uninstalling is clearly a surprising behaviour for aborting an operation.

Gema Gomez (gema) wrote :

Curtis, will this design decision be rectified, then?

Jorge Niedbalski (niedbalski) wrote :

I really think that uninstalling the agent on SIGABRT, or on any signal, without a safe mode or
any caution is still a very, very dangerous thing for production environments.

Curtis Hovey (sinzui) wrote :

@gema, which decision?

That ABRT is the signal to uninstall, or that Juju uninstalled without human permission? Certainly the latter case is a bug. For the former case, while ABRT is not my preferred solution (I like USR1), Juju can still choose any signal to uninstall itself. I think Juju should ask permission to commit seppuku.

Nate Finch (natefinch) wrote :

I agree with Jorge... I think there are better ways to tell juju to uninstall that won't be hit by accident. Unless we can name a signal SIGPLEASEUNINSTALLJUJU, I think we need a different interface for this. We know jujud is installed on the machine, so why not just make uninstall a jujud command? jujud uninstall -y ... just as easy to run via SSH, but much clearer as to what will happen when the command runs.
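
A rough sketch of what an explicit, confirmed uninstall subcommand could look like, using only Go's standard flag package. No such command is confirmed to exist in jujud by this report; the confirmUninstall helper and the handling of -y shown here are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// confirmUninstall parses the arguments of a hypothetical "uninstall"
// subcommand and reports whether the operator passed -y to confirm.
func confirmUninstall(args []string) (bool, error) {
	fs := flag.NewFlagSet("uninstall", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the demo output
	yes := fs.Bool("y", false, "confirm removal of the agent and its data")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *yes, nil
}

func main() {
	// Sample invocations: without and with the confirmation flag.
	for _, args := range [][]string{{}, {"-y"}} {
		ok, err := confirmUninstall(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if !ok {
			fmt.Println("refusing to uninstall without -y")
			continue
		}
		fmt.Println("uninstall confirmed")
	}
}
```

Unlike a signal, a command of this kind states its intent in its name and cannot be triggered by a generic 'killall'.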

Nate Finch (natefinch) wrote :

Note, we just had another customer who seems to have run into this same thing, in his production environment.

tags: removed: cts
Andrew Wilkins (axwalk) wrote :

We're probably better off not using a signal at all; just touch a file in the data-dir and the agent can watch for it and uninstall upon finding it. The main difficulty now is that we have to continue supporting destruction of old environments with SIGABRT -- but that's restricted to the local and manual providers at least.

Changed in juju-core:
assignee: nobody → Andrew Wilkins (axwalk)
Tim Penhey (thumper)
Changed in juju-core:
importance: Medium → High
Ian Booth (wallyworld)
Changed in juju-core:
milestone: none → 1.26-alpha1
Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
Andrew Wilkins (axwalk) wrote :
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released