Charm GC in 1.25.6 leaves breakage behind
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
Menno Finlay-Smits | ||
juju-core |
Fix Released
|
Critical
|
Menno Finlay-Smits |
Bug Description
Juju 1.25.6's unused charm garbage collection fails to remove all references to the old charm, causing unit agents to infinitely loop and exhaust all disk space in certain circumstances when a new version of the charm is deployed under a reused revision.
Specifically, a revision may be reused if the greatest revision of a charm becomes unreferenced and is GCed, letting PrepareLocalCha
The unit agent notices the hash mismatch and retries the download attempt, but only after leaving a copy of bad file in /var/lib/
On a fresh deployment, affected units show up as "failed" and "Waiting for agent initialization to finish", and at normal levels the logs will be silent except for "ERROR juju.worker.
Steps to reproduce
------------------
juju deploy local:trusty/
# Wait for charm to install.
juju destroy-service some-charm
# Wait for 'running "charm" cleanup: "local:
touch trusty/
juju deploy local:trusty/
# Wait for the unit agent to drop into failed, and note the hash mismatches in its log.
Workaround
----------
Delete /var/lib/
description: | updated |
description: | updated |
Changed in juju-core: | |
assignee: | nobody → Menno Smits (menno.smits) |
importance: | Undecided → High |
status: | Won't Fix → In Progress |
milestone: | none → 1.25.7 |
Changed in juju: | |
status: | New → In Progress |
importance: | Undecided → High |
assignee: | nobody → Menno Smits (menno.smits) |
milestone: | none → 2.0-rc2 |
Changed in juju-core: | |
importance: | High → Critical |
description: | updated |
description: | updated |
Changed in juju-core: | |
status: | In Progress → Fix Committed |
tags: | added: canonical-is |
Changed in juju: | |
status: | Fix Committed → Fix Released |
Changed in juju-core: | |
status: | Fix Committed → Fix Released |
We've managed to resolve the situation (see below for the fix). There are several small failures that all contributed to make this bug hard to find and hard to fix.
The root cause is that the copy of the charm on machine 0 in /var/lib/ juju/charm- get-cache/ was truncated. This meant that the machines due to host the service would download the charm, but the hash comparison would fail. This sent juju into an endless loop downloading the charm, failing the hash comparison, and downloading the charm again.
There are several issues here:
1) The downloaded charms were never cleared up on disk, resulting in a full disk.
2) The hash comparison log message was at the INFO level, which made it very hard to see amidst the clutter of other log messages. The relevant log messages were:
``` snapdeltaservic e-11 starting uniter. operation executor.go:69 running operation install local:trusty/ snapdeltaservic e-11 uniter. operation executor.go:103 preparing operation "install local:trusty/ snapdeltaservic e-11" uniter. charm bundles.go:60 downloading local:trusty/ snapdeltaservic e-11 from https:/ /10.50. 28.64:17070/ environment/ 7da51b 76-a297- 4b45-8b9b- fcfb224f3b90/ charms? file=%2A& url=local% 3Atrusty% 2Fsnapdeltaserv ice-11 2016-09-22 03:33:16 INFO juju.worker. uniter. charm bundles.go:69 download complete snapdeltaservic e-11 snapdeltaservic e-11 exiting ra28611616d3e50 aab788f8a2afbff 62b18c02fe2/ 2" shutting down: ModeInstalling local:tr ervice- 11: preparing operation "install local:trusty/ snapdeltaservic e-11": failed to download charm "local: trusty/ snapdeltaservic e-11" from ["h 10.50.28. 64:17070/ environment/ 7da51b76- a297-4b45- 8b9b-fcfb224f3b 90/charms? file=%2A& url=local% 3Atrusty% 2Fsnapdeltaserv ice-11" ]: expected sha256 "c556018 16b759d8e84b26f 0e74df8c9036bb7 162076152168" , got "35372b4cd40573 879f3007fc9a365 7ad46aae49ecef1 0e81ff34c4563a9 8f66a"
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:567 ModeContinue starting
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:59 resuming charm install
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:569 ModeContinue exiting
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:567 ModeInstalling local:trusty/
2016-09-22 03:33:15 INFO juju.worker.
2016-09-22 03:33:15 INFO juju.worker.
2016-09-22 03:33:15 INFO juju.worker.
2016-09-22 03:33:16 DEBUG juju.worker.uniter modes.go:31 [AGENT-STATUS] failed: install local:trusty/
2016-09-22 03:33:16 INFO juju.worker.uniter modes.go:569 ModeInstalling local:trusty/
2016-09-22 03:33:16 INFO juju.worker.uniter uniter.go:203 unit "sds-worker-
usty/snapdeltas
ttps://
53a737783602d1e
2016-09-22 03:33:16 DEBUG juju.worker.uniter runlistener.go:109 juju-run listener stopping
2016-09-22 03:33:16 DEBUG juju.worker.uniter runlistener.go:129 juju-run listener stopped
```
The fix was to run:
sudo mv /var/lib/ juju/charm- get-cache/ local_3a_ trusty_ 2f_snapdeltaser vice-11. zip{,.EVIL}
on machine 0.