Charm GC in 1.25.6 leaves breakage behind

Bug #1626304 reported by Kit Randel
26
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Menno Finlay-Smits
juju-core
Fix Released
Critical
Menno Finlay-Smits

Bug Description

Juju 1.25.6's unused charm garbage collection fails to remove all references to the old charm, causing unit agents to infinitely loop and exhaust all disk space in certain circumstances when a new version of the charm is deployed under a reused revision.

Specifically, a revision may be reused if the greatest revision of a charm becomes unreferenced and is GCed, letting PrepareLocalCharmUpload allocate the same revision for a second time. The state server's /var/lib/juju/charm-get-cache is indexed by revision, so deploying a charm by the same name but with different content will reuse the revision number, erroneously hit the cache, and return the old archive.

The unit agent notices the hash mismatch and retries the download attempt, but only after leaving a copy of bad file in /var/lib/juju/agents/unit-*/state/bundles/downloads/inprogress-*. After looping for a while, all disk space is exhausted by thousands of copies of the old charm archive.

On a fresh deployment, affected units show up as "failed" and "Waiting for agent initialization to finish", and at normal levels the logs will be silent except for "ERROR juju.worker.uniter.filter filter.go:137 tomb: dying" every few seconds. This can probably affect upgrade-charm too.

Steps to reproduce
------------------

juju deploy local:trusty/some-charm
# Wait for charm to install.
juju destroy-service some-charm
# Wait for 'running "charm" cleanup: "local:trusty/some-charm-1"' log message
touch trusty/some-charm/anything-that-will-cause-the-zip-to-change
juju deploy local:trusty/some-charm
# Wait for the unit agent to drop into failed, and note the hash mismatches in its log.

Workaround
----------

Delete /var/lib/juju/charm-get-cache/* on the state server and /var/lib/juju/agents/unit-*/state/bundles/downloads/inprogress-* on any affected units. The units should immediately recover without further intervention.

Tags: canonical-is
Revision history for this message
Kit Randel (blr) wrote :
description: updated
Revision history for this message
Kit Randel (blr) wrote :
Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

We've managed to resolve the situation (see below for the fix). There are several small failures that all contributed to make this bug hard to find and hard to fix.

The root cause is that the copy of the charm on machine 0 in /var/lib/juju/charm-get-cache/ was truncated. This meant that the machines due to host the service would download the charm, but the hash comparison would fail. This sent juju into an endless loop downloading the charm, failing the hash comparison, and downloading the charm again.

There are several issues here:

1) The downloaded charms were never cleared up on disk, resulting in a full disk.
2) The hash comparison log message was at the INFO level, which made it very hard to see amidst the clutter of other log messages. The relevant log messages were:

```
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:567 ModeContinue starting
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:59 resuming charm install
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:569 ModeContinue exiting
2016-09-22 03:33:15 INFO juju.worker.uniter modes.go:567 ModeInstalling local:trusty/snapdeltaservice-11 starting
2016-09-22 03:33:15 INFO juju.worker.uniter.operation executor.go:69 running operation install local:trusty/snapdeltaservice-11
2016-09-22 03:33:15 INFO juju.worker.uniter.operation executor.go:103 preparing operation "install local:trusty/snapdeltaservice-11"
2016-09-22 03:33:15 INFO juju.worker.uniter.charm bundles.go:60 downloading local:trusty/snapdeltaservice-11 from https://10.50.28.64:17070/environment/7da51b 76-a297-4b45-8b9b-fcfb224f3b90/charms?file=%2A&url=local%3Atrusty%2Fsnapdeltaservice-11 2016-09-22 03:33:16 INFO juju.worker.uniter.charm bundles.go:69 download complete
2016-09-22 03:33:16 DEBUG juju.worker.uniter modes.go:31 [AGENT-STATUS] failed: install local:trusty/snapdeltaservice-11
2016-09-22 03:33:16 INFO juju.worker.uniter modes.go:569 ModeInstalling local:trusty/snapdeltaservice-11 exiting
2016-09-22 03:33:16 INFO juju.worker.uniter uniter.go:203 unit "sds-worker-ra28611616d3e50aab788f8a2afbff62b18c02fe2/2" shutting down: ModeInstalling local:tr
usty/snapdeltaservice-11: preparing operation "install local:trusty/snapdeltaservice-11": failed to download charm "local:trusty/snapdeltaservice-11" from ["h
ttps://10.50.28.64:17070/environment/7da51b76-a297-4b45-8b9b-fcfb224f3b90/charms?file=%2A&url=local%3Atrusty%2Fsnapdeltaservice-11"]: expected sha256 "c556018
53a737783602d1e16b759d8e84b26f0e74df8c9036bb7162076152168", got "35372b4cd40573879f3007fc9a3657ad46aae49ecef10e81ff34c4563a98f66a"
2016-09-22 03:33:16 DEBUG juju.worker.uniter runlistener.go:109 juju-run listener stopping
2016-09-22 03:33:16 DEBUG juju.worker.uniter runlistener.go:129 juju-run listener stopped
```

The fix was to run:

sudo mv /var/lib/juju/charm-get-cache/local_3a_trusty_2f_snapdeltaservice-11.zip{,.EVIL}

on machine 0.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

This is great news \o/ Thank you :D

Changed in juju-core:
status: New → Invalid
Revision history for this message
Kit Randel (blr) wrote :

Hi Anastasia,

I'm curious why you've marked this invalid. While we managed to fix our deployment, the bugs listed above in juju are still present.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking it as "Won't Fix" because 1.25 is in Critical-bug-fixes-only mode.

Changed in juju-core:
status: Invalid → Won't Fix
Revision history for this message
William Grant (wgrant) wrote :

I think the root cause is a regression in 1.25.6.

https://bugs.launchpad.net/juju/+bug/1580418 added charm GC, but the archive blobs cached on the bootstrap node aren't cleared. PrepareStoreCharmUpload tries to create a unique charm revision by incrementing the maximum value found in the DB, but GC has obliterated all reference to charms unused by services, so it can end up reusing a revision that has existed in the past. When the unit agent asks the bootstrap agent for the archive by name, it looks it up in /var/lib/juju/charm-get-cache by name and returns the previous incarnation of that revision.

Revision history for this message
William Grant (wgrant) wrote :

The bug can be avoided by never letting the greatest revision of a charm be GCed (eg. deploying a dummy version of it before destroying and redeploying a service), and fixed after the fact by clearing /var/lib/juju/charm-get-cache on the state server.

For anyone trying to reproduce this, note that charm archival is apparently deterministic, so the local charm repo has to change at least slightly to trigger the bug.

summary: - Unit disk fills with inprogress-n files
+ Charm GC in 1.25.6 leaves breakage behind
William Grant (wgrant)
description: updated
description: updated
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
importance: Undecided → High
status: Won't Fix → In Progress
milestone: none → 1.25.7
Changed in juju:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Menno Smits (menno.smits)
milestone: none → 2.0-rc2
Changed in juju-core:
importance: High → Critical
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The charm-get-cache is unnecesssary these days and has already been removed in 2.0. I'm going to backport that change to 1.25 now.

The /var/lib/juju/agents/unit-*/state/bundles/downloads/inprogress-* files being left behind on the unit agents shouldn't be too hard to fix. This needs to be addressed in both 1.25 and 2.0.

Revision history for this message
Kit Randel (blr) wrote :

Thanks Menno, that's great to hear.

William Grant (wgrant)
description: updated
description: updated
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Removal of the apiserver charm get cache: https://github.com/juju/juju/pull/6309

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Removal of uniter temp download files: https://github.com/juju/juju/pull/6314

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Removal of uniter temp download files for 2.x: https://github.com/juju/juju/pull/6325

Changed in juju:
status: In Progress → Fix Committed
Paul Gear (paulgear)
tags: added: canonical-is
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.