2.5.0: race condition when upgrading multiple charms in the same machine

Bug #1813044 reported by Guillermo Gonzalez
This bug affects 6 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Heather Lanigan
Milestone: 2.5.2

Bug Description

When doing a charm upgrade on a machine running haproxy and 4 subordinates, we get the error below. The upgrade is done for haproxy and nrpe, both deployed to the same machine:

running: juju upgrade-charm nrpe
retcode: 1
 stdout:
 stderr: Added charm "cs:xenial/nrpe-11" to the model.
ERROR unable to set charm profile: upgrade charm profile already in process for machine 5, profile from ""

information type: Public → Private
Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

This looks like a race between upgrading the charms for haproxy and nrpe (subordinate) at nearly the same time. Trying to upgrade nrpe again after this error did work.

summary: - race condition when upgrading multiple charms in the same machine
+ 2.5.0: race condition when upgrading multiple charms in the same machine
Changed in juju:
milestone: none → 2.5.1
description: updated
information type: Private → Public
summary: - 2.5.0: race condition when upgrading multiple charms in the same machine
+ 2.5.0: lxd profile race condition when upgrading multiple charms in the
+ same machine
summary: - 2.5.0: lxd profile race condition when upgrading multiple charms in the
- same machine
+ 2.5.0: race condition when upgrading multiple charms in the same machine
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Can be reproduced with:
  juju deploy -n 6 ./testcharms/charm-repo/quantal/lxd-profile
  juju deploy ./testcharms/charm-repo/quantal/lxd-profile-subordinate
  juju add-relation lxd-profile lxd-profile-subordinate
  juju deploy ~/charms/ubuntu --to 0
  juju add-unit ubuntu -n 5 --to 1,2,3,4,5
  juju deploy ~/charms/ntp
  juju add-relation ntp ubuntu
  - let config settle

As "one command" upgrade the charms at once:
  juju upgrade-charm lxd-profile-subordinate --path ./testcharms/charm-repo/quantal/lxd-profile-subordinate; juju upgrade-charm lxd-profile --path ./testcharms/charm-repo/quantal/lxd-profile ; juju upgrade-charm ntp --path ~/charms/ntp ; juju upgrade-charm ubuntu --path ~/charms/ubuntu

WARNING making "testcharms/charm-repo/quantal/lxd-profile-subordinate/hooks/start" executable in charm
Added charm "local:bionic/lxd-profile-subordinate-1" to the model.
-bash: wait: pid 30 is not a child of this shell
Added charm "local:bionic/lxd-profile-1" to the model.
ERROR unable to set charm profile: upgrade charm profile already in process for machine 0, profile from "local:bionic/lxd-profile-subordinate-1"
-bash: wait: pid 30 is not a child of this shell
Added charm "local:bionic/ntp-1" to the model.
-bash: wait: pid 30 is not a child of this shell
Added charm "local:bionic/ubuntu-1" to the model.
ERROR unable to set charm profile: upgrade charm profile already in process for machine 0, profile from ""

Upgrading the ubuntu and lxd charms separately afterwards, with a minute in between each, succeeds.
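
A minimal shell sketch of that workaround (the 60-second pause is an assumption; anything long enough for the previous profile change to settle should do):

  # upgrade one charm at a time, pausing so the lxd profile change from the
  # previous upgrade can finish before starting the next one
  juju upgrade-charm lxd-profile-subordinate --path ./testcharms/charm-repo/quantal/lxd-profile-subordinate
  sleep 60
  juju upgrade-charm lxd-profile --path ./testcharms/charm-repo/quantal/lxd-profile
  sleep 60
  juju upgrade-charm ntp --path ~/charms/ntp
  sleep 60
  juju upgrade-charm ubuntu --path ~/charms/ubuntu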

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

To fix, we need a way to have multiple instanceCharmProfileData docs per machine. This is fine for the work in the Uniter, which already watches based on machine & specific applications.

The challenging piece is the provisioner, which actually applies lxd profile changes on a "machine". Every time a watcher (watchCharmProfiles) reports that a profile needs updating, the provisioner will need to find the appropriate data while knowing only the machine number.
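
A purely illustrative mongo-shell sketch of that idea, not the actual Juju schema (the "machineid" and "appname" field names here are hypothetical):

  // one hypothetical doc per pending profile change, keyed by machine plus
  // application, so the haproxy and nrpe upgrades on one machine no longer
  // collide; the provisioner, knowing only the machine number, would fetch
  // every pending change for that machine with a query of this shape
  db.instanceCharmProfileData.find({"machineid": "0"})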

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Heather Lanigan (hmlanigan)
Revision history for this message
Colin Watson (cjwatson) wrote :

I ran into this today on PS4.5, except that I upgraded the nrpe subordinate first (actually twice: "juju upgrade-charm nrpe --switch cs:nrpe-52", followed shortly afterwards by "juju upgrade-charm nrpe --switch /srv/mojo/mojo-stg-ols-snap-build/xenial/staging/charms/xenial/nrpe-external-master" when I realised that I'd meant to use Mojo's local repository), and then tried to upgrade a primary application charm. Fifteen minutes after the nrpe upgrade-charm finished, it's still stuck:

  [prodstack-is:admin/stg-ols-snap-build] stg-ols-snap-build@wendigo:~$ juju upgrade-charm apache --path /srv/mojo/mojo-stg-ols-snap-build/xenial/staging/charms/xenial/apache2
  Added charm "local:xenial/apache2-23" to the model.
  ERROR unable to set charm profile: upgrade charm profile already in process for machine 6, profile from ""

Is this likely to be permanent? Is there any way I can unstick it, or look at the state of the upgrade charm profile without actually running juju upgrade-charm?

(It's also worth noting that these aren't LXD machines in the first place, but rather nova-managed VMs ...)

Revision history for this message
Richard Harding (rharding) wrote :

@hml, from a stuck state like the one Colin is in, can we clear the upgrade docs cleanly and get back to a "reset" point in order to go back through the charm upgrades one by one?

Revision history for this message
Colin Watson (cjwatson) wrote :

We indeed managed to work around this by deleting the relevant instanceCharmProfileData docs and then continuing with upgrade-charm.

Revision history for this message
Xav Paice (xavpaice) wrote :

Is there any documented way to safely perform that workaround? It sounds like deleting instanceCharmProfileData docs is a mongo action; this is affecting production sites and that's risky.

tags: added: canonical-bootstack
Revision history for this message
Xav Paice (xavpaice) wrote :

Added field-critical; this is blocking a production customer upgrade.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

The instanceCharmProfileData docs are meant to be transient. Having them exist after a charm upgrade is complete is a bug.

IF no charm upgrades are in progress, it is safe to delete any instanceCharmProfileData docs from the db. Depending on needs, the delete operation can be targeted at specific models, machines, and/or units.

db.instanceCharmProfileData.deleteMany({}) will remove any existing docs of this type.
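
For reference, the existing docs can be inspected first, without deleting anything, using standard mongo shell (no Juju-specific fields assumed):

db.instanceCharmProfileData.find().pretty()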

There are a number of issues to resolve for this bug; they are currently targeted at 2.5.2.

Tim Penhey (thumper)
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Haw Loeung (hloeung) wrote :

Or to limit the deletion to a specific model:

| db.instanceCharmProfileData.deleteMany({"model-uuid":"..."})
