remove-application regression on 2.1 -> 2.2 upgrade with subordinates

Bug #1699050 reported by William Grant
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Christian Muirhead

Bug Description

Removing an application from a model upgraded from 2.1.1 to 2.2.0 is sometimes impossible. remove-application succeeds, but the application never goes away.

Steps to reproduce:

  # Start with a 2.1.1 controller, deploy two principals and two subordinates.
  juju add-model test
  juju deploy cs:ubuntu # machine 0
  juju deploy cs:ubuntu ubuntu2 # machine 1
  juju deploy cs:canonical-livepatch
  juju deploy cs:nrpe
  juju add-unit -n1 ubuntu # machine 2
  juju add-unit -n1 ubuntu2 # machine 3
  # Wait for stability.

  # Relate each subordinate to each principal and the other subordinate.
  juju add-relation canonical-livepatch nrpe
  juju add-relation canonical-livepatch ubuntu
  juju add-relation canonical-livepatch ubuntu2
  juju add-relation nrpe ubuntu
  juju add-relation nrpe ubuntu2
  # Wait for stability.

  # Upgrade the controller and app models to 2.2.0.
  juju upgrade-juju -m controller --agent-version=2.2.0
  juju upgrade-juju

  # Remove both machines from the ubuntu2 service (work around bug #1686696).
  juju remove-machine --force 1
  juju remove-machine --force 3
  # Wait until both machines are gone and the ubuntu2 app is waiting with 0 units.

  # Try to remove the ubuntu2 application.
  juju remove-application ubuntu2
  # Argh, why is the app still there? It is empty and dying.
  juju remove-application ubuntu2
  # Hm this isn't good.
  juju remove-application ubuntu2
  # Argh. My model is plagued by a waiting app forever.

The ubuntu2 application and its subordinate relations are still present in the DB (but, at least on our affected prod model, the principal's relations to other principals are gone). Interestingly, the unitcounts on the remaining relations for the dying application are still more than a dozen, even though there were only ever two units of the principal.

William Grant (wgrant) wrote :

https://pastebin.canonical.com/191371/ lists some scenarios I've tried. I'm not sure test3 was necessarily a good test.

William Grant (wgrant)
description: updated
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.2.1
importance: Undecided → High
status: New → Triaged
assignee: nobody → Christian Muirhead (2-xtian)
Anastasia (anastasia-macmood) wrote :

@William Grant,

Thank you for the repro steps and an awesome bug report!

I *think* the problem in this case is twofold. I'll dig deeper into the code tomorrow. Essentially, the relations are not removed (although their life is 'Dying'), which means the relation count on the application has not been reduced, yet the application is also marked as 'Dying'.

To work around this and remove the offending application, I had to do the following in the DB:

1. remove the application's relations from the relations collection;
2. reduce the relation count on the application (it should be 0);
3. change the application's life back to "Alive" (0).

Then from CLI, you should be able to say 'juju remove-application ...'. This will actually remove the application :D
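
The failure mode and the three-step workaround above can be pictured with a small in-memory model. This is a simplified sketch, not Juju's actual cleanup code: the document shapes, the relation keys, the `relationcount` field name, the value used for 'Dying' and the stale count of 14 are all assumptions made for illustration. Only "Alive" being 0 comes from the steps above.

```javascript
// Simplified, hypothetical model of the stuck state; not Juju's real code.
const ALIVE = 0; // "Alive" is 0, per the workaround steps
const DYING = 1; // assumed value, for illustration only

// In this model a Dying relation is removed only once no units remain in
// scope, and a Dying application is removed only once it has no relations.
function cleanup(app, relations) {
  const remaining = relations.filter(
    (rel) => !(rel.life === DYING && rel.unitcount === 0)
  );
  app.relationcount = remaining.length;
  const appRemoved = app.life === DYING && app.relationcount === 0;
  return { appRemoved, remaining };
}

// Stale unit counts keep the Dying relations, and so the application, around.
const app = { name: "ubuntu2", life: DYING, relationcount: 2 };
const relations = [
  { key: "nrpe ubuntu2", life: DYING, unitcount: 14 },
  { key: "canonical-livepatch ubuntu2", life: DYING, unitcount: 14 },
];
const stuck = cleanup(app, relations);

// Workaround: drop the relations, zero the relation count and set the
// application's life back to Alive.
function applyWorkaround(target) {
  target.relationcount = 0;
  target.life = ALIVE;
  return []; // no relations left
}
const relationsAfterFix = applyWorkaround(app);

// A fresh `juju remove-application` marks the application Dying again.
app.life = DYING;
const fixed = cleanup(app, relationsAfterFix);

console.log(stuck.appRemoved, fixed.appRemoved);
```

In this model the first cleanup pass leaves the application in place because neither relation ever reaches a unit count of 0, while the pass after the workaround removes it.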

Changed in juju:
assignee: Christian Muirhead (2-xtian) → Anastasia (anastasia-macmood)
Changed in juju:
status: Triaged → In Progress
Anastasia (anastasia-macmood) wrote :

I'll try adjusting the unit count on relations instead of just removing them.

However, the second part of the problem remains: since we have not yet explicitly removed the application or its relations, life should not be 'Dying' for either.

Anastasia (anastasia-macmood) wrote :

@William Grant,

Just verified - you are right :)

To cleanly work around the problem, you'd need to reduce the unit count on all of the application's relations to reflect reality. In the scenario above, that means 0.

This should allow you to 'juju remove-application...' with no problem.

(If you have previously tried to remove this application, you will also need to adjust the application and its relations' life back to 'Alive' before 'remove-application' can succeed).

I'll work on a code fix. Happy to provide queries for the above if needed...

Anastasia (anastasia-macmood) wrote :

This is definitely a failure in upgraded controllers only, as per the scenario.

In order to have a proper code fix, we need an upgrade step that will set unit count on relations to reflect reality.
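
One way to picture such an upgrade step is to recount, for each relation, the units that are really in scope and overwrite the stored count. The sketch below works on hypothetical in-memory documents: `relationKey`, `unit` and `departing` are illustrative names rather than confirmed schema, the relation keys and the stale count of 14 are made up, and this is not the upgrade step that shipped.

```javascript
// Hypothetical sketch: recompute each relation's unitcount from scope records.
function recountUnits(relations, scopes) {
  const counts = new Map();
  for (const scope of scopes) {
    // Assumption: units that are leaving scope are not counted.
    if (scope.departing) continue;
    counts.set(scope.relationKey, (counts.get(scope.relationKey) || 0) + 1);
  }
  // Return corrected copies; relations with no scope records get 0.
  return relations.map((rel) => ({
    ...rel,
    unitcount: counts.get(rel.key) || 0,
  }));
}

const relationsBefore = [
  { key: "nrpe ubuntu", life: 0, unitcount: 14 },
  { key: "nrpe ubuntu2", life: 0, unitcount: 14 },
];
const scopeRecords = [
  { relationKey: "nrpe ubuntu", unit: "ubuntu/0", departing: false },
  { relationKey: "nrpe ubuntu", unit: "nrpe/0", departing: false },
];

const relationsAfter = recountUnits(relationsBefore, scopeRecords);
console.log(relationsAfter.map((r) => `${r.key}=${r.unitcount}`).join(", "));
```

Deriving the count from the scope records, rather than decrementing the stored value, means the result does not depend on how the stored count became wrong.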

Changed in juju:
assignee: Anastasia (anastasia-macmood) → Christian Muirhead (2-xtian)
William Grant (wgrant) wrote :

I confirmed there were no relevant relationscopes on my production controller, shut down the machine-0 agent, and ran:

  db.applications.update({_id: /.*snapdevicegw-(r5d9396c|r6882fa0|rc652aa1).*/}, {"$set": {"life": 0}}, {multi: true})
  db.relations.update({_id: /.*snapdevicegw-(r5d9396c|r6882fa0|rc652aa1).*/}, {"$set": {"life": 0, "unitcount": 0}}, {multi: true})

I was then able to remove the applications successfully.

Thanks for the workaround.

Christian Muirhead (2-xtian) wrote :
Christian Muirhead (2-xtian) wrote :

That PR had a bug, fixed here: https://github.com/juju/juju/pull/7541

Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
status: Fix Committed → In Progress
Christian Muirhead (2-xtian) wrote :

The upgrade step works, and everything behaves correctly when upgrading from 2.1 -> 2.2 -> 2.2.1 (including models at each step). Unfortunately, going straight from 2.1 -> 2.2.1 hits a problem: after the controller has been upgraded (and the relation unit counts corrected in the DB), the not-yet-upgraded uniters connect back to the API and add the invalid units back into the DB before we have a chance to upgrade them.

This PR (on 2.2) fixes the API server to discard the bad EnterScope requests: https://github.com/juju/juju/pull/7547

With this change, upgrading from 2.1 -> 2.2.1 directly works and the relation unitcounts are right.
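
This report does not spell out how the API server recognises a bad EnterScope request. Purely as a sketch, assume a request is bad when a subordinate unit tries to enter scope on a relation with a principal application that its own principal unit does not belong to. The rule, the function names and the document shapes below are all assumptions, and this is not the code from the PR.

```javascript
// Hypothetical validity check; the rule itself is an assumption.
function appOf(unitName) {
  return unitName.split("/")[0]; // "ubuntu2/0" -> "ubuntu2"
}

// unit: { name, principal }, where principal is null for principal units.
// relation: { endpoints: [appName, appName] }
// subordinateApps: Set of application names deployed as subordinates.
function shouldEnterScope(unit, relation, subordinateApps) {
  const app = appOf(unit.name);
  // The unit's own application must be one side of the relation.
  if (!relation.endpoints.includes(app)) return false;
  // Principal units are always accepted in this model.
  if (unit.principal === null) return true;
  const other = relation.endpoints.find((name) => name !== app);
  // Peer relations and subordinate-to-subordinate relations are accepted.
  if (other === undefined || subordinateApps.has(other)) return true;
  // Otherwise the other side must be the application of this unit's principal.
  return other === appOf(unit.principal);
}

const subordinates = new Set(["nrpe", "canonical-livepatch"]);
const nrpeOnUbuntu = { name: "nrpe/0", principal: "ubuntu/0" };

const okOwnPrincipal = shouldEnterScope(
  nrpeOnUbuntu, { endpoints: ["nrpe", "ubuntu"] }, subordinates);
const okOtherPrincipal = shouldEnterScope(
  nrpeOnUbuntu, { endpoints: ["nrpe", "ubuntu2"] }, subordinates);
const okSubordinatePair = shouldEnterScope(
  nrpeOnUbuntu, { endpoints: ["canonical-livepatch", "nrpe"] }, subordinates);

console.log(okOwnPrincipal, okOtherPrincipal, okSubordinatePair);
```

Under this assumed rule, a request from nrpe/0 (deployed on ubuntu/0) to enter the nrpe-to-ubuntu2 relation would be discarded, so it could not inflate that relation's unit count.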

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released