Juju 2.8-beta1.3273 cannot deploy/remove applications

Bug #1865439 reported by Barry Price
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

After accidentally upgrading to 2.8-beta1.3273 in LP:1865416, I found that I was unable to deploy or remove applications in an active k8s model.

This also blocked destroy-model, even with the --force and --no-wait flags.

Manually destroying the pods (and even the entire namespace) on the k8s side made no difference.

I did try forcing a restart of the controller pod, but this had no apparent effect.

The only solution I found was to manually destroy the controller pod on the k8s side, after which I could successfully 'juju destroy-controller --destroy-all-models'.

Barry Price (barryprice)
description: updated
Ian Booth (wallyworld) wrote :

Do we have any debug-logs?
This is not an issue we tend to see, so logs would help us identify and understand what's happening.

tags: added: k8s
Barry Price (barryprice) wrote :

Unfortunately it seems to stop logging at the time of the upgrade.

The controller pod is then destroyed and replaced, and nothing shows up in logs (even at --level DEBUG) for the controller after that. Here's the full log from the controller (upgrade started at 11:58:25):

https://paste.ubuntu.com/p/DzNBddrfB9/

In other model logs, post-upgrade we see only ERROR lines and all controller functionality is lost (again, with the upgrade performed at 11:58:25):

https://paste.ubuntu.com/p/FhFp95jCRr/

There's no /var/log/juju directory on the controller pod, so I'm not sure where to look for on-disk logs. The 2.7 controller pod is destroyed very quickly upon upgrade, so any useful info stored locally there is going to be difficult to retrieve.

Ian Booth (wallyworld) wrote :

There is a /var/log/juju with logs. For example, assuming the controller is called "foo":

$ kubectl -n controller-foo exec -ti controller-0 -c api-server bash
$ ls /var/log/juju/
audit.log lease.log machine-lock.log

The controller pod has 2 containers - one for mongo and one for the controller agent. You need to specify which container you want to exec into (either mongodb or api-server).
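
As a rough sketch of the container selection described above, the helper below builds the kubectl command for either container. The function name list_controller_logs and the DRY_RUN switch are hypothetical, introduced only for illustration; the namespace "controller-foo", the pod "controller-0" and the container names are taken from this comment. By default it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Juju or kubectl): print, or optionally run,
# the kubectl command that lists /var/log/juju in one container of the
# controller pod.
# Assumptions: a controller named "foo" lives in namespace "controller-foo",
# the pod is "controller-0", and the containers are "mongodb" and "api-server".
list_controller_logs() {
    controller="$1"
    container="${2:-api-server}"   # default to the controller agent container
    cmd="kubectl -n controller-${controller} exec controller-0 -c ${container} -- ls -la /var/log/juju/"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default): show the command without touching the cluster.
        echo "$cmd"
    else
        $cmd
    fi
}
```

Calling `list_controller_logs foo mongodb` prints the command for the mongo container; setting DRY_RUN=0 first would execute it against the cluster.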

Given there's a number of possible causes here, and the beta is evolving daily, and we haven't had any other reports of similar issues, I'll go ahead and mark this as Incomplete. But please re-open with any extra info if it happens again.

Changed in juju:
status: New → Incomplete
Barry Price (barryprice) wrote :

Ah, I was obviously on the wrong container, apologies.

From 2.7.4 a "juju-upgrade" on the controller now puts me onto 2.8-beta1.3350, but I can still reproduce.

I cannot upgrade models once on this version (LP:1867224), so this is happening on a 2.7.4 model that existed before the controller upgrade. I can repeat the experiment with a fresh 2.8 beta model instead if that's any use.

A deploy command appears to execute without error, but watching 'juju status' shows no progress from:

Model   Controller       Cloud/Region     Version  SLA          Timestamp
wptest  myk8s-localhost  myk8s/localhost  2.7.4    unsupported  22:35:28+07:00

App        Version  Status   Scale  Charm          Store  Rev  OS          Address  Notes
wordpress           waiting    0/1  wordpress-k8s  local    0  kubernetes           agent initializing

Unit         Workload  Agent       Address  Ports  Message
wordpress/0  waiting   allocating                  agent initializing

After a while I gave up and attempted to destroy-model:

$ juju destroy-model wptest -y
Destroying model
Waiting for model to be removed, 1 application(s)...............................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
..............................................................ERROR timeout after 30m0s timeout
$

Unfortunately there's not much in the way of logs on the beta controller container:

root@controller-0:/var/log/juju# ls -lart
total 304
drwxr-xr-x 1 root root 4096 Mar 19 14:30 ..
-rw-r----- 1 syslog adm 0 Mar 19 14:30 machine-lock.log
drw-r--r-- 2 root root 4096 Mar 19 14:30 .
-rw-r----- 1 syslog adm 298128 Mar 19 15:04 audit.log
root@controller-0:/var/log/juju#

Here's a paste of audit.log:

https://paste.ubuntu.com/p/vvBzBGqGKT/

Changed in juju:
status: Incomplete → New
Barry Price (barryprice) wrote :

Sorry, previous paste was a double-paste.

Here's a single one:

https://paste.ubuntu.com/p/gf6MkMM5ny/

Ian Booth (wallyworld) wrote :

When this happens, what we really need is the output of juju dump-model (both of the model being destroyed and the controller model, with any secrets redacted).
That will show what is stuck in the dying state and hence holding up graceful removal of the model.
The controller logs around the time of destroy would also help.

Can we get an example of that info to enable us to work on diagnosing what is happening? Does --force work? Using --force bypasses Juju's expectation that stuff will shut down cleanly and it eventually just removes model entities regardless.

There are usually 2 root causes - storage not being detached / volumes removed, or units not leaving scope because relation departed/broken hooks are not completed successfully.
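
A minimal sketch of gathering the information requested above, assuming the controller model is named "controller" and using the "wptest" model from this report as the example. The function name diagnostic_commands and the output file names are hypothetical. It only prints the commands, so nothing runs until they are reviewed; any secrets in the dump-model output still need to be redacted by hand before sharing.

```shell
#!/bin/sh
# Hypothetical helper: print the diagnostic commands for a model that is
# stuck being destroyed. Nothing is executed here.
# Assumptions: the controller model is named "controller"; output file
# names are illustrative only.
diagnostic_commands() {
    model="$1"
    # State of the stuck model and of the controller model.
    echo "juju dump-model -m ${model} > ${model}-dump.yaml"
    echo "juju dump-model -m controller > controller-dump.yaml"
    # Controller logs, replayed from the start without following new lines.
    echo "juju debug-log -m controller --replay --no-tail --level DEBUG > controller-debug.log"
    # Forced removal, to be tried only after the information above is captured.
    echo "juju destroy-model ${model} --force --no-wait -y"
}
```

Running `diagnostic_commands wptest` lists the four commands in the order they would be used, with the forced destroy last so the model state is captured first.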

tags: added: destroy-model
Pen Gale (pengale)
Changed in juju:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired