stuck k8s workload unit following upgrade-charm with new image

Bug #1878329 reported by Paul Collins
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
juju
High
Yang Kelvin Liu

Bug Description

Following a "juju upgrade-charm --path /local/mycharm --resource mycharm_image=image_url mycharm" to test a newer image for a charm I'm working on, I've ended up with a stuck unit that is blocking subsequent charm upgrades, although all the changes on the k8s side have been executed correctly.

[agnew(~)] snap info juju | grep ^installed:
installed: 2.8-rc1 (11992) 134MB classic
[agnew(~)] juju version
2.8-rc1-focal-amd64
[agnew(~)] juju debug-log --no-tail
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "migration-inactive-flag" manifold worker started at 2020-05-13 03:18:16.620862773 +0000 UTC
application-mattermost: 15:18:16 INFO juju.worker.migrationminion migration phase is now: NONE
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "charm-dir" manifold worker started at 2020-05-13 03:18:16.631889653 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.logger initial log config: "<root>=DEBUG"
application-mattermost: 15:18:16 INFO juju.worker.logger logger worker started
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "logging-config-updater" manifold worker started at 2020-05-13 03:18:16.632448582 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "api-address-updater" manifold worker started at 2020-05-13 03:18:16.633285603 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.dependency "proxy-config-updater" manifold worker started at 2020-05-13 03:18:16.633441178 +0000 UTC
application-mattermost: 15:18:16 DEBUG juju.worker.logger reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING"
application-mattermost: 15:32:59 ERROR juju.worker.uniter resolver loop error: executing operation "remote init": caas-unit-init for unit "mattermost/0" failed: ERROR failed to remove unit tools dir /var/lib/juju/tools/unit-mattermost-0: unlinkat /var/lib/juju/tools/unit-mattermost-0/goal-state: permission denied
ERROR failed to remove unit tools dir /var/lib/juju/tools/unit-mattermost-0: unlinkat /var/lib/juju/tools/unit-mattermost-0/goal-state: permission denied
: command terminated with exit code 1
[agnew(~)] juju status --format yaml
model:
  name: mattermost
  type: caas
  controller: beta
  cloud: k8s
  region: localhost
  version: 2.8-rc1
  model-status:
    current: available
    since: 13 May 2020 15:17:30+12:00
  sla: unsupported
machines: {}
applications:
  mattermost:
    charm: local:kubernetes/mattermost-3
    series: kubernetes
    os: kubernetes
    charm-origin: local
    charm-name: mattermost
    charm-rev: 3
    charm-version: 89783cd-dirty
    scale: 1
    provider-id: 4fb6cece-d45d-43a0-907e-f7a08d71bdae
    address: 10.152.183.25
    exposed: false
    application-status:
      current: active
      since: 13 May 2020 15:46:04+12:00
    units:
      mattermost/0:
        workload-status:
          current: terminated
          message: unit stopped by the cloud
          since: 13 May 2020 15:33:04+12:00
        juju-status:
          current: idle
          since: 13 May 2020 15:33:04+12:00
        upgrading-from: local:kubernetes/mattermost-1
        open-ports:
        - 8000/TCP
        address: 10.1.1.24
        provider-id: fef2495d-2654-421e-8d6d-9909d7b62feb
      mattermost/1:
        workload-status:
          current: waiting
          message: Not a leader
          since: 13 May 2020 15:33:02+12:00
        juju-status:
          current: idle
          since: 13 May 2020 15:38:51+12:00
        leader: true
        upgrading-from: local:kubernetes/mattermost-1
        open-ports:
        - 8000/TCP
        address: 10.1.1.25
        provider-id: 4fc05039-ee2c-4a05-9ad4-546fb0425c97
storage: {}
controller:
  timestamp: 15:48:51+12:00
[agnew(~)] kubectl get all -n mattermost
NAME READY STATUS RESTARTS AGE
pod/mattermost-6bfd44b4f6-xnpml 1/1 Running 4 15m
pod/mattermost-operator-0 1/1 Running 0 30m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mattermost ClusterIP 10.152.183.25 <none> 8000/TCP 30m
service/mattermost-operator ClusterIP 10.152.183.252 <none> 30666/TCP 30m

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/mattermost 1/1 1 1 30m

NAME DESIRED CURRENT READY AGE
replicaset.apps/mattermost-6bfd44b4f6 1 1 1 15m

NAME READY AGE
statefulset.apps/mattermost-operator 1/1 30m
[agnew(~)] _

Tags: k8s Edit Tag help
Paul Collins (pjdc)
summary: - stuck k8s workload unit following upgrade-charm with new imagea
+ stuck k8s workload unit following upgrade-charm with new image
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.8-rc2
importance: Undecided → High
status: New → Triaged
Revision history for this message
Ian Booth (wallyworld) wrote :

The root cause here is the issue identified in bug 1877935
Juju needs to take account the pod's phase - whether it is still running init containers - before deciding how to interact with it.

tags: added: k8s
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Paul Collins (pjdc) wrote :
Download full text (6.7 KiB)

This seems to still happen with 2.8-rc2:

[agnew(charm-mattermost-k8s)] juju deploy ./ --resource mattermost_image='localhost:32000/mattermost:v4' mattermost
Deploying charm "local:kubernetes/mattermost-0".
[agnew(charm-mattermost-k8s)] juju status
Model Controller Cloud/Region Version SLA Timestamp
betabeta-mattermost betabeta k8s/localhost 2.8-rc2 unsupported 13:23:05+12:00

App Version Status Scale Charm Store Rev OS Address Notes
mattermost waiting 0/1 mattermost local 0 kubernetes agent initializing

Unit Workload Agent Address Ports Message
mattermost/0 waiting allocating agent initializing

[agnew(charm-mattermost-k8s)] juju status
Model Controller Cloud/Region Version SLA Timestamp
betabeta-mattermost betabeta k8s/localhost 2.8-rc2 unsupported 13:24:37+12:00

App Version Status Scale Charm Store Rev OS Address Notes
mattermost active 1 mattermost local 0 kubernetes 10.152.183.176

Unit Workload Agent Address Ports Message
mattermost/0* active idle 10.1.1.86 8000/TCP

[agnew(charm-mattermost-k8s)] kubectl get all -n betabeta-mattermost
NAME READY STATUS RESTARTS AGE
pod/mattermost-99bb9c774-7g62w 1/1 Running 0 88s
pod/mattermost-operator-0 1/1 Running 0 96s
pod/modeloperator-687fdfbcf8-jqwv5 1/1 Running 0 111s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mattermost ClusterIP 10.152.183.176 <none> 8000/TCP 88s
service/mattermost-operator ClusterIP 10.152.183.225 <none> 30666/TCP 96s
service/modeloperator ClusterIP 10.152.183.118 <none> 17071/TCP 112s

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/mattermost 1/1 1 1 88s
deployment.apps/modeloperator 1/1 1 1 112s

NAME DESIRED CURRENT READY AGE
replicaset.apps/mattermost-99bb9c774 1 1 1 88s
replicaset.apps/modeloperator-687fdfbcf8 1 1 1 112s

NAME READY AGE
statefulset.apps/mattermost-operator 1/1 96s
[agnew(charm-mattermost-k8s)] juju upgrade-charm --path ./ --resource mattermost_image=localhost:32000/mattermost:v5 mattermost
Added charm "local:kubernetes/mattermost-1" to the model.
[agnew(charm-mattermost-k8s)] juju status
Model Controller Cloud/Region Version SLA Timestamp
betabeta-mattermost betabeta k8s/localhost 2.8-rc2 unsupported 13:25:46+12:00

App Version Status Scale Charm Store Rev OS Address Notes
mattermost active 1 mattermost local 1 kubernetes 10.152.183.176

Unit Workload Agent Address Ports Message
mattermost/0 terminated idle 10.1.1.86 8000/TCP unit stopped by th...

Read more...

Changed in juju:
status: Fix Released → New
Revision history for this message
Ian Booth (wallyworld) wrote :

There's a different root cause issue here - a new bug 1879598 has been opened to track it.

This fix was for the case where there was a race execing into the workload pod as its init containers were running.

The new issue appears related to pod termination when a rolling update is done by the k8s deployment controller.

It's been hard to reproduce - any chance of getting access to the mattermost charm so we can try that to see if we can get things to misbehave?

Changed in juju:
status: New → Fix Released
Revision history for this message
Paul Collins (pjdc) wrote :

I'll go update the other bug, but I'm not seeing the difference between the reproductions I supplied here. What's different?

Revision history for this message
Ian Booth (wallyworld) wrote :

The difference is the underlying juju issue that needs to be fixed.

The symptoms appear similar in this case but because the underlying root cause is different there may also be other dissimilar manifestations of the issue. So best to raise a new bug to track the work to implement the changes to a different piece of code.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers