error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)

Bug #1987331 reported by Chris Johnston
This bug affects 3 people
Affects                         Status      Importance  Assigned to  Milestone
Kubernetes Control Plane Charm  New         Undecided   Unassigned
Kubernetes Worker Charm         Incomplete  Undecided   Unassigned

Bug Description

When new revisions of the kubectl snap are released, we see this error in the k-c-p logs, and the unit goes into an error state:

2022-06-19 23:57:18 INFO unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: status-set: maintenance: Joining snap cohort.
2022-06-19 23:57:18 WARNING unit.kubernetes-master/5.coordinator-relation-changed logger.go:60 error: cannot refresh "kubectl": snap "kubectl" has running apps (kubectl)
2022-06-19 23:57:18 ERROR unit.kubernetes-master/5.juju-log server.go:327 coordinator:11: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-kubernetes-master-5/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-kubernetes-master-5/charm/reactive/kubernetes_master.py", line 469, in join_or_update_cohorts
    snap.join_cohort_snapshot(snapname, cohort_key)
  File "lib/charms/layer/snap.py", line 455, in join_cohort_snapshot
    subprocess.check_output(["snap", "refresh", snapname, "--cohort", cohort_key])
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['snap', 'refresh', 'kubectl', '--cohort', 'REDACTED']' returned non-zero exit status 1.

I believe this is similar to LP#1978005 [1] and LP#1975714 [2]. A fix for those is currently in snapd 2.57, which is in the candidate channel. Once that is released, as I understand it, the kubectl snap would need to be changed to set `refresh-mode: ignore-running` in order for refreshes to be processed [3][4].

[1] https://bugs.launchpad.net/snapd/+bug/1978005
[2] https://bugs.launchpad.net/snapd/+bug/1975714
[3] https://github.com/snapcore/snapd/pull/11855/files
[4] https://chat.canonical.com/canonical/pl/47m58wxjt3bzfehpedogg7dwio
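
For context, the failing call in the traceback is the charm layer shelling out to `snap refresh`; snapd refuses to refresh a snap whose apps are still running, so a single long-lived kubectl process is enough to fail the hook. A minimal sketch of that call path, paraphrasing lib/charms/layer/snap.py from the traceback (everything beyond the subprocess call itself is illustrative):

import subprocess

def join_cohort_snapshot(snapname, cohort_key):
    # Shells out to snapd. If any app from the snap (e.g. kubectl) is
    # running, snapd exits non-zero with "has running apps" and
    # check_output raises CalledProcessError, erroring the hook.
    subprocess.check_output(["snap", "refresh", snapname, "--cohort", cohort_key])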

George Kraft (cynerva) wrote :

I'm able to repro by running a snap refresh to a different revision while kubectl runs in the background.

The documentation around `refresh-mode: ignore-running` suggests that it's for daemons [1], but I'll give it a try and see if it works for non-daemon apps too.

[1]: https://snapcraft.io/docs/services-and-daemons

Changed in charm-kubernetes-master:
status: New → Confirmed
George Kraft (cynerva) wrote :

Hmm, nope, can't apply the refresh-mode option to a non-daemon command:

$ snapcraft
Issues while validating snapcraft.yaml: The 'apps/kubectl' property does not match the required schema: 'daemon' is a dependency of 'refresh-mode'

I'll see if I can find another way.

George Kraft (cynerva) wrote :

I haven't had any luck finding a way to fix this in snapcraft.yaml and I haven't received a response from the snap team.

The snap refresh command has an undocumented --ignore-running option that seems to do the trick, so we could at least fix this in the charms by using that.
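
For illustration, a sketch of how that charm-side fix could look, assuming the same call path as the traceback above (the actual patch may differ):

import subprocess

def join_cohort_snapshot(snapname, cohort_key):
    # --ignore-running asks snapd to refresh even while apps from the
    # snap (e.g. kubectl) are running, instead of failing the refresh.
    subprocess.check_output(
        ["snap", "refresh", snapname, "--ignore-running", "--cohort", cohort_key]
    )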

Changed in charm-kubernetes-master:
status: Confirmed → Triaged
Changed in charm-kubernetes-worker:
status: New → Triaged
Changed in charm-kubernetes-master:
importance: Undecided → High
Changed in charm-kubernetes-worker:
importance: Undecided → High
George Kraft (cynerva) wrote :

I'm targeting this for 1.29 initially. We are in the middle of a complete rewrite of the kubernetes-control-plane and kubernetes-worker charms, so a backport to 1.28 will be nontrivial. If you do need it prior to 1.29, let me know.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
milestone: none → 1.29
Changed in charm-kubernetes-worker:
milestone: none → 1.29
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-worker:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: Triaged → In Progress
Changed in charm-kubernetes-worker:
status: Triaged → In Progress
George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
tags: added: backport-needed
George Kraft (cynerva) wrote :

I've reopened this. While the existing PR should help, recent evidence suggests that kubectl PIDs are being leaked by the auth-webhook when it fetches token secrets [1]. I strongly suspect that the run function's timeout mechanism [2] is leaving stale processes behind.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L347-L349
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/9ca52889800937509bd0065d285c6646e04cb745/templates/cdk.master.auth-webhook.py#L52-L61
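
To illustrate the suspected failure mode: cancelling a timed-out await in asyncio does not terminate a spawned subprocess. A minimal sketch of how such a leak can happen, assuming the webhook's run() wraps kubectl in an asyncio timeout (the shape below is hypothetical, not the webhook's actual code):

import asyncio

async def run(cmd, timeout=30):
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
        return stdout
    except asyncio.TimeoutError:
        # wait_for cancels the communicate() task, but the kubectl child
        # keeps running. Without an explicit proc.kill() and await
        # proc.wait() here, every timed-out call leaks a process, which
        # later blocks "snap refresh kubectl".
        raise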

Changed in charm-kubernetes-master:
status: Fix Committed → Triaged
Changed in charm-kubernetes-worker:
status: Fix Committed → Triaged
Kiran Kumar Ch (kiranch97) wrote :

Hi Everyone,

We are seeing a similar issue:

2023-12-28 15:46:15 WARNING unit.kubernetes-master/2.kube-masters-relation-changed logger.go:60 pids:
2023-12-28 15:46:15 WARNING unit.kubernetes-master/2.kube-masters-relation-changed logger.go:60 36212,42968,43163,200955,201120,201764,202935,310701,313305,313375,313421,321260,323455,324106,327348,327350,327416,335179,335181,341028,343275,446946,446992,447033,449256,452502,453211,462071,464088,498248,498384,500135,500196,500851,501400,502283,502455,505362,505582,506004,517629,526262,529364,703605,745127,745776,745910,770592,771348,774977,777352,777414,777580,778537,780986,781027,781088,781099,783397,791875,793824,796301,796594,796607,796717,796824,796877,796879,799912,852716,916983,1043442,1093295,1093579,1161711,1166755,1171172,1172291,1233790,1234180,1234244,1311667,1396925,1413481,1413511,1427319,1427662,1427811,1439295,1455578,1456066,1674351,1674391,1730648,1812663,1812770,1812816,1812848,1812860,1842595,1845856,1845858,1845935,1845973,1846014,1846040,1846077,1846092,1846158,1848678,1848728,1848776,1872062,1873130,1873424,1873914,2062722,2078069,2078158,2078726,2079978,2080486,2080600,2080668,2121030,2162097,2643785,2697736,2697771,2697772,2697783,2697791,2697792,2697802,2697811,2697974,2697997,2698030,2698076,2698080,3091793,3091866,3091906,3099778,3191321,3191815,3238242,3405872,3405886,3406171,3406185,3408896,3409065,3410803,3411416,3412466,3415117,3418725,3418969,3419109,3419203,3419278,3419776,3422940,3422942,3423024,3423427,3423874,3424087,3424705,3608211,3608248,3608283,3608566,3662921,4072151,4147269,4149375,4193617
2023-12-28 15:46:15 ERROR unit.kubernetes-master/2.juju-log server.go:316 kube-masters:8: Hook error:

$ cat ps | grep 327350
root 327350 0.0 0.0 762272 43024 ? - Dec09 3:05 /snap/kubectl/3117/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json
birru:~/Documents/Canonical/Cases/375903/sosreport-hpc-k8s-vm-mcp-2-00375903-2023-12-28-qvvgdyo$ cat ps | grep 36212
root 36212 0.0 0.0 761760 42212 ? - Dec07 3:25 /snap/kubectl/3117/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json
birru:~/Documents/Canonical/Cases/375903/sosreport-hpc-k8s-vm-mcp-2-00375903-2023-12-28-qvvgdyo$ cat ps | grep 42968
root 42968 0.0 0.0 761760 12932 ? - Nov29 4:50 /snap/kubectl/3117/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json
birru:~/Documents/Canonical/Cases/375903/sosreport-hpc-k8s-vm-mcp-2-00375903-2023-12-28-qvvgdyo$ cat ps | grep 327348
root 327348 0.0 0.0 762016 43128 ? - Dec09 3:08 /snap/kubectl/3117/kubectl --kubeconfig=/root/.kube/config get secrets -n kube-system -o json

Ebrar Leblebici (birru2) wrote :

As a workaround, since the "cdk.master.auth-webhook" service was creating the kubectl pids, we first stopped that service, then killed the processes listed in the error log and ran the command below for the unit in error state.

juju resolved kubernetes-master/2

Then we started the "cdk.master.auth-webhook" service again.

Chris Johnston (cjohnston) wrote :

I was able to work around this by doing:

sudo kill -9 $(pidof kubectl)

Changed in charm-kubernetes-master:
milestone: 1.29 → 1.29+ck1
Changed in charm-kubernetes-worker:
milestone: 1.29 → 1.29+ck1
Changed in charm-kubernetes-master:
assignee: George Kraft (cynerva) → nobody
Changed in charm-kubernetes-worker:
assignee: George Kraft (cynerva) → nobody
Kevin W Monroe (kwmonroe) wrote :

Hi folks, I've had a 1.29 env up for a few weeks now that has completed the 1.29.(0...3) snap refreshes without issue (no rogue kubectl processes; no snap refresh errors in debug-log). I suspect the `ignore_running` fix coupled with our new pattern for reconciling states in 1.29+ has solved this [0].

That said, comment 7 with kubectl-3117 suggests this could still affect releases < 1.29. I'm removing the 1.29 milestones and setting this bug to 'incomplete' for now; if this is still affecting folks, let us know the snapd and charmed k8s versions and we'll triage a different milestone as needed.

[0] https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/release_1.29/src/charm.py#L484-L485

Changed in charm-kubernetes-worker:
importance: High → Undecided
milestone: 1.29+ck1 → none
status: Triaged → Incomplete
Changed in charm-kubernetes-master:
importance: High → Undecided
milestone: 1.29+ck1 → none
status: Triaged → Incomplete
tags: removed: backport-needed
Chris Johnston (cjohnston) wrote (last edit):

I'm still seeing this on k-c-p 1.28/stable revno 321.

snapd package 2.61.3+22.04
snapd snap 2.62 revno 21465

Changed in charm-kubernetes-master:
status: Incomplete → New