Resource: res_kube_apiserver_snap.kube_apiserver.daemon not running

Bug #1859044 reported by Joshua Genet
This bug affects 1 person
Affects                         Status        Importance  Assigned to    Milestone
Kubernetes API Load Balancer    Fix Released  Critical    George Kraft
Kubernetes Control Plane Charm  Fix Released  Critical    George Kraft
charm-interface-hacluster       Fix Released  Undecided   Unassigned

Bug Description

k8s 1.17
kubernetes-master rev-788

---

I believe this is related to this bug:
https://bugs.launchpad.net/charm-kubernetes-master/+bug/1841005
We may still not be monitoring that resource correctly.
If so, this *may* be a regression in 1.17.

---

I've got 2 of 3 masters with a blocked hacluster unit:
kubernetes-master/0 active idle 12/lxd/1 10.246.65.36 6443/tcp Kubernetes master running.
  containerd/10 active idle 10.246.65.36 Container runtime available
  filebeat/10 active idle 10.246.65.36 Filebeat ready.
  flannel/10 active idle 10.246.65.36 Flannel subnet 10.1.30.1/24
  hacluster-kubernetes-master/2 blocked idle 10.246.65.36 Resource: res_kube_apiserver_snap.kube_apiserver.daemon not running

kubernetes-master/1 active idle 13/lxd/1 10.246.65.34 6443/tcp Kubernetes master running.
  containerd/9 active idle 10.246.65.34 Container runtime available
  filebeat/9 active idle 10.246.65.34 Filebeat ready.
  flannel/9 active idle 10.246.65.34 Flannel subnet 10.1.57.1/24
  hacluster-kubernetes-master/1 blocked idle 10.246.65.34 Resource: res_kube_apiserver_snap.kube_apiserver.daemon not running

---

If I go to any of the masters, I can see that the service is in fact active:

$ systemctl status snap.kube-apiserver.daemon.service
● snap.kube-apiserver.daemon.service - Service for snap application kube-apiserver.daemon
   Loaded: loaded (/etc/systemd/system/snap.kube-apiserver.daemon.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/snap.kube-apiserver.daemon.service.d
           └─always-restart.conf, file-limit.conf
   Active: active (running) since Wed 2020-01-08 19:04:15 UTC; 2h 59min ago
 Main PID: 486957 (kube-apiserver)
    Tasks: 31 (limit: 4915)
   CGroup: /system.slice/snap.kube-apiserver.daemon.service
           └─486957 /snap/kube-apiserver/1493/kube-apiserver --allow-privileged=true --service-cluster-ip-range=10.152.183.0/24 --min-request-timeout=300 --v=4 --tls-cert

Revision history for this message
George Kraft (cynerva) wrote :

What revision of the hacluster charm are you running?

Revision history for this message
Joshua Genet (genet022) wrote :

rev-63

Here's everything:

---

App Version Status Scale Charm Store Rev OS Notes
apache2 unknown 1 apache2 jujucharms 33 ubuntu exposed
canonical-livepatch active 18 canonical-livepatch jujucharms 34 ubuntu
ceph-mon 12.2.12 active 3 ceph-mon jujucharms 44 ubuntu
ceph-osd 12.2.12 active 9 ceph-osd jujucharms 294 ubuntu
containerd active 12 containerd jujucharms 53 ubuntu
easyrsa 3.0.1 active 1 easyrsa jujucharms 295 ubuntu
elasticsearch 5.6.16 active 1 elasticsearch jujucharms 39 ubuntu
etcd 3.2.10 active 3 etcd jujucharms 485 ubuntu
filebeat 6.8.6 active 12 filebeat jujucharms 25 ubuntu
flannel 0.11.0 active 12 flannel jujucharms 466 ubuntu
grafana active 1 grafana jujucharms 38 ubuntu exposed
graylog 3.0.1 active 1 graylog jujucharms 40 ubuntu
hacluster-kubernetes-master blocked 3 hacluster jujucharms 63 ubuntu
hacluster-mysql active 3 hacluster jujucharms 63 ubuntu
hacluster-vault active 3 hacluster jujucharms 63 ubuntu
kubernetes-master 1.17.0 active 3 kubernetes-master jujucharms 788 ubuntu
kubernetes-worker 1.17.0 active 9 kubernetes-worker jujucharms 623 ubuntu exposed
mongodb 3.6.3 active 1 mongodb jujucharms 53 ubuntu
mysql 5.7.20 active 3 percona-cluster jujucharms 281 ubuntu
prometheus active 1 prometheus2 jujucharms 12 ubuntu
telegraf active 12 telegraf jujucharms 29 ubuntu
vault 1.1.1 active 3 vault jujucharms 32 ubuntu

Revision history for this message
Joshua Genet (genet022) wrote :

Ok wow, sorry that didn't paste well. Here's a pastebin:
https://pastebin.canonical.com/p/B2js3tdX4n/

Revision history for this message
George Kraft (cynerva) wrote :

Thanks. Looks like you're on the latest stable kubernetes-master and hacluster. I've confirmed that the code fixes for https://bugs.launchpad.net/charm-kubernetes-master/+bug/1841005 are present in kubernetes-master-788. We'll need to figure out why the problem is still occurring despite that.

Changed in charm-kubernetes-master:
status: New → Confirmed
Revision history for this message
Joshua Genet (genet022) wrote :

Just a little more info: this is definitely intermittent.
I stood up the same hacluster deploy today and had no issues with it. I'll be doing a fair amount of hacluster work next week, so I'll let you know if I run into it again.

Revision history for this message
Joshua Genet (genet022) wrote :

Ran into it again today. 1 of my 3 masters is in a blocked state due to "Resource: res_kube_apiserver_snap.kube_apiserver.daemon not running".

I've encountered this in 2/5 deploys.

George Kraft (cynerva)
no longer affects: charm-kubernetes-master
Changed in charm-kubernetes-master:
status: New → Confirmed
Changed in charm-kubeapi-load-balancer:
status: New → Confirmed
Revision history for this message
George Kraft (cynerva) wrote :

The bug appears to originate from interface-hacluster. There was recent work to prevent pacemaker from giving up on stopped resources[1]. However, when it was merged[2], the same fix was not applied to the SystemdService class that had also been introduced by another recent commit.

We're using the SystemdService class, and it's missing that fix. Once a resource is stopped, it stays stopped.

As a workaround, it looks like you can manually clear the stopped resource by running the cleanup action:

juju run-action hacluster/0 cleanup --wait

[1]: https://github.com/openstack/charm-interface-hacluster/commit/fe9d009520be5082f82ff99bce8d460a02bd7c93
[2]: https://github.com/openstack/charm-interface-hacluster/commit/38590837dae187a96f092236aeda61c8b2fda7cc
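To sketch the shape of the missing piece: pacemaker stops retrying a failed resource once its failcount trips the migration threshold, so the fix is to configure the resource so failures expire and the threshold is never hit. The code below is a hypothetical illustration, not the charm's actual implementation; the function name and the specific meta attribute values are assumptions based on pacemaker's general retry mechanism.

```python
def crm_primitive(name, service):
    """Build a crm configure command for a systemd-backed resource.

    Hypothetical sketch: `failure-timeout` expires the failcount so a
    stopped resource is retried rather than left failed, and
    `migration-threshold=INFINITY` keeps pacemaker from banning the
    resource from a node after repeated failures. The specific values
    here are illustrative, not taken from the actual fix.
    """
    return (
        "primitive {name} systemd:{service} "
        "op monitor interval=30s "
        "meta migration-threshold=INFINITY failure-timeout=5s"
    ).format(name=name, service=service)

print(crm_primitive("res_kube_apiserver_snap.kube_apiserver.daemon",
                    "snap.kube-apiserver.daemon"))
```

Without meta attributes along these lines, a transient apiserver restart leaves the pacemaker resource permanently "not running" even though systemd shows the unit active, which matches the symptom above.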

Revision history for this message
Joshua Genet (genet022) wrote :

Thanks for investigating and figuring out a workaround!

Revision history for this message
George Kraft (cynerva) wrote :

Attached a git diff for interface-hacluster that I believe would fix this issue. Unfortunately, I have to prioritize other work so I won't be able to test or submit this myself.

Revision history for this message
George Kraft (cynerva) wrote :

This has become a release blocker. I'm working on it now.

Changed in charm-kubeapi-load-balancer:
importance: Undecided → Critical
Changed in charm-kubernetes-master:
importance: Undecided → Critical
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubeapi-load-balancer:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
milestone: none → 1.18
Changed in charm-kubeapi-load-balancer:
milestone: none → 1.18
status: Confirmed → In Progress
Changed in charm-kubernetes-master:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-interface-hacluster (master)

Reviewed: https://review.opendev.org/718182
Committed: https://git.openstack.org/cgit/openstack/charm-interface-hacluster/commit/?id=ef1f8503f4de4edbbd8a03c8a67bae4f10a59e5a
Submitter: Zuul
Branch: master

commit ef1f8503f4de4edbbd8a03c8a67bae4f10a59e5a
Author: George Kraft <email address hidden>
Date: Tue Apr 7 13:07:40 2020 -0500

    Make SystemdService never give up on resources

    Change-Id: Icd202be7cf55f8bd883d102c81881ed15a0e5191
    Closes-Bug: #1859044

Changed in charm-interface-hacluster:
status: In Progress → Fix Released
Changed in charm-kubeapi-load-balancer:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubeapi-load-balancer:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released