Add service checks for nodes/system pods

Bug #1815500 reported by Drew Freiberger
This bug affects 2 people
Affects: Kubernetes Control Plane Charm
Status: Fix Released
Importance: Wishlist
Assigned to: Mike Wilson
Milestone: 1.17

Bug Description

There are currently no monitors that check whether Kubernetes workers/nodes are active in the cluster.

We recently found that a snap-based kubelet had not restarted properly and had dropped out of the Kubernetes cluster, causing several kube-system pods to become unavailable due to CA issues.

Example states from kubectl get nodes and kubectl get pods --namespace kube-system:

$ kubectl get nodes
NAME                   STATUS     ROLES    AGE   VERSION
ip-172-25-10-242.xyz   Ready      <none>   62d   v1.12.5
ip-172-25-11-113.xyz   Ready      <none>   62d   v1.12.5
ip-172-25-11-152.xyz   NotReady   <none>   62d   v1.12.4
ip-172-25-11-29.xyz    Ready      <none>   62d   v1.12.5

$ kubectl get pods --namespace kube-system
NAME                                              READY   STATUS    RESTARTS   AGE
calico-policy-controller-69cf89bf65-czxs7         1/1     Running   0          62d
heapster-v1.6.0-beta.1-f5559b86b-vf4n6            4/4     Running   0          41h
heapster-v1.6.0-beta.1-f5559b86b-wqrwl            4/4     Unknown   0          56d
kube-dns-596fbb8fbd-z5cr9                         3/3     Running   0          56d
kubernetes-dashboard-6bdb474f78-qfgml             1/1     Running   0          41h
metrics-server-v0.3.1-67bb5c8d7-5dmf5             2/2     Running   0          62d
monitoring-influxdb-grafana-v4-65cc9bb8c8-bch4c   2/2     Running   3          56d
tiller-deploy-7b4c999868-6k7z6                    1/1     Running   0          41h
tiller-deploy-7b4c999868-7vnlk                    1/1     Unknown   3          55d

In this case, the node at 172.25.11.152 should raise an alert for being in the NotReady state. This is similar to an OpenStack nova-compute service being disabled or down, which should also generate an alert.

This incident happened after a kubelet snap update from 1.12.4 to 1.12.5 failed to restart the kubelet service properly. The existing checks for the kubelet service still reported it as running OK, so node health needs to be checked at the API layer, not just the process layer.
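As an illustration of what an API-layer check could look like, here is a minimal Nagios-style sketch in Python. It is not the check that was eventually shipped in the charm. It assumes kubectl is on the PATH and already configured for the cluster, and it reads the Ready condition from the output of kubectl get nodes -o json. The function names (not_ready_nodes, check_nodes) are hypothetical, and treating a missing Ready condition as a failure is an assumption.

```python
import json
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    bad = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Assumption: a node with no Ready condition at all counts as not ready.
        if ready is None or ready.get("status") != "True":
            bad.append(node["metadata"]["name"])
    return bad


def check_nodes(node_list):
    """Map a parsed node list to an (exit_code, message) pair."""
    bad = not_ready_nodes(node_list)
    if bad:
        return CRITICAL, "CRITICAL: nodes not Ready: " + ", ".join(sorted(bad))
    count = len(node_list.get("items", []))
    return OK, "OK: all {} nodes Ready".format(count)


def main():
    try:
        raw = subprocess.check_output(["kubectl", "get", "nodes", "-o", "json"])
        code, message = check_nodes(json.loads(raw))
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        # kubectl missing, kubectl failing, or unparsable output.
        code, message = UNKNOWN, "UNKNOWN: could not query nodes: {}".format(exc)
    print(message)
    sys.exit(code)


if __name__ == "__main__":
    main()
```

Because the check asks the API server rather than the local process table, a kubelet that is running but no longer registered with the cluster would still surface as a not-Ready node.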

We'd also like alerts for pods in a configurable set of namespaces whose status is anything other than "Running", such as the tiller and heapster pods in the Unknown status above.
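A pod check along these lines might be sketched as follows, again in Python and again only as an illustration. The namespace list, the function names, and the choice of "healthy" phases are all assumptions. Note that the STATUS column printed by kubectl is derived from several fields and does not always equal status.phase, so this sketch flags both an unhealthy phase and a NodeLost reason; whether that covers every case shown above has not been verified.

```python
import json
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical default; the request is for this set to be configurable.
DEFAULT_NAMESPACES = ["kube-system"]
# Assumption: completed pods (Succeeded) should not raise an alert.
HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list):
    """Return "namespace/name (reason)" strings for pods that look unhealthy."""
    bad = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reason = status.get("reason")
        if phase not in HEALTHY_PHASES or reason == "NodeLost":
            meta = pod["metadata"]
            bad.append("{}/{} ({})".format(
                meta.get("namespace", "?"), meta["name"], reason or phase))
    return bad


def check_pods(pod_lists):
    """Map parsed pod lists (one per namespace) to (exit_code, message)."""
    bad = [entry for pod_list in pod_lists for entry in unhealthy_pods(pod_list)]
    if bad:
        return CRITICAL, "CRITICAL: unhealthy pods: " + ", ".join(sorted(bad))
    return OK, "OK: all monitored pods healthy"


def main(namespaces):
    try:
        pod_lists = [
            json.loads(subprocess.check_output(
                ["kubectl", "get", "pods", "--namespace", ns, "-o", "json"]))
            for ns in namespaces
        ]
        code, message = check_pods(pod_lists)
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        code, message = UNKNOWN, "UNKNOWN: could not query pods: {}".format(exc)
    print(message)
    sys.exit(code)


if __name__ == "__main__":
    # Namespaces may be passed as arguments, e.g. "check_pods.py kube-system default".
    main(sys.argv[1:] or DEFAULT_NAMESPACES)
```

Taking the namespaces from the command line keeps the set configurable from the monitoring side without changing the script.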

In my opinion, kube-system pods should be monitored as part of the undercloud, since this is where services installed by the CDK charms, such as dashboard, tiller, and heapster, are run.

Xav Paice (xavpaice)
Changed in charm-kubernetes-master:
status: New → Triaged
importance: Undecided → Wishlist
assignee: nobody → Mike Wilson (knobby)
tags: added: monitoring
Mike Wilson (knobby) wrote :

Where would you like to see the pod monitoring? Are you using Prometheus to monitor the cluster, or just Nagios? If just Nagios, how would you suggest handling a check that is not tied to a single host?

Mike Wilson (knobby) wrote :
Changed in charm-kubernetes-master:
status: Triaged → In Progress
milestone: none → 1.17
Kevin W Monroe (kwmonroe) wrote :
Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released