check_node can block longer than nagios is willing to wait

Bug #1890649 reported by Adam Dyess
This bug affects 2 people
Affects: Kubernetes Worker Charm
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: 1.19+ck1

Bug Description

When the check_node command runs, it calls the following via subprocess:

/snap/bin/kubectl --kubeconfig /var/lib/nagios/.kube/config get no -o=yaml

If this command blocks for 20 or 30 seconds, NRPE gives up before it returns, resulting in an invalid check.

If I run the command locally (it takes 30s to fail):
$ sudo /snap/bin/kubectl --kubeconfig /var/lib/nagios/.kube/config get no -o=yaml
Unable to connect to the server: net/http: TLS handshake timeout

I can see there's an issue connecting to the master. Well, that's nice, but Nagios didn't know this happened.

Maybe increase the expected timeout, or update the check to catch this type of error.
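
As a rough illustration of the second option, the sketch below wraps a subprocess call in a timeout and maps a timeout or a non-zero exit onto the standard Nagios plugin exit codes. This is not the charm's actual check_node code: the names run_check, main and KUBECTL_TIMEOUT_SECONDS are hypothetical, and the 10-second value is an assumption chosen only to be shorter than the 20 to 30 seconds observed above.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3

# Assumed value: should be shorter than the NRPE command timeout.
KUBECTL_TIMEOUT_SECONDS = 10

# The command quoted in this bug report.
KUBECTL_CMD = [
    "/snap/bin/kubectl",
    "--kubeconfig", "/var/lib/nagios/.kube/config",
    "get", "no", "-o=yaml",
]


def run_check(cmd, timeout=KUBECTL_TIMEOUT_SECONDS):
    """Run cmd and return a (nagios_exit_code, message) tuple."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return NAGIOS_CRITICAL, "CRITICAL: command timed out after {}s".format(timeout)
    except OSError as err:
        # For example, the binary is missing or not executable.
        return NAGIOS_UNKNOWN, "UNKNOWN: could not run command: {}".format(err)
    if result.returncode != 0:
        # Surface errors such as "TLS handshake timeout" to Nagios.
        return NAGIOS_CRITICAL, "CRITICAL: {}".format(result.stderr.strip())
    return NAGIOS_OK, "OK: node list retrieved"


def main():
    """Hypothetical plugin entry point: print the status line, return the code."""
    code, message = run_check(KUBECTL_CMD)
    print(message)
    return code
```

A plugin script built on this would end with `sys.exit(main())`, so that Nagios gets a definite CRITICAL result instead of an NRPE timeout. kubectl's own `--request-timeout` option could act as a second guard on the API call itself.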

George Kraft (cynerva) wrote :

We should be able to repro this by deploying Charmed Kubernetes with Nagios, stopping the snap.kube-apiserver.daemon service on all masters, and then observing the kubernetes-worker status as reported by Nagios.

Changed in charm-kubernetes-worker:
importance: Undecided → Medium
status: New → Triaged
Adam Dyess (addyess) wrote :
George Kraft (cynerva)
tags: added: review-needed
George Kraft (cynerva) wrote :

It looks like this went out with 1.19+ck1.

tags: removed: review-needed
Changed in charm-kubernetes-worker:
milestone: none → 1.19+ck1
status: Triaged → Fix Released