check_node can block longer than nagios is willing to wait
Bug #1890649 reported by
Adam Dyess
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Kubernetes Worker Charm | Fix Released | Medium | Unassigned |
Bug Description
When the check_node command runs, it calls the following via subprocess:
/snap/bin/kubectl --kubeconfig /var/lib/
If this command blocks for 20 or 30 seconds, NRPE gives up before it returns, resulting in an invalid check.
If I run the command locally (it takes 30s to fail):
$ sudo /snap/bin/kubectl --kubeconfig /var/lib/
Unable to connect to the server: net/http: TLS handshake timeout
I can see there's an issue connecting to the master. Well that's nice -- but nagios didn't know this happened.
Maybe increase the expected timeout, or update the check to catch this type of error.
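To illustrate the second option, here is a minimal sketch of how a check could bound the kubectl call itself and turn a hang or connection failure into an explicit Nagios result. This is not the charm's actual check_node code: the `run_check` helper, its `timeout_s` default of 10 seconds, and the choice to report a timeout as CRITICAL are all assumptions for illustration. The command to run is passed in by the caller, so the kubectl arguments stay whatever the real check uses.

```python
import subprocess

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


def run_check(cmd, timeout_s=10):
    """Hypothetical helper: run cmd and return (nagios_status, message).

    The call gives up after timeout_s seconds instead of blocking
    until the caller (for example NRPE) stops waiting.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # Assumption: a hung kubectl is reported as CRITICAL.
        return NAGIOS_CRITICAL, "command timed out after {}s".format(timeout_s)
    except OSError as exc:
        # The binary is missing or cannot be executed.
        return NAGIOS_UNKNOWN, "could not run command: {}".format(exc)
    if result.returncode != 0:
        # Covers errors such as "Unable to connect to the server: ...".
        return NAGIOS_CRITICAL, result.stderr.strip() or "command failed"
    return NAGIOS_OK, result.stdout.strip()
```

For this to help, `timeout_s` would need to be shorter than whatever NRPE is configured to wait, so that the check always returns a status of its own before NRPE gives up.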
Tags added: review-needed
We should be able to repro this by deploying Charmed Kubernetes with Nagios, stopping the snap.kube-apiserver.daemon service on all masters, and then observing the kubernetes-worker status as reported by Nagios.