Kubernetes Worker Charm

check_node can block longer than nagios is willing to wait

Bug #1890649 reported by Adam Dyess on 2020-08-06

This bug affects 2 people

Affects                  Status        Importance  Assigned to  Milestone
Kubernetes Worker Charm  Fix Released  Medium      Unassigned   Kubernetes Worker Charm 1.19+ck1

Bug Description

When the check_node command runs, it calls the following via subprocess:

/snap/bin/kubectl --kubeconfig /var/lib/nagios/.kube/config get no -o=yaml

If this command blocks for 20 or 30 seconds, NRPE gives up before it completes, resulting in an invalid check.

If I run the command locally (it takes 30s to fail):
$ sudo /snap/bin/kubectl --kubeconfig /var/lib/nagios/.kube/config get no -o=yaml
Unable to connect to the server: net/http: TLS handshake timeout

I can see there's an issue connecting to the master. Well, that's nice, but Nagios didn't know this happened.

Maybe increase the expected timeout, or update the check to catch this type of error.
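
A minimal sketch of the second option, assuming the check is a Python script that shells out to kubectl as described above. The run_check function name and the 10 second timeout are hypothetical; the timeout would have to be shorter than whatever limit NRPE applies to the check. This is only an illustration, not the charm's actual check_node code.

```python
import subprocess

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Command taken from the bug description.
KUBECTL_CMD = [
    "/snap/bin/kubectl",
    "--kubeconfig", "/var/lib/nagios/.kube/config",
    "get", "no", "-o=yaml",
]


def run_check(cmd=KUBECTL_CMD, timeout=10):
    """Run cmd and return (nagios_exit_code, message).

    timeout is an assumed value in seconds, not one taken from the charm.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: command timed out after {}s".format(timeout)
    except OSError as err:
        # e.g. the kubectl binary is missing or not executable
        return UNKNOWN, "UNKNOWN: could not run command: {}".format(err)

    if result.returncode != 0:
        # e.g. "Unable to connect to the server: net/http: TLS handshake timeout"
        return CRITICAL, "CRITICAL: {}".format(result.stderr.strip())
    return OK, "OK: command succeeded"
```

A wrapper script would print the returned message and exit with the returned code, so a blocked or failing kubectl call reaches Nagios as a CRITICAL result instead of an NRPE timeout.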

George Kraft (cynerva) wrote on 2020-08-06:

We should be able to repro this by deploying Charmed Kubernetes with Nagios, stopping the snap.kube-apiserver.daemon service on all masters, and then observing the kubernetes-worker status as reported by Nagios.

Changed in charm-kubernetes-worker:
importance:	Undecided → Medium
status:	New → Triaged

Adam Dyess (addyess) wrote on 2020-08-28:

Raised PR to address: https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/70

George Kraft (cynerva) on 2020-08-28

tags:	added: review-needed

George Kraft (cynerva) wrote on 2020-11-25:

It looks like this went out with 1.19+ck1.

tags:	removed: review-needed
Changed in charm-kubernetes-worker:
milestone:	none → 1.19+ck1
status:	Triaged → Fix Released