Nagios plugin fails with Unable to run kubectl and parse output

Bug #1866382 reported by David Coronel
This bug affects 5 people
Affects                  Status        Importance  Assigned to  Milestone
Kubernetes Worker Charm  Fix Released  High        Joseph Borg
NRPE Charm               Fix Released  High        Joe Guo

Bug Description

The Nagios check nagios-<hostname>-node fails with a critical error "Unable to run kubectl and parse output" in Nagios.

Running the command manually on the node as the nagios user results in this error:

nagios@<hostname>:~$ /usr/lib/nagios/plugins/check_k8s_worker.py

2020/03/06 17:47:15.318776 cmd_run.go:884: WARNING: cannot create user data directory: cannot create "/var/lib/nagios/snap/kubectl/1424": mkdir /var/lib/nagios/snap: permission denied

cannot create user data directory: /var/lib/nagios/snap/kubectl/1424: Permission denied

Unable to run kubectl and parse output

Additional details:

kubernetes-worker charm revision 634
nagios charm revision 36


David Coronel (davecore) wrote :

subscribed ~field-medium

George Kraft (cynerva)
Changed in charm-kubernetes-worker:
status: New → Confirmed
George Kraft (cynerva) wrote :

Looks to me like the nagios user has HOME=/var/lib/nagios and as a result, fails to run kubectl due to this bug in snapd: https://bugs.launchpad.net/snapd/+bug/1776800
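
To illustrate where the failing path in the log comes from, here is a minimal sketch. It assumes the usual per-user snap data layout of $HOME/snap/<snap>/<revision>, which matches the path in the error above. The helper names are hypothetical and are not part of snapd or the charm. The check only tells you whether the user running the script could create the directory, so it would have to be run as nagios to be meaningful.

```python
import os


def snap_user_data_dir(home, snap_name, revision):
    """Build the per-user data directory a snap would try to create."""
    return os.path.join(home, "snap", snap_name, str(revision))


def first_existing_ancestor(path):
    """Walk upwards from path until a component that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def can_create(path):
    """True if the current user may create entries under the nearest
    existing ancestor of path (so a mkdir -p style call could succeed)."""
    ancestor = first_existing_ancestor(path)
    return os.path.isdir(ancestor) and os.access(ancestor, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Values taken from the log in the bug description.
    target = snap_user_data_dir("/var/lib/nagios", "kubectl", 1424)
    print(target, "creatable by current user:", can_create(target))
```

If /var/lib/nagios is owned by root with mode drwxr-xr-x, running this as nagios would be expected to report that the directory cannot be created, which is consistent with the "permission denied" in the log.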

Changed in charm-kubernetes-worker:
importance: Undecided → Medium
George Kraft (cynerva)
Changed in charm-kubernetes-worker:
status: Confirmed → Triaged
Joseph Borg (joeborg) wrote :

ubuntu@ip-172-31-41-64:~$ sudo runuser -u nagios /usr/lib/nagios/plugins/check_k8s_worker.py
OK - No memory, disk, or PID pressure. Registered with API server

I can't reproduce this with

kubernetes-worker charm revision 661
nagios charm revision 37

Are you able to, David? If so, please let me know.

Many thanks,
Joe

George Kraft (cynerva)
Changed in charm-kubernetes-worker:
importance: Medium → High
David Coronel (davecore) wrote :

I don't have access to this environment anymore.

Joseph Borg (joeborg)
Changed in charm-kubernetes-worker:
assignee: nobody → Joseph Borg (joeborg)
Chris Johnston (cjohnston) wrote :

This is still happening, but not 100% of the time. What we have found is on working machines:

ls -la /var/lib/nagios/
total 20
drwxr-xr-x 5 nagios nagios 4096 Jul 29 17:15 .
drwxr-xr-x 54 root root 4096 Jul 29 17:08 ..
drwxr-xr-x 4 nagios nagios 4096 Jul 29 17:15 .kube
drwxr-xr-x 2 root root 4096 Jul 29 17:09 export
drwxr-xr-x 3 nagios nagios 4096 Jul 29 17:15 snap

But on non-working machines:
drwxr-xr-x 5 root root 4096 Jul 29 17:15 .

Doing a chown nagios:nagios /var/lib/nagios makes the issue go away.
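
For checking many units without reading ls output by eye, a minimal Python sketch of the same ownership check follows. The function names are hypothetical, and the expected owner nagios:nagios is taken from the working-machine listing above.

```python
import grp
import os
import pwd


def owner_and_group(path):
    """Return (user, group) names for path, falling back to numeric IDs
    when the uid or gid has no entry in the user or group database."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group


def is_owned_by(path, user, group):
    """True if path is owned by exactly this user and group."""
    return owner_and_group(path) == (user, group)


if __name__ == "__main__":
    path = "/var/lib/nagios"
    if os.path.exists(path):
        print(path, owner_and_group(path),
              "OK" if is_owned_by(path, "nagios", "nagios") else "WRONG OWNER")
    else:
        print(path, "does not exist")
```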

Joseph Borg (joeborg) wrote :

Moving to NRPE as I believe this needs to be fixed in this charm.

affects: charm-kubernetes-worker → charm-nrpe
Changed in charm-nrpe:
assignee: Joseph Borg (joeborg) → nobody
status: Triaged → Confirmed
Changed in charm-kubernetes-master:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Borg (joeborg)
Hua Zhang (zhhuabj) wrote :

Since this problem does not happen on all workers, as Chris mentioned, I suspect some sort of race condition. Let's imagine one possible scenario:

1, charm-kubernetes-worker runs ahead of charm-nrpe, so L1081 [1] will create /var/lib/nagios/.kube/config

2, moreover, L1083 [3] implies that the user nagios already exists at this time.

3, when charm-nrpe starts to install nagios-nrpe-server, the line 'adduser --system --group --home /var/lib/nagios --quiet nagios' [2] will not be run because the user nagios exists. So /var/lib/nagios will be owned by root rather than nagios.

I also tried to simulate this scenario with the following commands.

1, apt remove --purge -y nagios-nrpe-server && userdel -r nagios && rm -rf /var/lib/nagios #reset test env
2, mkdir -p /var/lib/nagios/.kube/config #it's equivalent to L1081 [1]
3, useradd nagios #not clear who creates the user nagios without homedir /var/lib/nagios
4, chown -R nagios:nagios /var/lib/nagios/.kube #it's equivalent to L1083 [3]
5, nagios-nrpe-server.preinst [2] will not run 'adduser --system --group --home /var/lib/nagios --quiet nagios' because the user nagios exists

root@test:~# ls -al /var/lib/nagios/
total 12
drwxr-xr-x 3 root root 4096 Jul 31 07:26 .
drwxr-xr-x 40 root root 4096 Jul 31 07:26 ..
drwxr-xr-x 3 nagios nagios 4096 Jul 31 07:26 .kube

But it's still not clear who creates the user nagios in step 3 above. Chris, can you please paste the output of the following two commands from your non-working machines?

sudo -u nagios -- echo $HOME
id nagios -g -n

[1] https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/master/reactive/kubernetes_worker.py#L1081
[2] https://git.launchpad.net/ubuntu/+source/nagios-nrpe/tree/debian/nagios-nrpe-server.preinst?h=applied/ubuntu/bionic#n28
[3] https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/master/reactive/kubernetes_worker.py#L1083

Hua Zhang (zhhuabj) wrote :

'sudo -u nagios -- echo $HOME' will always return the calling user's home (/home/ubuntu here), because the calling shell expands $HOME before sudo runs. So we can replace it with the following command to confirm whether the user nagios was created by nagios-nrpe-server.preinst's 'adduser --system --group --home /var/lib/nagios --quiet nagios'

cat /etc/passwd | awk -F: '{printf "User %s Home %s\n", $1, $6}' |grep nagios
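
The same lookup can be sketched in Python, in case that is easier to run across units. `home_of` asks the system user database through the standard pwd module, and `parse_passwd` mirrors the awk one-liner on passwd-formatted text. Both function names are hypothetical, and the uid and gid in the sample line are made up for illustration.

```python
import pwd


def home_of(username):
    """Return the home directory recorded for username, or None if the
    user does not exist."""
    try:
        return pwd.getpwnam(username).pw_dir
    except KeyError:
        return None


def parse_passwd(text):
    """Map user name -> home directory from passwd-formatted text
    (fields 1 and 6, as in the awk command)."""
    homes = {}
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) >= 7:
            homes[fields[0]] = fields[5]
    return homes


if __name__ == "__main__":
    # Illustrative line only; the numeric IDs are assumptions.
    sample = "nagios:x:113:117::/var/lib/nagios:/usr/sbin/nologin"
    print(parse_passwd(sample))
    print("nagios home on this machine:", home_of("nagios"))
```

A home of /var/lib/nagios would be consistent with the adduser call in the preinst script, while a different home would suggest the user was created some other way.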

Hua Zhang (zhhuabj) wrote :

More code analysis

1, The hook nrpe-install installs nagios-nrpe-server package [2], then nagios-nrpe-server package will invoke nagios-nrpe-server.preinst to run 'adduser --system --group --home /var/lib/nagios --quiet nagios' [3]

2, nrpe requires interface-nrpe-external-master [1] so interface-nrpe-external-master will set the flag nrpe-external-master.available [4] after running 'juju add-relation nrpe:nrpe-external-master kubernetes-worker:nrpe-external-master'

3, charm-kubernetes-worker creates the dir /var/lib/nagios/.kube/config in update_nrpe_config after the flag nrpe-external-master.available is set [5].

The above process seems to be linear, so nrpe#nrpe-install should always run before charm-kubernetes-worker#update_nrpe_config. However, other flags such as nrpe-external-master.reconfigure (set at L631 [7]) can also trigger update_nrpe_config, so in theory a race condition exists here. If that's true, I drafted the following patch.

$ git diff
diff --git a/reactive/kubernetes_worker.py b/reactive/kubernetes_worker.py
index 4131a61..11216de 100644
--- a/reactive/kubernetes_worker.py
+++ b/reactive/kubernetes_worker.py
@@ -1070,6 +1070,11 @@ def get_kube_api_servers(kube_api):
           'config.changed.nagios_servicegroups',
           'nrpe-external-master.reconfigure')
 def update_nrpe_config():
+    # if /var/lib/nagios doesn't exist, just wait for nagios-nrpe-server
+    # to create it, then this function will be run again due to
+    # nrpe-external-master.reconfigure (LP: #1866382)
+    if not os.path.isdir(nrpe.homedir):
+        time.sleep(1)
     services = ['snap.{}.daemon'.format(s) for s in worker_services]
     data = render('nagios_plugin.py', context={'node_name': get_node_name()})
     plugin_path = install_nagios_plugin_from_text(data,

However, that's just my guess. All the problematic environments have now been fixed with the workaround, and I can't reproduce it either, so I am not 100% sure my guess is correct.

[1] https://git.launchpad.net/charm-nrpe/tree/metadata.yaml?h=stable/20.05#n20
[2] https://git.launchpad.net/charm-nrpe/tree/hooks/nrpe_utils.py?h=stable/20.05#n33
[3] https://git.launchpad.net/ubuntu/+source/nagios-nrpe/tree/debian/nagios-nrpe-server.preinst?h=applied/ubuntu/bionic#n28
[4] https://github.com/cmars/nrpe-external-master-interface/blob/master/provides.py#L16
[5] https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/master/reactive/kubernetes_worker.py#L1066
[6] https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/master/reactive/kubernetes_worker.py#L1071
[7] https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/master/reactive/kubernetes_worker.py#L631
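
The draft patch above sleeps once for a second and then carries on. Purely as an illustration of the waiting idea, a bounded polling helper could look like the sketch below. `wait_for_dir` and its parameters are hypothetical names that do not exist in the charm, and blocking inside a hook may not be desirable, so this is a sketch under those assumptions rather than a proposed fix.

```python
import os
import time


def wait_for_dir(path, timeout=5.0, interval=0.5):
    """Poll until path is a directory or timeout seconds have passed.

    Returns True if the directory appeared in time, False otherwise.
    The path is always checked at least once, even with timeout=0.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.isdir(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # /var/lib/nagios is the home directory discussed in this bug.
    if wait_for_dir("/var/lib/nagios", timeout=2.0):
        print("home directory present")
    else:
        print("home directory still missing, try again on the next hook run")
```

Returning False instead of raising lets the caller decide whether to skip the work and rely on a later run.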

Adam Dyess (addyess) wrote :

I found this workaround applied to this bug to be successful:

$ juju run -a kubernetes-worker 'chown nagios:nagios /var/lib/nagios/'

Hua Zhang (zhhuabj) wrote :

Added charm-kubernetes-worker as a project that might be affected. We don't have a reproducer yet, so we haven't determined which side is better to fix: charm-nrpe or charm-kubernetes-worker.

Hua Zhang (zhhuabj) wrote :

Since we don't have a reproducer yet, I wonder whether we should first submit a defensive patch similar to the following. It's harmless, and it should also solve the problem.

$ git diff
diff --git a/reactive/kubernetes_worker.py b/reactive/kubernetes_worker.py
index 4131a61..e46b0a3 100644
--- a/reactive/kubernetes_worker.py
+++ b/reactive/kubernetes_worker.py
@@ -1094,6 +1094,7 @@ def update_nrpe_config():
cmd = ['chown', '-R', 'nagios:nagios',
os.path.dirname(nrpe_kubeconfig_path)]
check_call(cmd)
+ check_call(['chown', 'nagios:nagios', '/var/lib/nagios'])

remove_state('nrpe-external-master.reconfigure')
set_state('nrpe-external-master.initial-config')
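
A slightly more guarded variant of the chown idea, as a minimal sketch: only call chown when the owner actually differs, so repeated hook runs are no-ops. `ensure_owner` is a hypothetical helper and is not charm code. Changing ownership to another user normally requires root.

```python
import os


def ensure_owner(path, uid, gid):
    """Make path owned by uid:gid, touching it only when needed.

    Returns True if ownership was changed, False if it was already
    correct. Not recursive: only path itself is changed.
    """
    st = os.stat(path)
    if (st.st_uid, st.st_gid) == (uid, gid):
        return False
    os.chown(path, uid, gid)
    return True


# Example use (assumes the nagios user exists and the caller is root):
#   import pwd
#   entry = pwd.getpwnam("nagios")
#   ensure_owner("/var/lib/nagios", entry.pw_uid, entry.pw_gid)
```

Returning whether anything changed makes it easy to log the repair, which could help confirm how often the wrong ownership actually occurs.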

George Kraft (cynerva)
Changed in charm-kubernetes-worker:
importance: Undecided → High
status: New → Triaged
no longer affects: charm-kubernetes-master
Joseph Borg (joeborg)
Changed in charm-kubernetes-worker:
assignee: nobody → Joseph Borg (joeborg)
status: Triaged → In Progress
Joseph Borg (joeborg) wrote :
tags: added: needs-review
tags: removed: needs-review
Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
milestone: none → 1.19+ck1
Adam Dyess (addyess) wrote :

It doesn't seem that this bug was ever due to anything incorrect in charm-nrpe, but rather that the k8s-worker created a directory owned by root before the nagios package was installed. It seems that a proper solution has been applied to the k8s-worker and there is nothing left to do in the nrpe charm.

Changed in charm-nrpe:
status: Confirmed → Won't Fix
no longer affects: charm-nrpe
tags: added: backport-needed
tags: removed: backport-needed
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released
Xav Paice (xavpaice) wrote :

Added charm-nrpe, as the nrpe charm needs to make sure that the directory permissions are correct. See LP:#1904045 and LP:#1906991 for similar issues where the same directory has incorrect permissions for various reasons, fixed by setting the correct permissions.

We can prevent race conditions like this in the future if we use Juju to ensure the directory is correct, either by update-status checks or on config-changed.

Joe Guo (guoqiao)
Changed in charm-nrpe:
status: New → Fix Committed
importance: Undecided → High
assignee: nobody → Joe Guo (guoqiao)
milestone: none → 21.04
Celia Wang (ziyiwang)
Changed in charm-nrpe:
status: Fix Committed → Fix Released