lxd

Nagios reports "NRPE: Unable to read output" for ntp check

Bug #1921392 reported by Andre Ruiz
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
NRPE Charm
Won't Fix
Medium
Unassigned
lxd
New
Undecided
Unassigned

Bug Description

I'm seeing a status "UNKNOWN: NRPE: Unable to read output" in a check for NTP in nagios.

I have dozens of checks of this same charm and only one is reporting that.

When entering the unit and manually running the NRPE check, I get:

root@brtlvlty0593co:~# /opt/ntpmon-ntp-charm/check_ntpmon.py
Traceback (most recent call last):
  File "/opt/ntpmon-ntp-charm/check_ntpmon.py", line 80, in <module>
    main()
  File "/opt/ntpmon-ntp-charm/check_ntpmon.py", line 71, in main
    checkobjs = ntpchecks(args.check, debug=args.debug)
  File "/opt/ntpmon-ntp-charm/process.py", line 141, in ntpchecks
    (output, elapsed) = execute('peers', debug=debug, implementation=implementation)
  File "/opt/ntpmon-ntp-charm/process.py", line 105, in execute
    output = execute_subprocess(cmd, timeout=timeout, debug=debug, errfatal=errfatal)
  File "/opt/ntpmon-ntp-charm/process.py", line 46, in execute_subprocess
    output = subprocess.check_output(
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ntpq'

While in another (good) machine I get:

root@brtlvlty0589co:/opt/ntpmon-ntp-charm# /opt/ntpmon-ntp-charm/check_ntpmon.py
OK: offset is -0.000331 | frequency=-6.007000 offset=-0.000331 peers=8 reach=100.000000 result=0 rootdelay=0.009868 rootdisp=0.001637 runtime=157263 stratum=3 sync=1.000000 sysjitter= sysoffset=0.000040687 tracehosts= traceloops= tracetime=

So, looking at the code, it tires to detect if the server is running ntpd or chrony, and uses different external commands to get information accordingly to the daemon it detects. This code is located at /opt/ntpmon-ntp-charm/process.py. It is a library (imported from the main program) but it also has a main() function that can be used to print debugging info. This is what I get from those same two instances:

The bad one:

root@brtlvlty0593co:/opt/ntpmon-ntp-charm# python3 ./process.py
Running ntpd
{'runtime': 165948.77594351768}
Traceback (most recent call last):
  File "./process.py", line 220, in <module>
    main()
  File "./process.py", line 213, in main
    checkobjs = ntpchecks(checks, debug=True, implementation=implementation)
  File "./process.py", line 141, in ntpchecks
    (output, elapsed) = execute('peers', debug=debug, implementation=implementation)
  File "./process.py", line 105, in execute
    output = execute_subprocess(cmd, timeout=timeout, debug=debug, errfatal=errfatal)
  File "./process.py", line 46, in execute_subprocess
    output = subprocess.check_output(
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ntpq'

The good one:

root@brtlvlty0589co:/opt/ntpmon-ntp-charm# python3 ./process.py
Running chronyd
{'runtime': 157434.05309963226}
#,x,PHC0,0,3,377,12,-6.935307503,-6.935307503,0.000000067
^,+,10.238.74.66,2,10,377,141,-0.003402711,-0.003428298,0.012015752
^,*,10.129.178.22,2,6,377,9,0.002039364,0.002021294,0.006889783
^,+,10.129.178.21,3,6,377,35,0.002996356,0.002977652,0.007242468
^,+,10.238.74.67,2,10,377,1734,-0.001577901,-0.001753645,0.013376529
^,+,10.129.219.1,3,6,377,58,-0.000416808,-0.000436077,0.006574661
^,+,10.129.219.2,3,6,377,3,-0.000451969,-0.000451969,0.006648029
^,+,10.129.219.3,3,6,377,21,-0.002105824,-0.002124204,0.008361530

elapsed time: 0.003 seconds
0A81B216,10.129.178.22,3,1616682657.673360774,0.000058228,-0.000018070,0.000019482,-5.923,0.001,0.022,0.009868151,0.001558296,64.1,Normal

elapsed time: 0.004 seconds
ntpmon frequency=-5.923000000,offset=-0.000456429,reach=100.000000000,rootdelay=0.009868151,rootdisp=0.001558296,runtime=157434.159697294,stratum=3i,sysoffset=0.000058228
ntpmon_peers,peertype=backup count=0i
ntpmon_peers,peertype=excess count=0i
ntpmon_peers,peertype=false count=1i
ntpmon_peers,peertype=invalid count=0i
ntpmon_peers,peertype=outlier count=0i
ntpmon_peers,peertype=pps count=0i
ntpmon_peers,peertype=survivor count=7i
ntpmon_peers,peertype=sync count=1i

So, it clearly thinks one of them is using chrony and the other is using ntpd. But, in both cases, ntpd package is not even installed (but chrony is):

root@brtlvlty0593co:/opt/ntpmon-ntp-charm# dpkg -l | grep -i -e chrony -e ntp
ii chrony 3.5-6ubuntu6.2 amd64 Versatile implementation of the Network Time Protocol
ii ntpdate 1:4.2.8p12+dfsg-3ubuntu4 amd64 client for setting system time from NTP servers (deprecated)
rc systemd-timesyncd 245.4-4ubuntu3.5 amd64 minimalistic service to synchronize local time with NTP servers

root@brtlvlty0589co:/opt/ntpmon-ntp-charm# dpkg -l | grep -i -e chrony -e ntp
ii chrony 3.5-6ubuntu6.2 amd64 Versatile implementation of the Network Time Protocol
ii ntpdate 1:4.2.8p12+dfsg-3ubuntu4 amd64 client for setting system time from NTP servers (deprecated)
rc systemd-timesyncd 245.4-4ubuntu3.5 amd64 minimalistic service to synchronize local time with NTP servers

Again checking the code, it is not checking for the installed package but instead by which daemon is *running* in the proccess list. And this is the problem, we can find an "ntpd" running in one of them:

root@brtlvlty0593co:/opt/ntpmon-ntp-charm# ps axuw | grep -i ntpd
1000111 245835 0.0 0.0 107772 4444 ? Ssl Mar23 0:11 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 111:116

And looking closely, it is inside a snap:

root 30547 0.0 0.0 1898184 17428 ? Ss Mar23 0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers juju-ce5995-2-lxd-3
1000000 30568 0.0 0.0 225560 9544 ? Ss Mar23 0:20 \_ /sbin/init
1000111 245835 0.0 0.0 107772 4444 ? Ssl Mar23 0:11 \_ /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 111:116

It makes sense to have lxd initialized because these are baremetals with lots of lxds inside (both examples). But the question is why one of them initialized ntpd and the other did not.

Also, the ntpmon check should not be relying on any ntpd running to "guess" it should use ntpd commands to get status.

Workaround:

Edit process.py, on line 166, change 'ntpd' for something else like 'ntpdxxx', it will not match anymore and chrony clients will be used.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Added lxd because it seems wrong to start ntpd inside the charm when the host is managing time using chrony. And, if this is right, then it's unclear why it did not happen on the other units.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

*** I meant "inside the snap"

Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :

I'm seeing the same issue on the baremetal nodes where ceph-fs is running in the LXD container.
Since this is blocking the deployment with an "UNKNOWN" status in Nagios, I'm going to subscribe this with field-high.

Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :

sorry, since this is not a warning or critical, and unexpected ntp running in the container, I'm unsubscribing field-high

Revision history for this message
Nobuto Murata (nobuto) wrote :
Revision history for this message
Eric Chen (eric-chen) wrote :

NRPE charm is under maintenance mode. Only critical bug will be handled.

Changed in charm-nrpe:
importance: Undecided → Medium
status: New → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.