check_ntpmon fails if the host machine runs an LXC container that uses a different NTP service
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
NTP Charm | New | Undecided | Unassigned |
Bug Description
Machine: Bionic (running chronyd)
LXC: Xenial (running ntpd)
After upgrading the NTP charm to the latest revision (rev 47), I received an alert: "Unknown error: NRPE: Unable to read output". Running the check manually, I saw that it was trying to run an "ntpq" command on a machine running chrony. Diving into the code, I found that ntpmon-
EX:
$ ps aux | egrep 'chronyd|ntpd'
100112 1547742 0.0 0.0 110616 5348 ? Ssl 2020 62:49 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 112:116
_chrony 1701357 0.0 0.0 105588 2876 ? S 15:57 0:00 /usr/sbin/chronyd
As a workaround, I just edited the NTPProcess.names list, removing ntpd entirely so it only checks for chronyd (which we know is running).
As a proper fix, NTPProcess should be reworked to select the correct NTP implementation more reliably.
This was found on a focal deployment with cs:ntp deployed as a subordinate to the metal charm, and ceph-fs deployed in an LXD container on top of that metal. Which of the two processes appears first in the process table, ntpd in the container or chronyd on the metal, is random.
For a very simple fix, since we typically would not run cs:ntp inside a container, you could ignore containerized ntpd processes by requiring the PPID of the process to be 1, in addition to matching the process name against the list of options.
Suggest changing process.py line 180 from:
ppid = proc.ppid() if self.PSUTIL2 else proc.ppid
if name in self.names:
to:
if ppid == 1 and name in self.names:
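The effect of the proposed condition can be sketched independently of ntpmon. The snippet below is a minimal illustration, not the charm's actual code: `ProcInfo`, `select_ntp_process`, and `NTP_NAMES` are hypothetical names, and the PPID given for ntpd is invented. Only the PIDs and process names come from the `ps` output above. It assumes that, seen from the host, a containerized daemon's parent is the container's init process rather than PID 1.

```python
from typing import Iterable, NamedTuple, Optional, Sequence


class ProcInfo(NamedTuple):
    """Hypothetical stand-in for the fields read from a process entry."""
    pid: int
    ppid: int
    name: str


# Assumed stand-in for NTPProcess.names; the real list may differ.
NTP_NAMES = ("chronyd", "ntpd")


def select_ntp_process(
    procs: Iterable[ProcInfo],
    names: Sequence[str] = NTP_NAMES,
    require_init_parent: bool = True,
) -> Optional[ProcInfo]:
    """Return the first process whose name matches, or None.

    With require_init_parent=True, processes whose parent is not PID 1
    are skipped, which is the condition proposed above.
    """
    for proc in procs:
        if proc.name not in names:
            continue
        if require_init_parent and proc.ppid != 1:
            continue
        return proc
    return None


# Process table in an order where the container's ntpd is listed first.
# The PPID 1546000 is invented for illustration.
procs = [
    ProcInfo(pid=1547742, ppid=1546000, name="ntpd"),
    ProcInfo(pid=1701357, ppid=1, name="chronyd"),
]

without_filter = select_ntp_process(procs, require_init_parent=False)
with_filter = select_ntp_process(procs)
print(without_filter.name, with_filter.name)
```

Without the PPID condition the first name match wins, so the result depends on process table order. With it, the container's ntpd is skipped and the host's chronyd is selected. Note that this would also skip a legitimate daemon that some supervisor other than PID 1 started.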
Ultimately, checking which NTP package(s) are installed on the current system would be a more reliable way to determine which self.names entries should be used.
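One possible shape for such a package-based check is sketched below. This is an assumption-laden illustration rather than a patch for ntpmon: `PACKAGE_TO_DAEMON`, `dpkg_installed`, and `expected_daemon_names` are hypothetical names, and the mapping assumes the Ubuntu package names `chrony` and `ntp`. It queries dpkg through `dpkg-query`, so it only applies to Debian-based systems.

```python
import subprocess
from typing import Callable, Dict, List

# Assumed mapping from Ubuntu package name to the daemon it provides.
PACKAGE_TO_DAEMON: Dict[str, str] = {"chrony": "chronyd", "ntp": "ntpd"}


def dpkg_installed(package: str) -> bool:
    """Return True if dpkg reports the package as fully installed."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", package],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        # dpkg-query is not available, e.g. on a non-Debian system.
        return False
    return (
        result.returncode == 0
        and result.stdout.strip() == "install ok installed"
    )


def expected_daemon_names(
    is_installed: Callable[[str], bool] = dpkg_installed,
    mapping: Dict[str, str] = PACKAGE_TO_DAEMON,
) -> List[str]:
    """List daemon names whose package is installed on this system."""
    return [daemon for pkg, daemon in mapping.items() if is_installed(pkg)]


if __name__ == "__main__":
    # On the bionic host from this report, this would be expected to
    # list only chronyd, since the ntp package lives in the container.
    print(expected_daemon_names())
```

The package check is passed in as a parameter so the selection logic can be exercised without dpkg. The returned list could then take the place of the hard-coded names when matching processes.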