default verify_ntp_servers to true and/or nagios warning for Not Synchronized

Bug #1866116 reported by Bryan Quigley
This bug affects 1 person
Affects: NTP Charm
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We should default to a setup that will alert in some way that the setup is not working.

1. Deploy ntp charm with invalid ntp servers
2. Note: Juju / Nagios say everything is great

Setting verify_ntp_servers=True by default is a good start, but a more robust nagios check might be useful too.

Setup:
juju deploy ubuntu --series xenial u16
juju deploy ubuntu --series bionic u18
juju deploy ubuntu --series focal u20 --force
juju deploy cs:ntp
juju add-relation ntp u16
juju add-relation ntp u18
juju add-relation ntp u20

juju config ntp pools=""
juju config ntp source="1.1.1.1" #Something that doesn't work

Juju status shows everything is awesome:
...
Unit Workload Agent Machine Public address Ports Message
u16/0* active idle 0 10.5.0.4 ready
  ntp/0* active idle 10.5.0.4 123/udp ntp: Ready
u18/0* active idle 1 10.5.0.31 ready
  ntp/1 active idle 10.5.0.31 123/udp chrony: Ready
u20/0* active idle 2 10.5.0.7 ready
  ntp/2 active idle 10.5.0.7 123/udp chrony: Ready

If you go ahead and set juju config ntp verify_ntp_servers=true, they all report "NTP servers are not reachable: 1.1.1.1". Instead of that option, it looks like we just need to expose the following commands.

The commands from comment #1 mostly show that it's failing:
juju run --all -- /opt/ntpmon-ntp-charm/check_ntpmon.py --check reach
- MachineId: "0" Stdout: | OK: reachability is nan% | frequency= offset=nan peers=0 reach=nan result=0 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=
- MachineId: "1" ReturnCode: 2 Stdout: | CRITICAL: reachability is too low (0.00%) - must be greater than 50.00% | frequency= offset=nan peers=1 reach=0.000000 result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=
- MachineId: "2" ReturnCode: 2 Stdout: | CRITICAL: reachability is too low (0.00%) - must be greater than 50.00% | frequency= offset=nan peers=1 reach=0.000000 result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=

- MachineId: "0" ReturnCode: 2 Stdout: | CRITICAL: No sync peer selected | frequency= offset=nan peers=0 reach=nan result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=
- MachineId: "1" ReturnCode: 2 Stdout: | CRITICAL: No sync peer selected | frequency= offset=nan peers=1 reach=0.000000 result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=
- MachineId: "2" ReturnCode: 2 Stdout: | CRITICAL: No sync peer selected | frequency= offset=nan peers=1 reach=0.000000 result=2 rootdelay= rootdisp= runtime= stratum= sync=0.000000 sysjitter= sysoffset= tracehosts= traceloops= tracetime=
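The output above follows the usual Nagios plugin layout: status text, a pipe, then key=value performance data. As a rough sketch of how the "OK: reachability is nan%" case on machine 0 could be caught, the snippet below parses one such line and flags NaN or zero reach. The helper names and the rule itself are hypothetical; they are not part of ntpmon or the charm.

```python
import math


def parse_plugin_line(line):
    """Split a Nagios-style plugin line into (status text, metrics dict)."""
    text, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.split():
        key, _, value = item.partition("=")
        # Empty values (e.g. "stratum=") are kept as None rather than guessed.
        metrics[key] = float(value) if value else None
    return text.strip(), metrics


def looks_unsynchronized(metrics):
    """Hypothetical rule: NaN/zero/missing reach or zero sync means no sync."""
    reach = metrics.get("reach")
    sync = metrics.get("sync")
    reach_bad = reach is None or math.isnan(reach) or reach == 0.0
    sync_bad = sync is None or sync == 0.0
    return reach_bad or sync_bad


# Plugin output as reported for machine 0 (without the YAML "Stdout: |" prefix).
line = ("OK: reachability is nan% | frequency= offset=nan peers=0 reach=nan "
        "result=0 rootdelay= rootdisp= runtime= stratum= sync=0.000000")
text, metrics = parse_plugin_line(line)
print(text, "->", "unsynchronized" if looks_unsynchronized(metrics) else "ok")
```

With a rule like this, machine 0 would be reported as unsynchronized even though the reach check itself returned OK.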

Tags: seg sts
tags: added: sts
Felipe Reyes (freyes)
tags: added: seg
removed: sts
Drew Freiberger (afreiberger) wrote :

Can you please provide the 'chronyc sources' or 'ntpq -p' output from the environment where you're seeing this issue? It appears that the validate_reach function is enabled by default.

Also please provide the output of:
/opt/ntpmon-ntp-charm/check_ntpmon.py --check reach
/opt/ntpmon-ntp-charm/check_ntpmon.py --check sync

Bryan Quigley (bryanquigley) wrote :

Updated the description with more details (for the non-Nagios case). It looks like we just need to make sure those two commands are actually reported via Juju/Nagios (in fact, verify_ntp_servers should be removed and replaced by them, IMHO).

I also tried a simple Nagios relation, but it just reported everything as fine from the ntp charm and failed to associate with ubuntu.

description: updated
Drew Freiberger (afreiberger) wrote :

I've been able to reproduce this issue with the following on a new model.

juju deploy ubuntu ubuntu-ntp-test
juju deploy ntp
juju config ntp pools=""
juju config ntp source="1.1.1.1"
juju deploy nrpe
juju deploy nagios --to lxd:0
juju add-relation ntp ubuntu-ntp-test
juju add-relation nrpe ubuntu-ntp-test
juju add-relation nrpe ntp
juju add-relation nagios nrpe

The Nagios view shows:

Current Status:
  OK
 (for 0d 0h 1m 1s)
Status Information: CRITICAL: offset is out of range (nan) - must be between -0.050000 and 0.050000
Performance Data: frequency=0.000000 offset=nan peers=1 reach=0.000000 result=0 rootdelay=1.000000 rootdisp=1.000000 runtime=152 stratum=0 sync=0.000000 sysjitter= sysoffset=0.000000000 tracehosts= traceloops= tracetime=

Tracing through the code, I see that the NTPAlerter class doesn't alert CRITICAL for at least 8 * 64 second cycles of the ntpd/chrony daemon being up and running. When I wait 10 minutes and tell nagios to refresh the check_ntpmon service on the host, it shows up as critical.

The code comment on the ntp charm's /opt/ntpmon-ntp-charm/alert.py file shows:

        Don't return anything other than OK until the NTP daemon has been running
        for at least enough time for 8 polling intervals of 64 seconds each. This
        prevents false positives due to restarts or short-lived VMs.

So, I am able to reproduce the no-error state, but it does turn to an error within 20 minutes (10-minute checks, plus 10 minutes of delay before the first potentially bad check). We will need to make sure Nagios validation is done well after the deploy has settled.
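To make the timing concrete, here is a minimal sketch of the warm-up gating described in the quoted comment, plus the worst-case delay it implies. It is not the actual alert.py code; the function names are made up for illustration, and the 600-second check interval is the 10-minute interval mentioned above.

```python
POLL_INTERVAL_SECONDS = 64   # from the quoted alert.py comment
POLLS_BEFORE_ALERTING = 8    # from the quoted alert.py comment
GRACE_SECONDS = POLL_INTERVAL_SECONDS * POLLS_BEFORE_ALERTING  # 512 s, ~8.5 min

OK, CRITICAL = 0, 2  # standard Nagios plugin return codes


def gated_result(raw_result, daemon_runtime_seconds):
    """Report OK during the warm-up window, the raw result afterwards."""
    if daemon_runtime_seconds < GRACE_SECONDS:
        return OK
    return raw_result


def worst_case_alert_delay(check_interval_seconds):
    """Upper bound on time from daemon start to the first non-OK result.

    A check that runs just before the grace period ends still returns OK,
    so the next chance to alert is one full check interval later.
    """
    return GRACE_SECONDS + check_interval_seconds


# runtime=152 matches the performance data in the Nagios view above,
# which is why the service showed OK with a CRITICAL status text.
print(gated_result(CRITICAL, 152))
print(gated_result(CRITICAL, 600))
print(worst_case_alert_delay(600) / 60, "minutes")
```

Under these assumptions the bound is 1112 seconds, a little under 19 minutes, which is consistent with the "within 20 minutes" estimate.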

I think this is working as designed and should be marked Won't Fix.

Additionally, I then set verify_ntp_servers=True, and the charm does go into a blocking state:
  ntp/0* blocked idle 10.0.8.108 123/udp NTP servers are not reachable: 1.1.1.1

This functionality is all working as intended.

Bryan Quigley (bryanquigley) wrote :

Interesting. We had a PCB cloud that did not have working NTP that had no Nagios alerts - it was operating for 1+ week before ceph started having significant issues.

In either case, I think we need juju status to show the errors at the same time.

Bryan Quigley (bryanquigley) wrote :

Now with Nagios set up, the NTP charm is actually reporting some problems to juju status:
  ntp/0* active idle 10.5.0.4 123/udp ntp: Ready, CRITICAL: offset is out of range (nan) - must be between -0.050000 and 0.050000

Does that functionality depend on NRPE/Nagios being set up? Can it be enabled by default when the NTP charm is used on its own?

tags: added: sts
Changed in ntp-charm:
status: New → Invalid