MAAS NTP config on region/rack controllers seems to be including MAAS peers even when "Use external NTP servers only" is checked

Bug #1939901 reported by Paul Goins
This bug affects 2 people
Affects  Status   Importance  Assigned to  Milestone
MAAS     Invalid  High        Unassigned   (none)

Bug Description

For a particular environment, we have NTP configured across the cloud via charm-ntp to point directly at 2 customer-controlled NTP servers. The intent is for this to apply across the entire cloud, including the MAAS nodes. That leaves 2 sources of truth for NTP config: MAAS and charm-ntp. This is fine as long as the end result of either is effectively the same; unfortunately, it's not. MAAS is adding peer addresses which we don't want.

We have 3 MAAS nodes, all of which are both region and rack controllers. We've specified the customer's NTP servers and checked the "Use external NTP servers only" checkbox.

What we see is that when MAAS rewrites the chrony config, /etc/chrony/maas.conf includes both the customer's upstream servers and peer directives pointing at the other MAAS nodes. We don't want the peer references, and I wouldn't have expected them with the "Use external NTP servers only" checkbox checked.
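For illustration, the generated file looks roughly like this (a sketch with invented addresses, not verbatim MAAS output; the peer lines are the part we don't expect):

------------------
# Customer's upstream NTP servers (expected).
server 10.1.0.1 iburst
server 10.1.0.2 iburst

# Peers pointing at the other MAAS controllers (unexpected with
# "Use external NTP servers only" checked).
peer 10.0.0.4
peer 10.0.0.101
------------------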

Is there a way to have MAAS write config with only the upstream servers, and without the peers?

Tags: trivial
Revision history for this message
Björn Tillenius (bjornt) wrote :

Yes, I think you're right. If you specify that there should be only external servers, there shouldn't be any peers configured.

The relevant code is in src/maasserver/ntp.py. get_peers_for() needs to check _ntp_external_only().

I don't think there's any workaround, but the fix should be fairly simple.
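A minimal sketch of what that check might look like, written against the names above (the real signatures and helpers in MAAS may differ; _get_peer_addresses_for() is a hypothetical stand-in for the existing peer-selection logic):

------------------
# src/maasserver/ntp.py (sketch only, not actual MAAS source)
def get_peers_for(node):
    """Return the NTP peer addresses to configure on this controller."""
    if _ntp_external_only():
        # "Use external NTP servers only" is checked: configure no peers.
        return frozenset()
    # Otherwise fall through to the existing peer-selection logic.
    return _get_peer_addresses_for(node)
------------------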

tags: added: trivial
Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → next
Revision history for this message
Björn Tillenius (bjornt) wrote :

On second thought, adding the MAAS nodes as peers actually does seem to make sense. That way, if the upstream NTP servers went down, the MAAS nodes would still stay in sync with each other.

What kind of problems are you seeing this actually causing?

Changed in maas:
milestone: next → none
status: Triaged → Incomplete
Revision history for this message
Paul Goins (vultaire) wrote :

I'm not sure that there was an actual problem here; the "bug" may have been due to my misunderstanding of how NTP servers vs. peers work. Honestly, I still don't really grok it at this point.

We did have an NTP-related customer issue, and this issue was found around the same time, so it was thought the two might be related. The customer was getting intermittent alerts from charm-ntp's NRPE check (e.g. "CRITICAL: offset is out of range (-0.064120) - must be between -0.050000 and 0.050000"). Honestly, I'm unsure whether this was due to a delta between the 2 upstream NTP servers they were using, an out-of-sync clock on one of the MAAS nodes, or something else entirely.

I honestly didn't (and still don't) fully grok the difference in usage between NTP servers and NTP peers. If peers are purely a fallback for when the upstream servers can't be reached, then yes, the current behavior sounds desirable, and if the alert was caused by peers being out of sync, then perhaps it's charm-ntp's job to make that clearer. (Granted, this is hypothetical; I haven't looked at this problem in some time.)

If this makes sense, then perhaps this ticket can be closed as Invalid, since the behavior is believed to be as expected from MAAS's perspective?

Best Regards,
Paul Goins

Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

An NTP server is an authoritative source of time, often directly connected to a high-precision source (e.g. a GPS antenna). Its precision is indicated by its stratum number (lower is better). Stratum zero (S0) clocks cannot be accessed over the Internet, so S1 is the best we can use.

An NTP peer can be any system running an NTP daemon. While it's connected to a server (normal operation), it gets the correct time from there and the peer connections are not used. In the absence of a server, peers slowly converge to the same time, using stored drift data to figure out the correct time.
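As a rough chrony illustration of the distinction (hostnames here are placeholders):

------------------
# Servers: authoritative upstream time sources, followed in normal operation.
server ntp1.example.com iburst
server ntp2.example.com iburst

# Peer: another nearby chronyd of similar quality; it only becomes relevant
# when the servers above are unreachable, letting local machines converge.
peer ntp-peer.example.internal
------------------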

The current MAAS behaviour seems reasonable, so I agree this ticket can be closed.

I suggest adding a third NTP server to the customer setup, because with only 2 sources the protocol can't tell which one is wrong when they disagree, and so doesn't converge. Also review the servers being used: avoid mixing different strata, and prefer geographically close servers.

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
Paul Goins (vultaire) wrote :

I'd like to re-open this.

I am now seeing concrete issues on a live customer deployment, and not having this option is causing us problems.

Here is an example server, with IPs modified for anonymity, showing "chronyc -n sources -v" output:

------------------
$ sudo chronyc -n sources -v
210 Number of sources = 4

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |           \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
=- 10.0.0.4                      5   6   377   479  -2631us[-2631us] +/-  333ms
=? 10.0.0.101                    6  10     0  202d  -6774ms[+1674us] +/-  618ms
^- 10.1.0.1                      4   6   377    48  -8092us[-8092us] +/-  337ms
^* 10.1.0.2                      4  10   377   450    -18ms[-6504us] +/-  373ms
------------------

10.0.0.101 is one of the peers. For some reason, its LastRx field is clearly problematic. It might be because 10.0.0.101 is a virtual IP attached to that host rather than the primary IP of the adapter chosen for the peer configuration; I'm honestly not sure. Regardless, there seems to be some sort of problem syncing via the peers, and it's manifesting as problems for consumers of the MAAS machines' NTP time. We're seeing alerts with offset deltas that sometimes exceed 10 seconds because of this.

Based on the above paste, the upstreams look essentially in sync. If MAAS could optionally use *only* the upstreams, we could work around a wonky peer problem like this one. However, MAAS doesn't allow that, which is exactly this bug.

At the risk of going on a tangent, the peer IP selection logic *might* be unpredictable. The environment I'm working on runs MAAS 2.8.9, so maybe this has changed in newer versions. After checking the checkbox this bug refers to and restarting rackd/regiond on one of the controllers (/etc/chrony/*.conf wasn't updated until we did), we got an unexpected "fix": the peer clauses were still in /etc/chrony/maas.conf, but they had changed to use IPs from a different adapter, and the NTP connections on those adapters worked without problems. The MAAS controllers synced their time successfully, we no longer have large offsets between the peers, and the alerts on the consumers all resolved.
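For reference, the restart was along these lines (these are the service names on a Debian-package install of MAAS; snap installs differ):

------------------
sudo systemctl restart maas-rackd
sudo systemctl restart maas-regiond
------------------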

I won't go further than to point the above out, as it may be its own bug (if MAAS doesn't deterministically select a network adapter and IP to use for the NTP peer clauses). I'm just pointing...


Changed in maas:
status: Invalid → New
Revision history for this message
Paul Goins (vultaire) wrote :

I've dug into this more on a different cloud, and if I understand things correctly, the chrony config is written such that the peers shouldn't even come into play unless the upstreams become unavailable, thanks to the use of e.g. "local stratum 8 orphan" in the chrony config.
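Simplified, the relevant parts look like this (a sketch, not verbatim MAAS output):

------------------
# Upstream servers are preferred whenever they are reachable.
server 10.1.0.1 iburst
server 10.1.0.2 iburst

# Peers with the other MAAS controllers.
peer 10.0.0.4

# Orphan mode: only if all upstream sources become unreachable do the
# peers elect a local reference at stratum 8 and keep each other in sync.
local stratum 8 orphan
------------------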

So, I'll back off on this. I still question the wording in the UI here, but perhaps it is better to keep the peer entries, given the local directive mentioned above.

Changed in maas:
status: New → Invalid