100.114 "NTP configuration does not contain any valid or reachable NTP servers." major alarm not issued when no NTP sources

Bug #1928347 reported by Takamasa Takenaka
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Takamasa Takenaka

Bug Description

Brief Description
-----------------
On a standard system, the 100.114 major alarm "NTP configuration does not contain any valid or reachable NTP servers." is not issued when no NTP sources are reachable.

Severity
--------
Major

Steps to Reproduce
------------------
1. on standby controller (controller-1), block the CIDR for one of the NTP servers
2. minor reachability alarm for NTP server generated (as expected)
3. on standby controller, block the CIDR for two other NTP servers (no more external sources)
4. minor reachability alarms for NTP servers generated (as expected)
5. minor alarm for no external sources, syncing with peer controller issued (as expected)
6. swact to controller-1
7. lock and power off controller-0
8. ntpq shows no selected NTP sources, but the major NTP alarm is not issued (==> Issue-1)
9. ntpq shows no reachability to the mate source; still no major NTP alarm issued
10. the system sat in this state overnight and still neither cleared the "syncing with peer" alarm (==> Issue-2) nor issued the major NTP alarm for no reachable servers

Expected Behavior
------------------
When there is no reachable NTP server:
1. Major alarm "NTP configuration does not contain any valid or reachable NTP servers." should be raised.
2. Minor alarm "NTP cannot reach external time source; syncing with peer controller only" should be suppressed.

Actual Behavior
----------------
When there is no reachable NTP server:
1. Major alarm "NTP configuration does not contain any valid or reachable NTP servers." is not raised.
2. Minor alarm "NTP cannot reach external time source; syncing with peer controller only" stays.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Branch/Pull Time/Commit
-----------------------
stx4 as of 2020-06-27 18:37:38 -0400

Last Pass
---------
Not known

Timestamp/Logs
--------------
There are two issues.
Issue-1:
The following alarm did not clear for more than 10 hours even though no peer controller was available.
2021-03-30T03:05:42.000 controller-1 fmManager: info

{ "event_log_id" : "100.114", "reason_text" : "NTP cannot reach external time source; syncing with peer controller only", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:05:42.218814" }
Issue-2:
Even though all NTP sources were unavailable as of 2021-03-30T03:05:42.000, the 100.114 major alarm "NTP configuration does not contain any valid or reachable NTP servers." was not issued.

2021-03-30T03:00:42.000 controller-1 fmManager: info

{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:9200::b is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:9200::b", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:00:42.138096" }
2021-03-30T03:00:42.000 controller-1 fmManager: info

{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:8200::a is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:8200::a", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:00:42.139967" }
2021-03-30T03:05:42.000 controller-1 fmManager: info

{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:9200::a is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:9200::a", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:05:42.176966" }
Timestamp when failure occurred:

2021-03-30T03:00:42.000 controller-0 collectd[3468]: err NTP query plugin 'set_fault' exception ; 100.114:host=controller-0.ntp=2607:f160:10:9200::a:minor ; Failed to execute set_fault.
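
The fmManager records above are plain JSON, so the observation that only minor 100.114 events were set can be checked mechanically. The following is a minimal sketch, assuming each record has already been extracted from the log as one JSON string; the function name `summarize_ntp_alarms` is hypothetical, and the sample data is copied from the log excerpt above.

```python
import json
from collections import Counter

# Records copied from the fmManager log excerpt in this report.
RAW_EVENTS = [
    '{ "event_log_id" : "100.114", "reason_text" : "NTP cannot reach external time source; syncing with peer controller only", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:05:42.218814" }',
    '{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:9200::b is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:9200::b", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:00:42.138096" }',
    '{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:8200::a is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:8200::a", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:00:42.139967" }',
    '{ "event_log_id" : "100.114", "reason_text" : "NTP address 2607:f160:10:9200::a is not a valid or a reachable NTP server.", "entity_instance_id" : "region=RegionOne.system=central-region-2.host=controller-0.ntp=2607:f160:10:9200::a", "severity" : "minor", "state" : "set", "timestamp" : "2021-03-30 03:05:42.176966" }',
]


def summarize_ntp_alarms(raw_events):
    """Count 'set' events for alarm ID 100.114, grouped by severity."""
    counts = Counter()
    for raw in raw_events:
        event = json.loads(raw)
        if event["event_log_id"] == "100.114" and event["state"] == "set":
            counts[event["severity"]] += 1
    return counts


summary = summarize_ntp_alarms(RAW_EVENTS)
print("minor:", summary["minor"], "major:", summary["major"])
```

For the records shown here, all four events are minor and no major event is present, which matches the reported behavior.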

Test Activity
-------------
Evaluation

Workaround
----------
N/A

Tags: stx.config
Takamasa Takenaka (ttakenak) wrote :

This bug is reproducible.

1. Create a state in which the peer is selected and the external servers are unreachable:

controller-1:~$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.204.2
                 206.108.0.132    2 u    4   64  377    0.090   -9.015   1.158
 172.217.13.142
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 74.6.143.26
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000

2. Confirm that the alarm "NTP cannot reach external time source; syncing with peer controller only" and two "NTP address [ip] is not a valid or a reachable NTP server." alarms are present:

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+-------+------------------------------+--------------------------------------+----------+-------------+
| Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| ID | | | | |
+-------+------------------------------+--------------------------------------+----------+-------------+
| 100. | NTP address 74.6.143.26 is | host=controller-1.ntp=74.6.143.26 | minor | 2021-05-06T |
| 114 | not a valid or a reachable | | | 13:30:06. |
| | NTP server. | | | 911250 |
| | | | | |
| 100. | NTP address 172.217.13.142 | host=controller-1.ntp=172.217.13.142 | minor | 2021-05-06T |
| 114 | is not a valid or a | | | 13:30:06. |
| | reachable NTP server. | | | 908394 |
| | | | | |
| 100. | NTP cannot reach external | host=controller-1.ntp | minor | 2021-05-06T |
| 114 | time source; syncing with | | | 13:25:06. |
| | peer controller only | | | 967746 |

3. Swact controller-0, lock controller-0, power off controller-0, and wait until no NTP server is selected:

[sysadmin@controller-1 ~(keystone_admin)]$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.204.2
                 206.108.0.132    2 u  758   64    0    0.001   -8.771   0.000
 172.217.13.142
                 .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 74.6.143.26
                 .INIT.          16 u    - 1024    0    0.000    0.000   0.000

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list | grep 100.114
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2021-05-06T13:55:06...
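
Whether any source is selected or reachable can be read directly from the `ntpq -np` output: a leading `*` marks the selected source, and `reach` is an octal register that is 0 when no recent poll was answered. The following is a minimal sketch of such a check, assuming the two-line-per-peer layout shown in this report (it also accepts the one-line layout); the function name `parse_ntpq` is hypothetical and this is not the monitoring plugin's code.

```python
# Sample copied from step 3 above (no source selected, all reach values 0).
NTPQ_OUTPUT = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.204.2
                 206.108.0.132    2 u  758   64    0    0.001   -8.771   0.000
 172.217.13.142
                 .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 74.6.143.26
                 .INIT.          16 u    - 1024    0    0.000    0.000   0.000
"""


def parse_ntpq(output):
    """Return one dict per peer with its address, selection and reachability."""
    lines = output.splitlines()
    # Skip the header, up to and including the "=====" separator line.
    start = next(i for i, line in enumerate(lines) if line.startswith("=")) + 1
    peers = []
    pending = None  # (tally, remote) waiting for its detail line
    for line in lines[start:]:
        if not line.strip():
            continue
        tally, fields = line[0], line[1:].split()
        if len(fields) == 1:
            # Address-only line; the statistics follow on the next line.
            pending = (tally, fields[0])
            continue
        if len(fields) == 9 and pending is not None:
            tally, remote = pending
            pending = None
        elif len(fields) == 10:
            remote, fields = fields[0], fields[1:]
        else:
            continue  # unrecognized line, ignore it
        # fields: refid, st, t, when, poll, reach, delay, offset, jitter
        peers.append({
            "remote": remote,
            "selected": tally == "*",
            "reachable": int(fields[5], 8) != 0,  # reach is printed in octal
        })
    return peers


peers = parse_ntpq(NTPQ_OUTPUT)
for peer in peers:
    print(peer)
```

For the sample above, no peer is selected and none is reachable, which is the state in which the major alarm was expected.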


Changed in starlingx:
assignee: nobody → Takamasa Takenaka (ttakenak)
status: New → In Progress
Takamasa Takenaka (ttakenak) wrote :

[Root Cause]
When the peer is selected, the script raises the minor alarm
"NTP cannot reach external time source; syncing with peer controller only."
Once this minor alarm is set, the major alarm "NTP configuration does not contain any valid or reachable NTP servers."
is never raised.
In addition, the peer-selected flag is not cleared when the peer is selected but later becomes unreachable.
As a result, the minor alarm is not cleared after the peer becomes unreachable, and the major alarm is not raised.
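
The effect of a flag that is set but never cleared can be illustrated with a small model. This is only a sketch of the behavior described in the root cause, not the plugin's actual code; the class `StickyPeerFlagModel`, its `audit` method, and its parameters are hypothetical.

```python
MINOR_PEER_ONLY = "NTP cannot reach external time source; syncing with peer controller only"
MAJOR_NO_SERVERS = "NTP configuration does not contain any valid or reachable NTP servers."


class StickyPeerFlagModel:
    """Simplified, hypothetical model of the faulty alarm logic."""

    def __init__(self):
        self.peer_selected = False

    def audit(self, peer_selected_now, reachable_servers):
        # Defect being modelled: the flag is set here but never cleared.
        if peer_selected_now:
            self.peer_selected = True
        if self.peer_selected:
            # Once this path is taken, the major alarm is never considered.
            return [("minor", MINOR_PEER_ONLY)]
        if reachable_servers == 0:
            return [("major", MAJOR_NO_SERVERS)]
        return []


model = StickyPeerFlagModel()
# Audit 1: the peer is selected and is the only reachable source.
first = model.audit(peer_selected_now=True, reachable_servers=1)
# Audit 2: the peer has been powered off, so nothing is reachable.
second = model.audit(peer_selected_now=False, reachable_servers=0)
print(first)
print(second)
```

In this model the second audit still reports only the minor alarm, even though no server is reachable, which mirrors Issue-1 and Issue-2.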

[Overview Design]
In bug LP:1889101, we decided to remove the minor alarm
"NTP cannot reach external time source; syncing with peer controller only."
because NTP does not prioritize external time sources over the peer.

The expected behavior with the patch for LP:1889101 is:
When no NTP server is reachable: major alarm "NTP configuration does not contain any valid or reachable NTP servers."
When the peer is selected with a reliable upstream NTP server: no major alarm.
When an external server is unreachable: minor alarm "NTP address [server ip] is not a valid or a reachable NTP server."

Under the revised spec, the patch for LP:1889101 also fixes this issue, because the minor alarm is removed and the alarm logic no longer uses a peer flag.
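
The three rules of the revised spec can be written as a stateless decision function. This is a minimal sketch of the expected behavior only, not the code in the patch; the function `expected_alarms` and its parameters `external_reachable` and `peer_usable` (meaning the peer is selected with a reliable upstream server) are hypothetical names.

```python
MAJOR_NO_SERVERS = "NTP configuration does not contain any valid or reachable NTP servers."
MINOR_SERVER = "NTP address {} is not a valid or a reachable NTP server."


def expected_alarms(external_reachable, peer_usable):
    """Return (severity, reason) pairs for one audit.

    external_reachable: dict mapping external server address -> bool
    peer_usable: True when the peer controller is a usable time source
    """
    alarms = []
    # Major alarm only when neither the peer nor any external server is usable.
    if not peer_usable and not any(external_reachable.values()):
        alarms.append(("major", MAJOR_NO_SERVERS))
    # One minor alarm per unreachable external server.
    for address, reachable in external_reachable.items():
        if not reachable:
            alarms.append(("minor", MINOR_SERVER.format(address)))
    return alarms


servers = {"172.217.164.238": False, "74.6.143.26": False}
no_source = expected_alarms(servers, peer_usable=False)
peer_only = expected_alarms(servers, peer_usable=True)
for severity, reason in no_source:
    print(severity, reason)
```

Because nothing is remembered between audits, a peer that later becomes unreachable cannot leave a stale state behind.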

Takamasa Takenaka (ttakenak) wrote :

Here is the link for patch:
https://review.opendev.org/c/starlingx/monitoring/+/787588

** This patch was created for LP:1889101, but it also fixes this issue.

Takamasa Takenaka (ttakenak) wrote :

[TEST RESULT stx:master (IPv4)]

1. No NTP server is connected on controller-1

controller-1:~$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 192.168.204.2
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 172.217.164.238
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 74.6.143.26
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1 | major | 2021-05-10T14:07:38 |
| | | | | .846405 |
| | | | | |
| 100.114 | NTP address 74.6.143.26 is not a valid or a reachable NTP server. | host=controller-1=74.6.143.26 | minor | 2021-05-10T14:07:38 |
| | | | | .676781 |
| | | | | |
| 100.114 | NTP address 172.217.164.238 is not a valid or a reachable NTP server. | host=controller-1=172.217.164.238 | minor | 2021-05-10T14:07:38 |
| | | | | .619920 |
==> One major alarm and two minor alarms are raised as expected.

2. The peer is selected but the other servers are not reachable on controller-1

controller-1:~$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.204.2
                 209.115.181.102  3 u   27   64   76    0.185    3.546   3.505
 172.217.164.238
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 74.6.143.26
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------------+-----------------------...


Takamasa Takenaka (ttakenak) wrote :

[TEST RESULT: stx:master (IPv6)]

1. No NTP server is connected on controller-1

controller-1:~$ ntpq -np
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 face::2
                 41.239.70.93     3 u   15   64    3    0.195   -4.035   0.274
 64:ff9b::acd9:dae
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 2001:4998:44:3507::8000
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000
 64:ff9b::cdfb:f267
                 .INIT.          16 u    -   64    0    0.000    0.000   0.000

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+-------------------------------------------------------------------------------+------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------+------------------------------+----------+---------------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1 | major | 2021-05-10T20:02:04 |
| | | | | .282939 |
| | | | | |
| 100.114 | NTP address 64:ff9b::cdfb:f267 is not a valid or a reachable NTP server. | host=controller-1=64:ff9b:: | minor | 2021-05-10T20:02:04 |
| | | cdfb:f267 | | .239429 |
| | | | | |
| 100.114 | NTP address 2001:4998:44:3507::8000 is not a valid or a reachable NTP server. | host=controller-1=2001:4998: | minor | 2021-05-10T20:02:04 |
| | | 44:3507::8000 | | .195559 |
| | | | | |
| 100.114 | NTP address 64:ff9b::acd9:dae is not a valid or a reachable NTP server. | host=controller-1=64:ff9b:: | minor | 2021-05-10T20:02:04 |
| | | acd9:dae | | .151696 |
| | | | | |
+----------+-------------------------------------------------------------------------------+------------------------------+----------+------...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.config