Unexpected NTP alarms seen sometimes on IPv6 system

Bug #1851043 reported by Yang Liu
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Kristine Bujold

Bug Description

Brief Description
-----------------
Following alarm against controller-1 seems to be invalid.
Is is seen when controller-1 ntpq reports healthy state with ntp server "*64:ff9b::ce6c:83 .PTP0." selected.

Suspect this is due to the special refid - .PTP0

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-0.ntp | minor | 2019-11-02T14:04:05. |
| | | | | 907029 |
| | | | | |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-11-02T03:03:25. |
| | | | | 836876 |
| | | | | |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+

     remote refid st t when poll reach delay offset jitter
==============================================================================
+abcd:204::2
                 232.178.14.219 3 u 335 1024 377 0.084 0.607 0.352
*64:ff9b::ce6c:83
                 .PTP0. 1 u 638 1024 377 8.263 -0.172 0.120
+64:ff9b::c7b6:ccc5
                 219.119.208.14 2 u 400 1024 377 76.361 0.814 0.161
+64:ff9b::8180:c14
                 172.30.248.10 2 u 403 1024 377 47.996 -1.377 0.182

Severity
--------
Major

Steps to Reproduce
------------------
* Not sure how to reproduce it, because the ntp server selected (sudo ntpq -pn) is not controllable.
- Install and config
- Run sanity that involves lock/unlock, swact, host reboot, etc
- Observe ntp alarms

Expected Behavior
------------------
- Valid or no ntp alarms

Actual Behavior
----------------
- Unexpected NTP alarm when a good ntp server is returned by the pool

# valid ntp server:
Sat Nov 2 14:18:28 UTC 2019
*64:ff9b::ce6c:83
                 .PTP0. 1 u 638 1024 377 8.263 -0.172 0.120

# unexpected stale alarm that does not clear
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-11-02T03:03:25.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system, IPv6
Lab-name:
WCP78-79

Branch/Pull Time/Commit
-----------------------
20191031

Last Pass
---------
Not sure, because the ntp server selected is random.

Timestamp/Logs
--------------
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-11-02T03:03:25.

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Kristine to triage

tags: added: stx.config
Changed in starlingx:
assignee: nobody → Kristine Bujold (kbujold)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.3.0 / medium priority - requires further investigation

tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I have not looked at the conditions surrounding the first failure and if that is transient due to swact.
If error is transient we should consider debouncing the alarm assertion for this error condition.

Revision history for this message
Kristine Bujold (kbujold) wrote :
Download full text (3.8 KiB)

I have tested a host-swact and it looks like after a swact the ntpq looses its reachability statuses.

Controller-0:
     remote refid st t when poll reach delay offset jitter
==============================================================================
+face::3
                 107.242.40.132 3 u 588 1024 377 0.089 1.210 0.265
+64:ff9b::3627:1740
                 192.99.2.172 3 u 409 1024 377 15.674 -0.118 0.758
*64:ff9b::c063:2ac
                 213.251.128.249 2 u 795 1024 377 15.055 0.259 1.224
+64:ff9b::a29f:c801
                 10.14.8.68 3 u 657 1024 377 18.819 0.097 0.937

After the swact

     remote refid st t when poll reach delay offset jitter
==============================================================================
 face::3
                 127.0.0.1 12 u 586 1024 7 0.065 1.160 0.034
 64:ff9b::3627:1740
                 192.99.2.172 3 u 611 1024 7 14.417 0.016 0.439
 64:ff9b::c063:2ac
                 213.251.128.249 2 u 677 1024 3 15.656 0.239 0.070
 64:ff9b::a29f:c801
                 10.14.8.68 3 u 613 1024 7 18.327 0.247 0.351

Controller-1:

     remote refid st t when poll reach delay offset jitter
==============================================================================
+face::2
                 143.161.167.89 3 u 428 1024 377 0.104 -1.132 0.255
+64:ff9b::cdce:4607
                 206.108.0.131 2 u 342 1024 377 43.230 -0.699 0.425
*64:ff9b::23b7:39a9
                 206.108.0.131 2 u 483 1024 377 10.450 1.795 0.385
+64:ff9b::9538:2f3c
                 206.108.0.132 2 u 123 1024 377 14.716 1.211 0.879

After swact

     remote refid st t when poll reach delay offset jitter
==============================================================================
 face::2
                 127.0.0.1 12 u 27 64 1 0.154 -1.134 0.000
 64:ff9b::cdce:4607
                 206.108.0.131 2 u 27 64 1 43.220 -0.894 0.000
 64:ff9b::23b7:39a9
                 206.108.0.131 2 u 29 64 1 10.626 1.972 0.000
 64:ff9b::9538:2f3c
                 206.108.0.132 2 u 29 64 1 15.316 0.934 0.000

from daemon.log on controller-1

2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #16 eno1, 2620:10a:a001:a103::1218#123, interface stats: received=422, sent=422, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::a29f:c801 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::c063:2ac interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::3627:1740 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #14 vlan116, feed:beef::1#123, interface stats: received=0, sent=0, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #4 ...

Read more...

Revision history for this message
Kristine Bujold (kbujold) wrote :

With a swact the oam interface changes. This must cause the ntpq to reset the reachability.

With unlocks/reboots the ntpq needs time to determine reachabilty. Our current ntpq audit is 10 minutes.

We could audit more frequently and/or use debouncing before raising alarm. Debouncing would reduce ntp alarm noise.

Revision history for this message
Yang Liu (yliu12) wrote :

If I remember correctly, this alarm did not go away. There is a separate LP for swact issue, where alarms go away after 20 minutes or so.

In this specific LP, I suspect the following selected NTP server is not recognized as a valid server in our code:
*64:ff9b::ce6c:83
                 .PTP0. 1 u 638 1024 377 8.263 -0.172 0.120

Is it possible to create a unit test using above server or check if the code parses the Refid?

Revision history for this message
Kristine Bujold (kbujold) wrote :
Download full text (4.1 KiB)

I am seeing errors in the daemon.log regarding raising and clearing the NTP alarms "Failed to execute clear_fault"

2019-11-02T03:13:23.877 controller-1 collectd[113678]: info WARNING:root:fm_python_extension: Failed to connect to FM manager
2019-11-02T03:13:23.880 controller-1 collectd[113678]: info NTP query plugin 'clear_fault' exception ; 100.114:host=controller-1.ntp ; Failed to execute clear_fault.
2019-11-02T03:13:23.884 controller-1 collectd[113678]: info NTP query plugin 100.114:host=controller-1.ntp=64:ff9b::ce6c:83 alarm cleared
2019-11-02T03:13:23.885 controller-1 collectd[113678]: info NTP query plugin selected server changed from 'None' to '64:ff9b::ce6c:83'
2019-11-02T03:13:23.885 controller-1 collectd[113678]: info NTP query plugin reachable servers: ['64:ff9b::ce6c:83']

The alarm for "64:ff9b::ce6c:83" appears to eventually get cleared 20 minutes later

fm-manager.log:2019-11-02T03:53:23.836 fmMsgServer.cpp(398): Raising Alarm/Log, (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83)
fm-manager.log:2019-11-02T03:53:23.838 fmMsgServer.cpp(421): Alarm created/updated: (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83) (2) (8606362d-0906-4610-b4df-983d11f1eb9d)

fm-manager.log:2019-11-02T04:13:23.837 fmMsgServer.cpp(494): Deleted alarm: (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83)

But also I also see these generic NTP alarms

cat alarms.info
--------------------------------------------------------------------
Sat Nov 2 14:21:24 UTC 2019 : : fm alarm-list
--------------------------------------------------------------------
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-0.ntp | minor | 2019-11-02T14:14:05. |
| | | | | 895979 |
| | | | | |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-11-02T03:03:25. |
| | | | | 836876 |
| | | | | |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+

These should of been cleared based on the ntpq output

2019-11-02T14:13:23.807 controller-1 collectd[113678]: info NTP query plugin server list: ['0.pool.ntp.org', ...

Read more...

Revision history for this message
Yang Liu (yliu12) wrote :

Yes you are right, the stuck alarms were the generic ones complaining about no valid/reachable servers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695581

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/695581
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=c0b2838e988050096b89c176428b5e8e0142429c
Submitter: Zuul
Branch: master

commit c0b2838e988050096b89c176428b5e8e0142429c
Author: Kristine Bujold <email address hidden>
Date: Thu Nov 21 13:05:04 2019 -0500

    Fix alarm issues with NTP

    This submit fixes a bug that would prevent NTP alarms clears from
    being retried on subsequent audits if the FM manager was returning
    an exception.

    Closes-bug: 1851043
    Change-Id: I3dc962ddcf78f114b8b361ab641eb11b0ef461e4
    Signed-off-by: Kristine Bujold <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

This issue is not seen in recent sanity.

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.