StarlingX

Unexpected NTP alarms seen sometimes on IPv6 system

Bug #1851043 reported by Yang Liu on 2019-11-02

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Kristine Bujold

Bug Description

Brief Description
-----------------
Following alarm against controller-1 seems to be invalid.
Is is seen when controller-1 ntpq reports healthy state with ntp server "*64:ff9b::ce6c:83 .PTP0." selected.

Suspect this is due to the special refid - .PTP0

     remote refid st t when poll reach delay offset jitter
==============================================================================
+abcd:204::2
                 232.178.14.219 3 u 335 1024 377 0.084 0.607 0.352
*64:ff9b::ce6c:83
                 .PTP0. 1 u 638 1024 377 8.263 -0.172 0.120
+64:ff9b::c7b6:ccc5
                 219.119.208.14 2 u 400 1024 377 76.361 0.814 0.161
+64:ff9b::8180:c14
                 172.30.248.10 2 u 403 1024 377 47.996 -1.377 0.182

Severity
--------
Major

Steps to Reproduce
------------------
* Not sure how to reproduce it, because the ntp server selected (sudo ntpq -pn) is not controllable.
- Install and config
- Run sanity that involves lock/unlock, swact, host reboot, etc
- Observe ntp alarms

Expected Behavior
------------------
- Valid or no ntp alarms

Actual Behavior
----------------
- Unexpected NTP alarm when a good ntp server is returned by the pool

# valid ntp server:
Sat Nov 2 14:18:28 UTC 2019
*64:ff9b::ce6c:83
.PTP0. 1 u 638 1024 377 8.263 -0.172 0.120

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system, IPv6
Lab-name:
WCP78-79

Branch/Pull Time/Commit
-----------------------
20191031

Last Pass
---------
Not sure, because the ntp server selected is random.

Test Activity
-------------
Sanity

Tags:

Revision history for this message

Yang Liu (yliu12) wrote on 2019-11-02:

LP1851043.tar Edit (42.5 MiB, application/x-tar)

Yang Liu (yliu12) on 2019-11-04

tags:

added: stx.retestneeded

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-11-04:

Assigning to Kristine to triage

tags:	added: stx.config
Changed in starlingx:
assignee:	nobody → Kristine Bujold (kbujold)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-11-13:

stx.3.0 / medium priority - requires further investigation

tags:	added: stx.3.0
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged

Revision history for this message

Eric MacDonald (rocksolidmtce) wrote on 2019-11-20:

I have not looked at the conditions surrounding the first failure and if that is transient due to swact.
If error is transient we should consider debouncing the alarm assertion for this error condition.

Revision history for this message

Kristine Bujold (kbujold) wrote on 2019-11-20:

Download full text (3.8 KiB)

I have tested a host-swact and it looks like after a swact the ntpq looses its reachability statuses.

Controller-0:
     remote refid st t when poll reach delay offset jitter
==============================================================================
+face::3
                 107.242.40.132 3 u 588 1024 377 0.089 1.210 0.265
+64:ff9b::3627:1740
                 192.99.2.172 3 u 409 1024 377 15.674 -0.118 0.758
*64:ff9b::c063:2ac
                 213.251.128.249 2 u 795 1024 377 15.055 0.259 1.224
+64:ff9b::a29f:c801
                 10.14.8.68 3 u 657 1024 377 18.819 0.097 0.937

After the swact

     remote refid st t when poll reach delay offset jitter
==============================================================================
face::3
                 127.0.0.1 12 u 586 1024 7 0.065 1.160 0.034
64:ff9b::3627:1740
                 192.99.2.172 3 u 611 1024 7 14.417 0.016 0.439
64:ff9b::c063:2ac
                 213.251.128.249 2 u 677 1024 3 15.656 0.239 0.070
64:ff9b::a29f:c801
                 10.14.8.68 3 u 613 1024 7 18.327 0.247 0.351

Controller-1:

     remote refid st t when poll reach delay offset jitter
==============================================================================
+face::2
                 143.161.167.89 3 u 428 1024 377 0.104 -1.132 0.255
+64:ff9b::cdce:4607
                 206.108.0.131 2 u 342 1024 377 43.230 -0.699 0.425
*64:ff9b::23b7:39a9
                 206.108.0.131 2 u 483 1024 377 10.450 1.795 0.385
+64:ff9b::9538:2f3c
                 206.108.0.132 2 u 123 1024 377 14.716 1.211 0.879

After swact

     remote refid st t when poll reach delay offset jitter
==============================================================================
face::2
                 127.0.0.1 12 u 27 64 1 0.154 -1.134 0.000
64:ff9b::cdce:4607
                 206.108.0.131 2 u 27 64 1 43.220 -0.894 0.000
64:ff9b::23b7:39a9
                 206.108.0.131 2 u 29 64 1 10.626 1.972 0.000
64:ff9b::9538:2f3c
                 206.108.0.132 2 u 29 64 1 15.316 0.934 0.000

from daemon.log on controller-1

2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #16 eno1, 2620:10a:a001:a103::1218#123, interface stats: received=422, sent=422, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::a29f:c801 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::c063:2ac interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::3627:1740 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #14 vlan116, feed:beef::1#123, interface stats: received=0, sent=0, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #4 ...

I have tested a host-swact and it looks like after a swact the ntpq looses its reachability statuses.

Controller-0:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+face::3
                 107.242.40.132   3 u  588 1024  377    0.089    1.210   0.265
+64:ff9b::3627:1740
                 192.99.2.172     3 u  409 1024  377   15.674   -0.118   0.758
*64:ff9b::c063:2ac
                 213.251.128.249  2 u  795 1024  377   15.055    0.259   1.224
+64:ff9b::a29f:c801
                 10.14.8.68       3 u  657 1024  377   18.819    0.097   0.937

After the swact

remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 face::3
                 127.0.0.1       12 u  586 1024    7    0.065    1.160   0.034
 64:ff9b::3627:1740
                 192.99.2.172     3 u  611 1024    7   14.417    0.016   0.439
 64:ff9b::c063:2ac
                 213.251.128.249  2 u  677 1024    3   15.656    0.239   0.070
 64:ff9b::a29f:c801
                 10.14.8.68       3 u  613 1024    7   18.327    0.247   0.351

Controller-1:

remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+face::2
                 143.161.167.89   3 u  428 1024  377    0.104   -1.132   0.255
+64:ff9b::cdce:4607
                 206.108.0.131    2 u  342 1024  377   43.230   -0.699   0.425
*64:ff9b::23b7:39a9
                 206.108.0.131    2 u  483 1024  377   10.450    1.795   0.385
+64:ff9b::9538:2f3c
                 206.108.0.132    2 u  123 1024  377   14.716    1.211   0.879

After swact

remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 face::2
                 127.0.0.1       12 u   27   64    1    0.154   -1.134   0.000
 64:ff9b::cdce:4607
                 206.108.0.131    2 u   27   64    1   43.220   -0.894   0.000
 64:ff9b::23b7:39a9
                 206.108.0.131    2 u   29   64    1   10.626    1.972   0.000
 64:ff9b::9538:2f3c
                 206.108.0.132    2 u   29   64    1   15.316    0.934   0.000

from daemon.log on controller-1
           
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #16 eno1, 2620:10a:a001:a103::1218#123, interface stats: received=422, sent=422, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::a29f:c801 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::c063:2ac interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info 64:ff9b::3627:1740 interface 2620:10a:a001:a103::1218 -> (none)
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #14 vlan116, feed:beef::1#123, interface stats: received=0, sent=0, dropped=0, active_time=71038 secs
2019-11-20T17:42:58.000 controller-0 ntpd[100062]: info Deleting interface #4 ens801f0, 192.168.202.1#123, interface stats: received=0, sent=0, dropped=0, active_time=71038 secs
2019-11-20T17:43:06.000 controller-0 ntpd[100062]: info Deleting interface #17 vlan115, face::4#123, interface stats: received=105, sent=115, dropped=0, active_time=70977 secs
2019-11-20T17:43:06.000 controller-0 ntpd[100062]: info face::3 interface face::4 -> (none)
2019-11-20T17:43:08.000 controller-0 ntpd[100062]: info 0.0.0.0 0618 08 no_sys_peer
2019-11-20T17:43:20.000 controller-0 ntpd[100062]: info Deleting interface #12 vlan115, face::1#123, interface stats: received=1, sent=3, dropped=0, active_time=71060 secs
2019-11-20T17:43:20.000 controller-0 ntpd[100062]: info face::3 interface face::1 -> (none)

Revision history for this message

Kristine Bujold (kbujold) wrote on 2019-11-20:

With a swact the oam interface changes. This must cause the ntpq to reset the reachability.

With unlocks/reboots the ntpq needs time to determine reachabilty. Our current ntpq audit is 10 minutes.

We could audit more frequently and/or use debouncing before raising alarm. Debouncing would reduce ntp alarm noise.

Revision history for this message

Yang Liu (yliu12) wrote on 2019-11-20:

If I remember correctly, this alarm did not go away. There is a separate LP for swact issue, where alarms go away after 20 minutes or so.

In this specific LP, I suspect the following selected NTP server is not recognized as a valid server in our code:
*64:ff9b::ce6c:83
.PTP0. 1 u 638 1024 377 8.263 -0.172 0.120

Is it possible to create a unit test using above server or check if the code parses the Refid?

Revision history for this message

Kristine Bujold (kbujold) wrote on 2019-11-20:

Download full text (4.1 KiB)

I am seeing errors in the daemon.log regarding raising and clearing the NTP alarms "Failed to execute clear_fault"

2019-11-02T03:13:23.877 controller-1 collectd[113678]: info WARNING:root:fm_python_extension: Failed to connect to FM manager
2019-11-02T03:13:23.880 controller-1 collectd[113678]: info NTP query plugin 'clear_fault' exception ; 100.114:host=controller-1.ntp ; Failed to execute clear_fault.
2019-11-02T03:13:23.884 controller-1 collectd[113678]: info NTP query plugin 100.114:host=controller-1.ntp=64:ff9b::ce6c:83 alarm cleared
2019-11-02T03:13:23.885 controller-1 collectd[113678]: info NTP query plugin selected server changed from 'None' to '64:ff9b::ce6c:83'
2019-11-02T03:13:23.885 controller-1 collectd[113678]: info NTP query plugin reachable servers: ['64:ff9b::ce6c:83']

The alarm for "64:ff9b::ce6c:83" appears to eventually get cleared 20 minutes later

fm-manager.log:2019-11-02T03:53:23.836 fmMsgServer.cpp(398): Raising Alarm/Log, (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83)
fm-manager.log:2019-11-02T03:53:23.838 fmMsgServer.cpp(421): Alarm created/updated: (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83) (2) (8606362d-0906-4610-b4df-983d11f1eb9d)

fm-manager.log:2019-11-02T04:13:23.837 fmMsgServer.cpp(494): Deleted alarm: (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83)

But also I also see these generic NTP alarms

These should of been cleared based on the ntpq output

2019-11-02T14:13:23.807 controller-1 collectd[113678]: info NTP query plugin server list: ['0.pool.ntp.org', ...

I am seeing errors in the daemon.log regarding raising and clearing the NTP alarms "Failed to execute clear_fault"

The alarm for "64:ff9b::ce6c:83" appears to eventually get cleared 20 minutes later

fm-manager.log:2019-11-02T04:13:23.837 fmMsgServer.cpp(494): Deleted alarm: (100.114) (host=controller-1.ntp=64:ff9b::ce6c:83)

But also I also see these generic NTP alarms

cat alarms.info
--------------------------------------------------------------------
Sat Nov  2 14:21:24 UTC 2019 :  : fm alarm-list
--------------------------------------------------------------------
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| Alarm ID | Reason Text                                                              | Entity ID             | Severity | Time Stamp           |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+
| 100.114  | NTP cannot reach external time source; syncing with peer controller only | host=controller-0.ntp | minor    | 2019-11-02T14:14:05. |
|          |                                                                          |                       |          | 895979               |
|          |                                                                          |                       |          |                      |
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.   | host=controller-1.ntp | major    | 2019-11-02T03:03:25. |
|          |                                                                          |                       |          | 836876               |
|          |                                                                          |                       |          |                      |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------+

These should of been cleared based on the ntpq output

2019-11-02T14:13:23.807 controller-1 collectd[113678]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org', '3.pool.ntp.org']
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ: +abcd:204::2
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ:                  232.178.14.219   3 u  689 1024  377    0.084    0.607   0.352
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ: *64:ff9b::ce6c:83
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ:                  .PTP0.           1 u  992 1024  377    8.263   -0.172   0.120
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ: +64:ff9b::c7b6:ccc5
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ:                  219.119.208.14   2 u  754 1024  377   76.361    0.814   0.161
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ: +64:ff9b::8180:c14
2019-11-02T14:13:23.836 controller-1 collectd[113678]: info NTPQ:                  172.30.248.10    2 u  757 1024  377   47.996   -1.377   0.182

Maybe there is an issue with the NTP query plugin code in collectd?

Revision history for this message

Yang Liu (yliu12) wrote on 2019-11-21:

Yes you are right, the stuck alarms were the generic ones complaining about no valid/reachable servers.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-11-21: Fix proposed to monitoring (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/695581

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-11-22: Fix merged to monitoring (master)

#11

Reviewed: https://review.opendev.org/695581
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=c0b2838e988050096b89c176428b5e8e0142429c
Submitter: Zuul
Branch: master

commit c0b2838e988050096b89c176428b5e8e0142429c
Author: Kristine Bujold <email address hidden>
Date: Thu Nov 21 13:05:04 2019 -0500

Fix alarm issues with NTP

    This submit fixes a bug that would prevent NTP alarms clears from
    being retried on subsequent audits if the FM manager was returning
    an exception.

    Closes-bug: 1851043
    Change-Id: I3dc962ddcf78f114b8b361ab641eb11b0ef461e4
    Signed-off-by: Kristine Bujold <email address hidden>