100.114 alarm "NTP cannot reach external time source; syncing with peer controller only " raised when other external NTP servers are available

Bug #1889101 reported by Peng Peng
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Low
Gustavo Dobro

Bug Description

Brief Description
-----------------
In DX system, create patch-strategy and apply it. After patch applied, remove the patch and delete patch-strategy successfully, 100.114 alarm "NTP cannot reach external time source; syncing with peer controller only" raised.

Severity
--------
Major

Steps to Reproduce
------------------
create patch-strategy
apply it
After patch applied, remove the patch
delete patch-strategy
check alarm-list

TC-name: functional/patching/test_patch_orchestration.py::test_patch_orch_with_ignored_alarms

Expected Behavior
------------------
no 100.114 alamr

Actual Behavior
----------------
100.114 alarm raised

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Lab-name: WP_13-14

Branch/Pull Time/Commit
-----------------------
2020-07-25_00-00-00

Last Pass
---------
no known

Timestamp/Logs
--------------
[2020-07-27 03:45:25,776] 314 DEBUG MainThread ssh.send :: Send 'sw-manager --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-interface internal --os-region-name RegionOne patch-strategy delete'

[2020-07-27 03:45:27,774] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-07-27 03:45:28,901] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| 1329c819-8198-409b-a030-6d7d0ccbb299 | 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2020-07-27T03:42:01.451586 |
+--------------------------------------+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list --nowrap
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2020-07-27T03:52:01.494746 |
+----------+--------------------------------------------------------------------------+-----------------------+----------+----------------------------+

# Following output indicates there are 3 external ntp servers in green available; however the peer controller in yellow was selected to be the ntp server, causing above alarm to be raised.
controller-1:~$ sudo ntpq -pn
Password:
     remote refid st t when poll reach delay offset jitter
==============================================================================
*abcd:204::2
                 153.24.162.44 2 u 426 1024 177 0.075 -0.677 0.898
+64:ff9b::4246:ac11
                 51.79.69.205 3 u 421 1024 177 15.158 -1.297 2.135
+64:ff9b::a29f:c87b
                 10.16.15.99 3 u 429 1024 177 27.097 -1.658 1.748
+64:ff9b::d8e8:844d
                 206.108.0.131 2 u 499 1024 77 68.415 -2.034 0.858
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :

It might be a coincidence that this alarm happened to be raised during/after a patch test.

The issue is a peer controller could be selected as NTP server, while other external NTP servers are available, causing an alarm to be raised.

summary: 100.114 alarm "NTP cannot reach external time source; syncing with peer
- controller only " raised after patching
+ controller only " raised when other external NTP servers are available
Ghada Khalil (gkhalil)
tags: added: stx.config
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - should be investigated; it looks like the selected ntp server is incorrect.

Changed in starlingx:
assignee: nobody → Les Bowness (lbowness)
Les Bowness (lbowness)
Changed in starlingx:
assignee: Les Bowness (lbowness) → Gustavo Dobro (mgdobro)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Lowering the priority given there are no reports that this is still reproducible in more recent builds

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.5.0
Revision history for this message
Takamasa Takenaka (ttakenak) wrote :

It is reproducible with the latest starlingx green build (of March 6, 2012)

[sysadmin@controller-0 ~(keystone_admin)]$ ntpq -p
     remote refid st t when poll reach delay offset jitter
==============================================================================
*controller-1
                 168.96.251.197 2 u 39 64 377 0.068 6.460 2.921
+srv2.gyllen.cl
                 200.54.149.24 2 u 60 64 377 253.600 -60.660 23.801
+time.cloudflare.com
                 10.44.9.236 3 u 62 64 377 31.379 8.345 24.313

[sysadmin@controller-1 ~(keystone_admin)]$ ntpq -p
     remote refid st t when poll reach delay offset jitter
==============================================================================
 controller-0
                 192.168.204.4 3 u 46 64 376 0.630 -1.277 6.538
+ntp.xtom.com.hk
                 223.255.185.3 3 u 108 128 377 355.623 2.050 57.172
*ptp.aggo-conicet.gob.ar
                 .PTP0. 1 u 42 128 377 32.424 -3.921 8.707

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
+----------+--------------------------------------------------------------------------+-----------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------------------------------------+-----------------------+----------+---------------------+
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-0.ntp | minor | 2021-04-19T15:13:28 |
| | | | | .304003 |
| | | | | |
+----------+--------------------------------------------------------------------------+-----------------------+----------+---------------------+

Revision history for this message
Takamasa Takenaka (ttakenak) wrote :

[Root Cause]
/var/log/daemon.log (filtered with "NTP")
{noformat}
2021-04-20T19:21:40.900 controller-1 collectd[113955]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org']
2021-04-20T19:21:40.912 controller-1 collectd[113955]: info NTPQ: *192.168.204.2
2021-04-20T19:21:41.043 controller-1 collectd[113955]: info NTP query plugin 100.114:host=controller-1.ntp alarm cleared
2021-04-20T19:21:41.044 controller-1 collectd[113955]: info NTPQ: 194.0.5.123 3 u 35 64 377 0.305 1.634 3.153
2021-04-20T19:21:41.044 controller-1 collectd[113955]: info NTPQ: +216.6.2.70
2021-04-20T19:21:41.045 controller-1 collectd[113955]: info NTPQ: 195.219.205.18 2 u 39 64 377 209.253 1.348 18.771
2021-04-20T19:21:41.045 controller-1 collectd[113955]: info NTPQ: +138.197.135.239
2021-04-20T19:21:41.045 controller-1 collectd[113955]: info NTPQ: 206.108.0.133 2 u 37 64 377 211.428 4.507 27.263
2021-04-20T19:21:41.045 controller-1 collectd[113955]: info NTP query plugin peer is selected
2021-04-20T19:21:41.089 controller-1 collectd[113955]: info NTP query plugin raised alarm 100.114:host=controller-1.ntp
{noformat}
When peer is selected as ntp server, script ntpq.py checks refid. But refid for server is not in the same line as shown above.
As a result, script could not find reliable refid, even it exists (in the following line) and does not set obj.selected_server.
Later in script, raise alarm because there is no selected_server.

[Overview Design]
When aquiring data from ntpq, conbine data lines to 1 line data if it seems 2 lines data.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/787588

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Takamasa Takenaka (ttakenak) wrote :

After internal discussion, we decided to remove this minor alarm "NTP cannot reach external time source; syncing with peer controller only" because NTP does not prioritize external time source over peer.

This bug is available since any alarm should not be raised when peer is selected with reliable source.

I'll update the patch.

Revision history for this message
Takamasa Takenaka (ttakenak) wrote :
Download full text (15.0 KiB)

Patch is updated and test result with https://review.opendev.org/c/starlingx/monitoring/+/787588/4

[IPv4]
When ntp server is not selected:

controller-1:~$ ntpq -np
     remote refid st t when poll reach delay offset jitter
==============================================================================
 192.168.204.2
                 .INIT. 16 u - 64 0 0.000 0.000 0.000
 206.75.147.25
                 192.168.10.254 2 u 46 64 1 250.884 17.371 0.000
 149.56.121.17
                 213.251.128.249 2 u 45 64 1 269.301 20.717 0.000
==> No NTP server is selected.

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+------------------------------------------------------------------------+-------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+------------------------------------------------------------------------+-------------------------------+----------+---------------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2021-04-28T12:30:31 |
| | | | | .574720 |
| | | | | |
| 100.114 | NTP address 149.56.121.17 is not a valid or a reachable NTP server. | host=controller-1.ntp=149.56. | minor | 2021-04-28T12:30:31 |
| | | 121.17 | | .531136 |
| | | | | |
| 100.114 | NTP address 206.75.147.25 is not a valid or a reachable NTP server. | host=controller-1.ntp=206.75. | minor | 2021-04-28T12:30:31 |
| | | 147.25 | | .521752 |
| | | | | |
+----------+------------------------------------------------------------------------+-------------------------------+----------+---------------------+
==> 1 major alarm and 2 minor alarms as expected

2021-04-28T12:30:31.515 controller-1 collectd[119298]: info NTP query plugin server list: ['0.pool.ntp.org', '1.pool.ntp.org']
2021-04-28T12:30:31.521 controller-1 collectd[119298]: info NTPQ: 192.168.204.2 .INIT. 16 u - 64 0 0.000 0.000 0.000
2021-04-28T12:30:31.521 controller-1 collectd[119298]: info NTPQ: 206.75.147.25 192.168.10.254 2 u 24 64 1 250.884 17.371 0.000
2021-04-28T12:30:31.531 controller-1 collectd[1...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/787588
Committed: https://opendev.org/starlingx/monitoring/commit/2ef5451f442482636db3c0c3641e8412821bd8c5
Submitter: "Zuul (22348)"
Branch: master

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

    Format 2 lines ntpq data into 1 lines

    The problem was logic expected one line data for
    ntpq result. But it was 2 lines for each ntp server
    entry. When peer server is selected, script checked
    refid if refid is reliable or not but it could not
    find because refid is in the following line.
    This fix formats 2 lines data into 1 line.

    The minor alarm "minor alarm "NTP cannot reach
    external time source; syncing with peer controller
    only" is removed because NTP does not prioritize
    external time source over peer.

    Closes-Bug: 1889101

    Signed-off-by: Takamasa Takenaka <email address hidden>
    Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/monitoring/+/792244

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (f/centos8)
Download full text (7.8 KiB)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

    Format 2 lines ntpq data into 1 lines

    The problem was logic expected one line data for
    ntpq result. But it was 2 lines for each ntp server
    entry. When peer server is selected, script checked
    refid if refid is reliable or not but it could not
    find because refid is in the following line.
    This fix formats 2 lines data into 1 line.

    The minor alarm "minor alarm "NTP cannot reach
    external time source; syncing with peer controller
    only" is removed because NTP does not prioritize
    external time source over peer.

    Closes-Bug: 1889101

    Signed-off-by: Takamasa Takenaka <email address hidden>
    Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

    Add alarm audit to starlingx collectd fm notifier plugin

    This update adds common plugin support for alarm state auditing.
    The audit is able to detect and correct the following alarm
    state errors:

       Error Case Correction Action
       ----------------------- -----------------
     - stale alarm ; delete alarm
     - missing alarm ; assert alarm
     - alarm severity mismatch ; refresh alarm

    The common audit is enabled for the fm_notifier plugin that supports
    alarm managment for the following resources.

     - CPU with alarm id 100.101
     - Memory with alarm id 100.103
     - Filesystem with alarm id 100.104

    Other plugins may use this common audit in the future but only the
    above resources have the audit enabled for them by this update.

    Test Plan:

    PASS: Verify stale alarm detection/correction handling
    PASS: Verify missing alarm detection/correction handling
    PASS: Verify alarm severity mismatch detection/correction handling
    PASS: Verify hosts only audits its own specified alarms
    PASS: Verify success path of monitoring a single and mix
          of base and instance alarms of varying severity while
          such alarm conditions come and go
    PASS: Verify alarm audit of mix of base and instance alarms
          over a collectd process restart
    PASS: Verify audit handling of alarm that migrates from
          major to critical to major to clear
    PASS: Verify audit handling transition between alarm and
          no alarm conditions
    PASS: Verify soak of random cpu, memory and filesystem
          overage alarm assertions and clears that also involve
          manual alarm deletions, assertions and severity changes
          that exercise new audit features

    Regression:

    PASS: Verify alarm and audit handling over Swact with mounted
          filesystem that has active alarm
  ...

Read more...

tags: added: in-f-centos8
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.