Config out-of-date alarm takes 35mins to clear after lock/unlock, following an ntp/ptp config change

Bug #1897629 reported by Difu Hu
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
After change controller-0 clock_synchronization and lock/unlock controller-0, alarm "250.001 controller-0 Configuration is out-of-date" take long time to disappear on DC-3 subcloud

Severity
--------
Major

Steps to Reproduce
------------------
On DC-3 subcloud (e.g. subcloud2):
system host-update controller-0 clock_synchronization=ptp
system host-update controller-0 clock_synchronization=ntp
lock/unlock controller-0

Expected Behavior
------------------
alarm "250.001 controller-0 Configuration is out-of-date" disappears after lock/unlock.

Actual Behavior
----------------
Alarm "250.001 controller-0 Configuration is out-of-date" takes long time to disappear (about 35 minutes)

Reproducibility
---------------
Happened 3/3 times on DC-3 subcloud2
(Issue doesn't occur on sm_1)

System Configuration
--------------------
DC system
Lab-name: wp_18_21 (DC-3)
subcloud2

Branch/Pull Time/Commit
-----------------------
2020-09-27_20-00-07

Last Pass
---------
Not sure

Timestamp/Logs
--------------
+----------+--------------------------------------------+-------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+--------------------------------------------+-------------------+----------+----------------------------+
| 250.001 | controller-0 Configuration is out-of-date. | host=controller-0 | major | 2020-09-28T15:29:05.892837 |
+----------+--------------------------------------------+-------------------+----------+----------------------------+

Test Activity
-------------
Regression Testing

Revision history for this message
Difu Hu (difuhu) wrote :
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

By looking at var/extra/database/fm.db.sql.txt
the history of alarms set and cleared is like this:

Alarms Set Alarm cleared
2020-09-28 08:46:31.456314 --- 2020-09-28 08:54:31.006866

2020-09-28 09:44:32.492961 --- 2020-09-28 10:40:30.987332
2020-09-28 11:37:33.310559 --- 2020-09-28 12:17:32.351701
2020-09-28 12:46:59.391213 --- 2020-09-28 16:01:19.306163T

Observation:
The alarms are cleared with a glitch (Clear, set, clear)
eg
250.001 clear (2020-09-28 08:54:23.200403)
250.001 set (2020-09-28 08:54:26.254063)
250.001 clear (2020-09-28 08:54:31.006866)

By looking at var/log/bash.log
the history of config update and and config apply (lock/unlock)

2020-09-28T12:46:58.000 clock_synchronization=ptp
2020-09-28T13:38:54.000 clock_synchronization=ntp

2020-09-28T13:39:10.000 host-lock controller-0
2020-09-28T13:39:56.00 host-unlock controller-0

2020-09-28T14:53:59.000 clock_synchronization=ptp
2020-09-28T14:55:10.000 clock_synchronization=ntp

2020-09-28T15:00:46.000 clock_synchronization=ptp
2020-09-28T15:01:46.000 clock_synchronization=ntp

2020-09-28T15:07:54.000 clock_synchronization=ptp
2020-09-28T15:08:54.000 clock_synchronization=ntp

2020-09-28T15:09:10.000 host-lock controller-0
2020-09-28T15:09:56.000 host-unlock controller-0

2020-09-28T15:18:52.000 host-lock controller-0
2020-09-28T15:20:43.000 host-unlock controller-0

Observation:
There is a continuously appearing error in sm.log from 2020-09-28T08:46:09 to 2020-09-28T16:18:06 (tens of times each second)
controller-0 sm: debug time[3290.984] log<75888> ERROR: sm[93126]: sm_msg.c(880): Failed to send message on socket for interface (vlan49), error=Network is unreachable.
controller-0 sm: debug time[3290.984] log<75890> ERROR: sm[93126]: sm_msg.c(880): Failed to send message on socket for interface (vlan48), error=Network is unreachable.

From networking.info
---
48: vlan48@oam0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc htb state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:b5:7e:ec brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    3546091 37207 0 0 0 61
    TX: bytes packets errors dropped carrier collsns
    13365456 66206 0 0 0 0
49: vlan49@oam0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 3c:fd:fe:b5:7e:ec brd ff:ff:ff:ff:ff:ff
    RX: bytes packets errors dropped overrun mcast
    0 0 0 0 0 0
    TX: bytes packets errors dropped carrier collsns
    3822854 27905 0 0 0 0
---
fd01:13::4 dev vlan48 used 3/2987/0 probes 6 FAILED

Ghada Khalil (gkhalil)
summary: - Alarm "250.001 controller-0 Configuration is out-of-date" takes long
- time to disappear after lock/unlock
+ Configuration is out-of-date alarm takes long time to clear after
+ lock/unlock, following an ntp/ptp config change
summary: - Configuration is out-of-date alarm takes long time to clear after
+ Configuration is out-of-date alarm takes 35mins to clear after
lock/unlock, following an ntp/ptp config change
summary: - Configuration is out-of-date alarm takes 35mins to clear after
- lock/unlock, following an ntp/ptp config change
+ Config out-of-date alarm takes 35mins to clear after lock/unlock,
+ following an ntp/ptp config change
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0 stx.config
Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

reboot 2020-09-28T15:23:34.643
drwxr-xr-x 2 paul paul 4.0K Sep 28 18:25 2020-09-28-15-25-55_controller
drwxr-xr-x 2 paul paul 4.0K Sep 28 18:28 2020-09-28-15-28-07_worker
drwxr-xr-x 2 paul paul 4.0K Sep 28 19:01 2020-09-28-16-01-11_controller

the puppet worker manifest 2020-09-28-15-28-07_worker triggers
a sysinv-agent restart just as in the description of the commit here:
https://review.opendev.org/c/starlingx/config/+/745391

2020-09-28T15:28:49.076 Debug: 2020-09-28 15:28:49 +0000 Executing: '/usr/bin/systemctl restart sysinv-agent
As it says in the description if the sysinv-agent is restarted the subsequent puppet manifest is not applied.

This is a duplicate of
https://bugs.launchpad.net/starlingx/+bug/1890834

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released give the duplicate LP has been fixed

Changed in starlingx:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.