PTP monitoring reported compute lock to controller with zero skew

Bug #1836884 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Alexander Kozyrev

Bug Description

Brief Description
-----------------
In PTP timestamping, nodes lock to a remote grandmaster and report an ever-changing sampled deviation from that reference.
The collectd PTP plugin records PTP timestamp deviations as samples in the database.

Severity
--------
Major

Steps to Reproduce
------------------
Configure PTP timestamp monitoring.
Beyond that how to create this issue is unknown.
Suspect something to do with the grandmaster going out of service.

Expected Behavior
------------------
Correct timestamp deviation sampling

Actual Behavior
----------------
Timestamp deviation of zero is seen and sampled.
Would expect a no lock alarm in this case.

Reproducibility
---------------
Seen once after incorrect PTP provisioning steps (see LP 1823975)

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
2019-04-07 23:30:01 +0000

Last Pass
---------
n/a

Timestamp/Logs
--------------
n/a

Test Activity
-------------
Feature test

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

compute-0_20190409.135329/var/log/daemon.log:2019-04-09T13:36:01.301 compute-0 collectd[10713]: info ptp plugin compute-0 is collecting samples [ 0] with Grand Master 001e67.fffe.38c3e4
compute-0_20190409.135329/var/log/daemon.log:2019-04-09T13:46:01.301 compute-0 collectd[10713]: info ptp plugin compute-0 is collecting samples [ 0] with Grand Master 001e67.fffe.38c3e4

compute-1_20190409.135329/var/log/daemon.log:2019-04-09T13:36:12.310 compute-1 collectd[12434]: info ptp plugin compute-1 is collecting samples [ 0] with Grand Master 001e67.fffe.38c3e4
compute-1_20190409.135329/var/log/daemon.log:2019-04-09T13:46:12.310 compute-1 collectd[12434]: info ptp plugin compute-1 is collecting samples [ 0] with Grand Master 001e67.fffe.38c3e4

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Refer to the following original LP

https://bugs.launchpad.net/starlingx/+bug/1823975

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Eric / @Alex,
Is this only an issue if incorrect PTP provisioning steps were followed? We're trying to determine if this is a gating issue or not.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Will investigate.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

WC113-121
HW Timestamping Mode
GM = GrandMaster

Action: Unlock inactive Controller
 - see inactive Controller become GM
Action: Unlock compute-0
 - see compute-0 lock to the inactive controller (now GM) with good sampling
Action: Lock and power off inactive controller GM
 - see compute-0 continue to report last sample over and over (-31)
   - restarted collectd and the same -31 value persisted
 - see compute-0 does not detect loss of GM
 - see compute-0 not raise no-lock alarm
 - see compute-0 continuing to report being locked to same (now absent) GM
Action: Unlock compute-1
 - see compute-1 no-lock alarm raised
Action: Unlock compute-2..3
 - see compute-2..3 no-lock alarm raised
Action: power on inactive controller
 - see compute-1..3 locked onto inactive controller as GM
 - see compute-0 start reporting major out of tolerance alarm; crazy large skew
 - see compute-0 not recovering on its own
Action: Restart collectd on compute-0
 - see compute-0 continue to report major out of tolerance alarm; crazy large skew
Action: Restart ptp4l process
 - see compute-0 continue to report major out of tolerance alarm; crazy large skew
Action: Lock/Unlock compute-0
 - see login password expired ; prompted to change.
 - see unreasonable last login date

[sysadmin@controller-1 ~(keystone_admin)]$ ssh compute-0
sysadmin@compute-0's password:
You are required to change your password immediately (password aged)
Last login: Tue Sep 9 00:23:51 2059 from controller-1
/etc/motd.d/00-header:

WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

WARNING: Your password has expired.
You must change your password now and login again!
Changing password for user sysadmin.
Changing password for sysadmin.
(current) UNIX password:
passwd: Authentication token manipulation error
Connection to compute-0 closed.

Previous login (before lock/unlock) attempt showed a reasonable date

[sysadmin@controller-1 ~(keystone_admin)]$ ssh compute-0
sysadmin@compute-0's password:
Last login: Thu Jul 18 02:06:49 2019 from controller-0
/etc/motd.d/00-header:

WARNING: Unauthorized access to this system is forbidden and will be
prosecuted by law. By accessing this system, you agree that your
actions may be monitored if unauthorized usage is suspected.

Unable to reproduce issue

recommend: this issue to be non-gating
Reasoning: performed most typical error path cases and unable to repro

New Issue: computes don't seem to recover from the loss of a GM

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As recommended by Eric, assigning to Alex to investigate as this appears to be a ptp driver issue. Marking as stx.2.0 for now until further investigation.

Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Alex Kozyrev (akozyrev)
tags: added: stx.2.0 stx.config
Changed in starlingx:
importance: Undecided → High
status: Incomplete → Triaged
Revision history for this message
Alexander Kozyrev (akozyrev) wrote :

Unable to reproduce the huge time leap, and the computes do recover in my testing on the same lab.
One thing needs to be addressed though: the alarm is indeed not raised when the GM is lost.
The logic in the PTP "no lock" alarm mechanism is wrong: it only checks whether the MAC address of the GM is the node's own.
But that's not the case when a GM was present and then lost: the GM MAC address still belongs to a controller.
A proper check would include a port status check as well as the MAC address check.
If the port status is not SLAVE, we have lost our connection to the GM even though the GM MAC address is remote.
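
A minimal sketch of the combined check described above, for illustration only. This is not the code from the actual fix: the function name, its parameters, and the local clock identity used in the usage note are hypothetical, and the port state is assumed to already be available as a string such as SLAVE or LISTENING.

```python
def ptp_lock_lost(gm_identity: str, own_identity: str, port_state: str) -> bool:
    """Return True when the node should be treated as having no PTP lock.

    Hypothetical helper; names and argument formats are assumptions.
    """
    # Existing check: the node reports itself as grandmaster, which means
    # no external GM ever appeared in the PTP setup.
    if gm_identity == own_identity:
        return True
    # Additional check: a remote GM identity alone does not prove a lock,
    # because the identity is still reported after the GM disappears.
    # Only a port in the SLAVE state is treated as locked.
    return port_state.strip().upper() != "SLAVE"
```

With the GM identity from the logs above and a made-up local identity, `ptp_lock_lost("001e67.fffe.38c3e4", "aabbcc.fffe.112233", "LISTENING")` reports a lost lock even though the GM identity is remote, which is the case the MAC-only check missed.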

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/674924

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/674924
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=d11ccac73f8ab2f2c4b51a6556e2111d1f05aa39
Submitter: Zuul
Branch: master

commit d11ccac73f8ab2f2c4b51a6556e2111d1f05aa39
Author: Alex Kozyrev <email address hidden>
Date: Fri Aug 2 16:12:57 2019 -0400

    Alarm is not raised in case PTP GM is lost

    "No lock" PTP alarm is raised only when GM is and was not present
    in a network. Current logic only raises this alarm in case MAC
    address of GM is the same as local MAC address. But it is only
    the case when no external GM ever appeared in a PTP setup.
    In case GM was present in a network and then lost we need to check
    port status instead. PTP MAC address still points to an external GM.
    But port status is changed from SLAVE to LISTENING state.

    Change-Id: I30365685e6f44566702cc82534ab6ebf0613a731
    Closes-bug: 1836884
    Signed-off-by: Alex Kozyrev <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Alex, Please cherry-pick to the stx.2.0 release branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676288

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0)

Reviewed: https://review.opendev.org/676288
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=b07d0451e71902a39072301ecbd5961ed666fcda
Submitter: Zuul
Branch: r/stx.2.0

commit b07d0451e71902a39072301ecbd5961ed666fcda
Author: Alex Kozyrev <email address hidden>
Date: Fri Aug 2 16:12:57 2019 -0400

    Alarm is not raised in case PTP GM is lost

    "No lock" PTP alarm is raised only when GM is and was not present
    in a network. Current logic only raises this alarm in case MAC
    address of GM is the same as local MAC address. But it is only
    the case when no external GM ever appeared in a PTP setup.
    In case GM was present in a network and then lost we need to check
    port status instead. PTP MAC address still points to an external GM.
    But port status is changed from SLAVE to LISTENING state.

    Change-Id: I30365685e6f44566702cc82534ab6ebf0613a731
    Closes-bug: 1836884
    Signed-off-by: Alex Kozyrev <email address hidden>
    (cherry picked from commit d11ccac73f8ab2f2c4b51a6556e2111d1f05aa39)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20