linuxptp ts2phc master offset spikes on realtime systems

Bug #1970776 reported by Cole Walker
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   High         Cole Walker

Bug Description

Brief Description
-----------------
On a realtime system, configuring ts2phc to source time from GNSS causes the master offset value to spike intermittently, which makes the system time unstable.

Severity
--------
Major

Steps to Reproduce
------------------

Use a config like:
controller-0:~$ cat /etc/ptpinstance/ts2phc-ts1.conf
[global]
##
## Default Data Set
##
leapfile /usr/share/zoneinfo/leap-seconds.list
logging_level 7
ts2phc.nmea_serialport /dev/ttyGNSS_5100_0
ts2phc.pulsewidth 100000000

[enp138s0f0]
##
## Associated interface: data0
##
ts2phc.extts_polarity rising

[enp81s0f0]
##
## Associated interface: oam0
##
ts2phc.extts_polarity rising

Watch the master offset value in /var/log/user.log. It occasionally spikes by 1 second or more before attempting to stabilize again. This coincides with a high nmea_delay value.
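
For anyone trying to confirm the symptom, a minimal sketch of a log scan follows. It assumes ts2phc writes lines containing "<iface> master offset <ns>" and "nmea delay: <ns> ns"; those line shapes, the function name find_spikes, and the 1 ms threshold are assumptions for illustration and are not taken from this report.

```python
import re

# Assumed ts2phc log shapes (illustrative only, not copied from this report):
#   "ts2phc: [100.001] enp138s0f0 master offset          3 s2 freq     +12"
#   "ts2phc: [100.000] nmea delay: 88403525 ns"
OFFSET_RE = re.compile(r"(\S+)\s+master offset\s+(-?\d+)")
DELAY_RE = re.compile(r"nmea delay:\s+(-?\d+) ns")


def find_spikes(lines, threshold_ns=1_000_000):
    """Return (line_no, iface, offset_ns, last_nmea_delay_ns) for large offsets.

    threshold_ns is a hypothetical cut-off; pick one that suits the system.
    """
    spikes = []
    last_delay = None  # most recent nmea delay seen before the offset line
    for line_no, line in enumerate(lines, start=1):
        delay = DELAY_RE.search(line)
        if delay:
            last_delay = int(delay.group(1))
            continue
        offset = OFFSET_RE.search(line)
        if offset and abs(int(offset.group(2))) >= threshold_ns:
            spikes.append(
                (line_no, offset.group(1), int(offset.group(2)), last_delay)
            )
    return spikes
```

Calling find_spikes(open("/var/log/user.log")) would list each offending line together with the last nmea delay seen before it, which makes it easier to check whether spikes line up with large delays.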

Expected Behavior
------------------
Master offset value should hover close to 0 at all times.

Actual Behavior
----------------
The master offset is unstable: it intermittently spikes by 1 second or more.

Reproducibility
---------------
100% reproducible on a realtime system; the issue occurs multiple times per hour.

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx master

Last Pass
---------
New scenario

Timestamp/Logs
--------------
Attach the logs for debugging (use attachments in Launchpad or for large collect files use: https://files.starlingx.kube.cengn.ca/)
Provide a snippet of logs here and the timestamp when issue was seen.
Please indicate the unique identifier in the logs to highlight the problem

Test Activity
-------------
Developer testing

Workaround
----------
Manually lower the nice value of the ice-gnss thread below the default of 0 (that is, give it a higher priority than the default).
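
A minimal sketch of the workaround follows, assuming the ice-gnss kernel threads show up under /proc with a comm name that starts with "ice-gnss". The helper names find_threads and renice are hypothetical, and this is not the script used in the eventual fix. The value -10 matches the one chosen in the fix.

```python
import os


def find_threads(prefix="ice-gnss", proc_root="/proc"):
    """Return PIDs whose /proc/<pid>/comm starts with the given prefix.

    The "ice-gnss" prefix is an assumption about how the thread is named.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the process exited while we were scanning
        if name.startswith(prefix):
            pids.append(int(entry))
    return sorted(pids)


def renice(pids, niceness=-10):
    """Set the nice value of each PID; lowering niceness requires root."""
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

Run as root, renice(find_threads()) applies the change. It leaves the scheduling policy alone, so the thread stays SCHED_OTHER, and the setting does not survive a reboot or a restart of the thread.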

Ghada Khalil (gkhalil)
tags: added: stx.7.0 stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Cole Walker (cwalops)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/839795
Committed: https://opendev.org/starlingx/utilities/commit/9183ef96db02faa9beeaa4021ad59e06ae6627ce
Submitter: "Zuul (22348)"
Branch: master

commit 9183ef96db02faa9beeaa4021ad59e06ae6627ce
Author: Cole Walker <email address hidden>
Date: Thu Apr 28 11:58:02 2022 -0400

    [PTP SyncE] Set niceness -10 for ice-gnss threads

    Problem: the master offset value in ts2phc intermittently spikes and
    causes the system to be incorrectly adjusted.

    This behaviour is seen when using the Intel Westport Channel NIC and ice
    driver 1.7.16 on a realtime kernel.

    Analysis of the issue shows that the ice-gnss thread responsible for
    reading from the GNSS and writing to the tty for consumption by ts2phc
    is sometimes getting delayed on realtime systems. Examination of
    typical workloads on the platform cores and discussion between Intel -
    the driver supplier - and the StarlingX community have led to an
    agreement to increase the priority of this thread.

    Most of the processes ordinarily running on the platform cores run at
    the default niceness of 0, so -10 has been selected to elevate the
    ice-gnss thread above those while leaving room on either side for other
    process tuning. It is also worth noting that the ice-gnss thread is
    being left as SCHED_OTHER, so processes assigned to SCHED_FIFO may still
    preempt it.

    Testing:

    PASS: Applied change to AIO-SX with Westport Channel NIC, ice-gnss
    thread is set to nice -10 after host lock/unlock.

    PASS: Cumulative 24 hours of ts2phc logs show no replication of fault
    when thread niceness is set to -10. When the thread is nice 0, fault
    occurs multiple times per hour.

    Closes-bug: 1970776

    Signed-off-by: Cole Walker <email address hidden>
    Change-Id: I1f45530f37ded11ab7406a5a1068f896a06c8843

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/840239

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/840239
Committed: https://opendev.org/starlingx/stx-puppet/commit/01a012bf688acadf2d3b2434afe5435d0d7ed7b7
Submitter: "Zuul (22348)"
Branch: master

commit 01a012bf688acadf2d3b2434afe5435d0d7ed7b7
Author: Cole Walker <email address hidden>
Date: Mon May 2 16:01:04 2022 -0400

    [PTP SyncE] Set ice-gnss thread prio in puppet

    Problem: The ice-gnss thread is not spawned until ptp services are
    configured, so on first time setup, the affine-process.sh script is not
    able to set the priority of the thread because it runs before puppet.

    This change adds a task to the puppet manifest to also set the niceness
    of the ice-gnss threads in order to handle the case of first time setup.
    Subsequent lock/unlocks will be handled earlier in the startup process by
    affine-process.sh.

    See https://review.opendev.org/c/starlingx/utilities/+/839795 for the
    earlier change to affine-process.sh.

    Testing:

    Pass: Thread niceness set correctly on first time setup on AIO-SX.
    Niceness is also correctly set on subsequent lock/unlocks. Switching
    node from ptp to ntp and back also results in correct priority.

    Closes-Bug: 1970776

    Signed-off-by: Cole Walker <email address hidden>
    Change-Id: I1c9c0ffb6cd0dad7b77232522832b1645256dcfd

Changed in starlingx:
status: In Progress → Fix Released