time drifting on linux-hwe kernels

Bug #1744988 reported by Juul Spies
This bug affects 2 people
Affects          Status        Importance  Assigned to       Milestone
linux (Ubuntu)   Fix Released  Medium      Joseph Salisbury
Artful           Fix Released  Medium      Joseph Salisbury
Bionic           Fix Released  Medium      Joseph Salisbury

Bug Description

== SRU Justification ==
We observe NTP time drift on two servers running HWE kernels on Xenial. A few weeks ago we wanted to switch from 4.4 to 4.10. After rebooting the servers into the 4.10 kernel, we saw a large time offset within minutes of booting. Despite running ntpd, the clock would not keep up: the offset remained and kept growing over time.

After rebooting back into 4.4, we immediately noticed that the time stayed normal. Since then I have tested about a dozen kernel versions, which makes me think that something introduced in kernel 4.10 makes the clock go out of sync.

So what do we observe?

After 1 min uptime:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp4.bit.nl     .PPS.            1 u    5   16     7   0.497  100.084  81.382
+ntp1.bit.nl     193.0.0.229      2 u    8   16     7   0.603   93.241  70.643
+ntp2.bit.nl     193.67.79.202    2 u    8   16     7   0.582   93.218  70.674
+ntp3.bit.nl     193.79.237.14    2 u    9   16     7   0.781   90.488  70.574

A couple of minutes later (and also hours/days, the offset just keeps growing over time)

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp4.bit.nl     .PPS.            1 u   13   16   377   0.447  400.198 151.335
+ntp1.bit.nl     193.0.0.229      2 u   13   16   377   0.313  400.561 151.339
+ntp2.bit.nl     193.67.79.202    2 u   13   16   377   0.517  400.445 151.398
+ntp3.bit.nl     193.79.237.14    2 u   12   16   377   0.934  402.013 151.384
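
To put the two listings into perspective, here is a rough Python sketch that pulls the offset column (milliseconds) out of the peer lines and turns the growth into a drift rate in ppm. The elapsed time between the two listings is not stated in the report beyond "a couple of minutes", so `ASSUMED_ELAPSED_S` is a hypothetical value; the function names are likewise only illustrative. The 500 ppm figure is the maximum frequency error ntpd is designed to correct.

```python
from statistics import mean

# Peer lines copied from the two `ntpq -p` listings in this report.
AFTER_ONE_MINUTE = """\
*ntp4.bit.nl .PPS. 1 u 5 16 7 0.497 100.084 81.382
+ntp1.bit.nl 193.0.0.229 2 u 8 16 7 0.603 93.241 70.643
+ntp2.bit.nl 193.67.79.202 2 u 8 16 7 0.582 93.218 70.674
+ntp3.bit.nl 193.79.237.14 2 u 9 16 7 0.781 90.488 70.574
"""
LATER = """\
*ntp4.bit.nl .PPS. 1 u 13 16 377 0.447 400.198 151.335
+ntp1.bit.nl 193.0.0.229 2 u 13 16 377 0.313 400.561 151.339
+ntp2.bit.nl 193.67.79.202 2 u 13 16 377 0.517 400.445 151.398
+ntp3.bit.nl 193.79.237.14 2 u 12 16 377 0.934 402.013 151.384
"""

# ASSUMPTION: the report only says "a couple of minutes later".
ASSUMED_ELAPSED_S = 180.0
# Maximum frequency error ntpd will compensate for.
NTPD_MAX_SLEW_PPM = 500.0


def peer_offsets_ms(text):
    """Return {peer: offset in ms} for every peer line in `ntpq -p` output."""
    offsets = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # ruler or blank line
        try:
            offset = float(fields[8])
        except ValueError:
            continue  # header line: the column holds the word "offset"
        # Drop the tally character (*, +, -, ...) if the line starts with one.
        peer = fields[0][1:] if line[0] in "*+-#xo." else fields[0]
        offsets[peer] = offset
    return offsets


def drift_ppm(before_ms, after_ms, elapsed_s):
    """Offset growth per elapsed second, expressed in parts per million."""
    return (after_ms - before_ms) / 1000.0 / elapsed_s * 1e6


before = mean(peer_offsets_ms(AFTER_ONE_MINUTE).values())
after = mean(peer_offsets_ms(LATER).values())
rate = drift_ppm(before, after, ASSUMED_ELAPSED_S)
print(f"mean offset: {before:.1f} ms -> {after:.1f} ms")
print(f"estimated drift: {rate:.0f} ppm (ntpd limit: {NTPD_MAX_SLEW_PPM:.0f} ppm)")
```

Under that assumed interval the estimate lands well above 500 ppm, which would be consistent with ntpd being unable to pull the clock back.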

As mentioned, I tested about a dozen kernels and believe I have pinpointed the release in which the drift was introduced: 4.10-rc1. Below are the results for the kernels I have tested so far:

Tested: 4.4.0-112-generic: not affected
Tested: 4.8.0-41-generic: not affected
Tested: 4.8.0-58-generic: not affected
Tested: 4.9.0 mainline: not affected
Tested: 4.9.66 mainline: not affected
Tested: 4.10-rc1 mainline: affected
Tested: 4.10 mainline: affected
Tested: 4.10.0-38-generic: affected
Tested: 4.10.0-40-generic: affected
Tested: 4.13.0-16-generic: affected
Tested: 4.13.0-31-generic: affected
Tested: 4.14.3 mainline: affected
Tested: 4.15-rc1 mainline: affected
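
As a sanity check on the list above, the following Python sketch sorts the tested kernels by version and confirms that every unaffected kernel is older than every affected one, which is what a single regression point would look like. The `version_key` helper is purely illustrative and only handles the version strings that appear in this report.

```python
import re

# (kernel, affected) pairs copied from the list in this report.
RESULTS = [
    ("4.4.0-112-generic", False),
    ("4.8.0-41-generic", False),
    ("4.8.0-58-generic", False),
    ("4.9.0 mainline", False),
    ("4.9.66 mainline", False),
    ("4.10-rc1 mainline", True),
    ("4.10 mainline", True),
    ("4.10.0-38-generic", True),
    ("4.10.0-40-generic", True),
    ("4.13.0-16-generic", True),
    ("4.13.0-31-generic", True),
    ("4.14.3 mainline", True),
    ("4.15-rc1 mainline", True),
]


def version_key(name):
    """Sort key: (major, minor, patch, rc), with -rcN before the final release."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?", name)
    major, minor, patch, rc = match.groups()
    rc_rank = int(rc) if rc else float("inf")
    return (int(major), int(minor), int(patch or 0), rc_rank)


last_good = max((k for k, bad in RESULTS if not bad), key=version_key)
first_bad = min((k for k, bad in RESULTS if bad), key=version_key)
clean_split = version_key(last_good) < version_key(first_bad)

print("last unaffected:", last_good)
print("first affected: ", first_bad)
print("single regression point:", clean_split)
```

With the data above, the split falls between the 4.9 series and 4.10-rc1.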

When I was about to file this bug report about an hour ago, I noticed that 4.15-rc9 was available and decided to give it a go, to make sure I had really tested the latest version. It has now been running for over an hour and the time is stable.

Most likely the following entries from the changelog are related to the issue we are having:

Len Brown (3):
      x86/tsc: Future-proof native_calibrate_tsc()
      x86/tsc: Fix erroneous TSC rate on Skylake Xeon
      x86/tsc: Print tsc_khz, when it differs from cpu_khz

Both servers that are having issues on our side are equipped with the following CPU:

Cpu Model (from /proc/cpuinfo)
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
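
For anyone wanting to check whether a machine has the same CPU model, here is a small Python sketch that parses /proc/cpuinfo style text and tests for Intel family 6, model 85 (the model number the kernel calls INTEL_FAM6_SKYLAKE_X). The function names are illustrative only, and matching the model number does not by itself prove that a system shows the drift.

```python
# Fields copied from the /proc/cpuinfo excerpt in this report.
SAMPLE_CPUINFO = """\
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
"""


def parse_cpuinfo(text):
    """Return the key/value pairs of the first CPU listed in cpuinfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # setdefault keeps the first occurrence, i.e. the first CPU.
            info.setdefault(key.strip(), value.strip())
    return info


def is_family6_model85(info):
    """True for Intel family 6, model 85 (Skylake Xeon)."""
    return (
        info.get("vendor_id") == "GenuineIntel"
        and info.get("cpu family") == "6"
        and info.get("model") == "85"
    )


cpu = parse_cpuinfo(SAMPLE_CPUINFO)
print(cpu["model name"], "->", is_family6_model85(cpu))
```

On a live system the same functions could be fed the contents of /proc/cpuinfo instead of the sample text.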

Standard information as requested:
1:
Description: Ubuntu 16.04.3 LTS
Release: 16.04

2:
root@bit-host6:~# apt-cache policy linux-image-generic-hwe-16.04
linux-image-generic-hwe-16.04:
  Installed: 4.13.0.31.51
  Candidate: 4.13.0.31.51

3: Stable time

4: A big time offset

== Fixes ==
da4ae6c4a0b8 ("x86/tsc: Future-proof native_calibrate_tsc()")
b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon")
4b5b2127238e ("x86/tsc: Print tsc_khz, when it differs from cpu_khz")

== Regression Potential ==
Low. These three commits fix an existing regression. They were also cc'd to
stable, so they have had additional upstream review.

== Test Case ==
A test kernel was built with these patches and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

Revision history for this message
Juul Spies (juul) wrote :

Although probably obvious: the mainline kernels that I tested were all downloaded from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-hwe (Ubuntu):
status: New → Confirmed
Sander Smeenk (ubuntu-freshdot) wrote :

Wow!! I've been searching for this problem for quite some time!!

Juul Spies (juul) wrote :

I have created a patch file, tsc.patch, which applies Len Brown's upstream patches to the 4.13.0-31-generic kernel. This resolves the time issues we are having.

tags: added: patch
Joseph Salisbury (jsalisbury) wrote :

Can you see if this bug still occurs with the latest HWE kernel for Xenial, which is 4.13 based?

Changed in linux-hwe (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Sorry, you can ignore the request in comment #5. I missed all the kernels you tested in the bug description.

affects: linux-hwe (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the three commits: da4ae6c, b511203 and 4b5b212

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1744988

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Juul Spies (juul) wrote :

I have been running your build on two servers for about an hour now. Timing is stable on both of them.

Changed in linux (Ubuntu Artful):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
description: updated
Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Juul Spies (juul) wrote :

My apologies, I was not aware of this. After your message I have started testing the proposed kernel and can verify it is working as expected.

tags: added: verification-done-artful
removed: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew) wrote :

The patches can be found in Bionic.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released