Very inaccurate TSC clocksource with kernel 4.13 on selected CPUs

Bug #1759787 reported by Paul Gear on 2018-03-29
This bug affects 4 people
Affects          Importance  Assigned to
linux (Ubuntu)   High        Unassigned
Artful           High        Unassigned

Bug Description

On kernel 4.13.0-37-generic, HP ProLiant DL380 Gen10 systems have been observed with very large clock offsets, as measured by NTP. Over the past few days on one of our production systems, we've used 3 different kernels: https://pastebin.ubuntu.com/p/nDkkgRqdtv/

All of these kernels default to the TSC clocksource, which is supposed to be very reliable on Skylake-X CPUs. On 4.4 (linux-image-generic-lts-xenial) the clock behaves as expected; on 3.13 (trusty default kernel) it is slightly worse; and on 4.13 (linux-image-generic-hwe-16.04) it is much worse. Today I switched the 4.13 system from the TSC clocksource to the HPET clocksource, which improved the situation dramatically.
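
For anyone wanting to reproduce the workaround, here is a minimal sketch of inspecting and switching the clocksource through the kernel's sysfs interface. The sysfs paths are the standard ones; the function names and the `base` parameter (useful for trying the code against a scratch directory) are my own and not part of any tool mentioned in this report. Writing to the real sysfs file requires root.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
SYSFS_CLOCKSOURCE = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_CLOCKSOURCE):
    """Return (current, available) clocksources as reported under `base`."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def switch_clocksource(name, base=SYSFS_CLOCKSOURCE):
    """Select `name` as the clocksource; refuse names the kernel does not offer."""
    base = Path(base)
    _, available = read_clocksources(base)
    if name not in available:
        raise ValueError(f"{name!r} is not in the available list: {available}")
    (base / "current_clocksource").write_text(name + "\n")


# Only report when the sysfs directory exists (i.e. on a Linux host).
if __name__ == "__main__" and SYSFS_CLOCKSOURCE.exists():
    current, available = read_clocksources()
    print(f"current={current} available={' '.join(available)}")
```

A runtime switch like this does not survive a reboot. To make it persistent, the usual approach is to add `clocksource=hpet` to the kernel command line.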

I've produced loopstats & peerstats graphs from NTP corresponding to the dates in the pastebin above and placed them at https://people.canonical.com/~paulgear/ntp/.
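
To put a number on "very large clock offsets" without the graphs, a rough sketch of summarising an ntpd loopstats file follows. It assumes the usual ntpd loopstats layout (MJD day, seconds past midnight, offset in seconds, frequency offset in ppm, jitter in seconds, then further fields that are ignored here). The names `LoopSample`, `parse_loopstats` and `worst_offset_ms` are hypothetical, and the sample lines are made up for illustration; they are not taken from the affected systems.

```python
from typing import Iterable, List, NamedTuple


class LoopSample(NamedTuple):
    mjd: int          # modified Julian day
    seconds: float    # seconds past midnight UTC
    offset_s: float   # clock offset in seconds
    freq_ppm: float   # frequency offset in parts per million
    jitter_s: float   # RMS jitter in seconds


def parse_loopstats(lines: Iterable[str]) -> List[LoopSample]:
    """Parse loopstats lines, skipping blank or truncated ones."""
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        samples.append(LoopSample(int(fields[0]), float(fields[1]),
                                  float(fields[2]), float(fields[3]),
                                  float(fields[4])))
    return samples


def worst_offset_ms(samples: Iterable[LoopSample]) -> float:
    """Largest absolute offset in milliseconds (0.0 if there are no samples)."""
    return max((abs(s.offset_s) for s in samples), default=0.0) * 1000.0


# Illustrative, invented data in loopstats layout.
SAMPLE = [
    "58206 100.000 0.000012000 -3.500 0.000004000 0.001000 6",
    "58206 164.000 -0.004500000 -3.900 0.000800000 0.010000 6",
]

if __name__ == "__main__":
    print(f"worst offset: {worst_offset_ms(parse_loopstats(SAMPLE)):.3f} ms")
```

Running the same summary over loopstats files captured under each kernel would give a direct comparison of the TSC and HPET cases.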

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.13.0-37-generic 4.13.0-37.42
ProcVersionSignature: Ubuntu 4.13.0-37.42~16.04.1-generic 4.13.13
Uname: Linux 4.13.0-37-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 27 18:26 seq
 crw-rw---- 1 root audio 116, 33 Mar 27 18:26 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.27
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CurrentDmesg:
 [ 6280.259121] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
 [10463.378558] perf: interrupt took too long (3133 > 3131), lowering kernel.perf_event_max_sample_rate to 63750
 [32314.949747] perf: interrupt took too long (4000 > 3916), lowering kernel.perf_event_max_sample_rate to 50000
 [129804.100274] clocksource: Switched to clocksource hpet
 [132747.312089] perf: interrupt took too long (5004 > 5000), lowering kernel.perf_event_max_sample_rate to 39750
Date: Thu Mar 29 07:45:22 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 003 Device 002: ID 0bda:0329 Realtek Semiconductor Corp.
 Bus 003 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 002 Device 002: ID 0424:2660 Standard Microsystems Corp. Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: HPE ProLiant DL380 Gen10
PciMultimedia:

ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-37-generic root=UUID=b33fdcbd-a949-41a0-86d2-03d0c6808284 ro console=tty0 console=ttyS0,115200
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-37-generic N/A
 linux-backports-modules-4.13.0-37-generic N/A
 linux-firmware 1.127.24
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:

dmi.bios.date: 02/15/2018
dmi.bios.vendor: HPE
dmi.bios.version: U30
dmi.board.name: ProLiant DL380 Gen10
dmi.board.vendor: HPE
dmi.chassis.type: 23
dmi.chassis.vendor: HPE
dmi.modalias: dmi:bvnHPE:bvrU30:bd02/15/2018:svnHPE:pnProLiantDL380Gen10:pvr:rvnHPE:rnProLiantDL380Gen10:rvr:cvnHPE:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL380 Gen10
dmi.sys.vendor: HPE

Paul Gear (paulgear) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Paul Gear (paulgear) on 2018-03-29
tags: added: canonical-is
removed: artful

Can you see if this bug also exists in the latest Bionic kernel:
https://launchpad.net/ubuntu/+source/linux/4.15.0-13.14/+build/14466170

Changed in linux (Ubuntu Artful):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Artful):
importance: Undecided → High
tags: added: kernel-key
tags: added: kernel-da-key
removed: kernel-key
Paul Gear (paulgear) wrote :

I've also seen this issue with a Dell R740 server, Xeon Gold 6132 CPU @ 2.60GHz, kernel 4.13.0-37-generic. Switching to hpet clocksource mitigated it successfully.

Unfortunately I still haven't had the opportunity to test this on 4.15.

summary: - Very inaccurate TSC clocksource on HP ProLiant DL380 Gen10 with kernel
- 4.13
+ Very inaccurate TSC clocksource with kernel 4.13 on selected CPUs

This bug was nominated against a series that is no longer supported, i.e. artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status: Triaged → Won't Fix
Paul Gear (paulgear) wrote :

Using an HP ProLiant DL380 Gen10 (Intel Xeon Gold 5118 CPU @ 2.30GHz) I tested the following kernels (on xenial):

4.13.0-37-generic - FAIL https://pastebin.canonical.com/p/RbTzC3jgjm/
4.13.0-39-generic - PASS
4.13.0-45-generic - PASS
4.15.0-29-generic - PASS

So it seems this was fixed in 4.13.0-38 or 4.13.0-39.

tags: added: canonical-bootstack
Andrea Ieri (aieri) wrote :

Just as another data point:

Dell R740 with Xeon Gold 6132 on xenial:

4.13.0-36-generic - FAIL
4.13.0-37-generic - FAIL
4.13.0-45-generic - PASS
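
The results reported in this thread can be captured in a small check against `uname -r` output, sketched below. The table holds only the kernels tested in the comments above. Anything outside it is classified by an assumption of mine (4.13.0 ABIs up to 37 are likely affected, 39 and later are likely fixed, and 38 is unknown because nobody tested it). The function and constant names are hypothetical.

```python
import platform
import re

# Results exactly as reported in this bug (kernel release prefix -> outcome).
REPORTED = {
    "4.13.0-36": "FAIL",
    "4.13.0-37": "FAIL",
    "4.13.0-39": "PASS",
    "4.13.0-45": "PASS",
    "4.15.0-29": "PASS",
}


def tsc_drift_status(release: str) -> str:
    """Classify a kernel release string such as '4.13.0-37-generic'."""
    m = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if not m:
        return "unknown"
    base, abi = m.group(1), int(m.group(2))
    reported = REPORTED.get(f"{base}-{abi}")
    if reported is not None:
        return f"reported {reported}"
    # Everything below is an extrapolation, not a tested result.
    if base == "4.13.0":
        if abi <= 37:
            return "likely affected (assumption)"
        if abi >= 39:
            return "likely fixed (assumption)"
    return "unknown"


if __name__ == "__main__":
    release = platform.release()
    print(f"{release}: {tsc_drift_status(release)}")
```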
