Very inaccurate TSC clocksource with kernel 4.13 on selected CPUs

Bug #1759787 reported by Paul Gear on 2018-03-29
This bug affects 3 people
Affects: linux (Ubuntu); Importance: High; Assigned to: Unassigned
Affects: linux (Ubuntu Artful); Importance: High; Assigned to: Unassigned

Bug Description

On kernel 4.13.0-37-generic, HP ProLiant DL380 Gen10 systems have been observed with very large clock offsets, as measured by NTP. Over the past few days on one of our production systems, we've used 3 different kernels: https://pastebin.ubuntu.com/p/nDkkgRqdtv/

All of these kernels default to the TSC clocksource, which is supposed to be very reliable on Skylake-X CPUs. On 4.4 (linux-image-generic-lts-xenial) timekeeping works as expected; on 3.13 (trusty default kernel) it is slightly worse; and on 4.13 (linux-image-generic-hwe-16.04) it is much worse. Today I switched 4.13 from the TSC clocksource to the HPET clocksource, and the offsets improved dramatically.
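For anyone wanting to reproduce the clocksource switch, here is a minimal sketch in Python. It assumes the standard Linux sysfs location for clocksource controls; the function names are illustrative and not taken from this report. The `base` parameter exists only so the logic can be exercised against a scratch directory instead of the live system.

```python
from pathlib import Path

# Standard sysfs location of the kernel clocksource controls on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def switch_clocksource(name, base=SYSFS_DIR):
    """Select a clocksource by name and return the one now in effect.

    Writing to the real sysfs file requires root privileges.
    """
    current, available = read_clocksources(base)
    if name not in available:
        raise ValueError(f"{name!r} is not offered by the kernel: {available}")
    if name != current:
        (Path(base) / "current_clocksource").write_text(name + "\n")
    return read_clocksources(base)[0]
```

On a real system, `switch_clocksource("hpet")` run as root should produce the same "Switched to clocksource hpet" message in dmesg. A change made this way lasts only until the next reboot; the `clocksource=hpet` kernel boot parameter is the usual way to make it persistent.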

I've produced loopstats & peerstats graphs from NTP corresponding to the dates in the pastebin above and placed them at https://people.canonical.com/~paulgear/ntp/.
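To compare kernels numerically rather than by eye, the loopstats files can be summarized with a short script. This is a rough sketch, assuming the usual ntpd loopstats layout of seven whitespace-separated fields (MJD, seconds past midnight, offset in seconds, frequency in PPM, jitter in seconds, wander in PPM, time constant). The sample lines are made up for illustration and are not data from this system.

```python
from typing import NamedTuple


class LoopSample(NamedTuple):
    mjd: int           # Modified Julian Day
    second: float      # seconds past midnight (UTC)
    offset_s: float    # clock offset in seconds
    freq_ppm: float    # frequency correction in PPM
    jitter_s: float    # RMS jitter in seconds
    wander_ppm: float  # frequency wander in PPM
    tc: int            # loop time constant (log2 seconds)


def parse_loopstats(lines):
    """Parse ntpd loopstats lines, skipping blank or malformed ones."""
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue
        try:
            samples.append(LoopSample(
                int(fields[0]), float(fields[1]), float(fields[2]),
                float(fields[3]), float(fields[4]), float(fields[5]),
                int(fields[6]),
            ))
        except ValueError:
            continue
    return samples


def summarize(samples):
    """Return (worst absolute offset in ms, frequency range in PPM)."""
    if not samples:
        raise ValueError("no samples to summarize")
    worst_ms = max(abs(s.offset_s) for s in samples) * 1000.0
    freqs = [s.freq_ppm for s in samples]
    return worst_ms, max(freqs) - min(freqs)


# Made-up sample lines for illustration only; not data from this report.
SAMPLE = """\
58206 100.000 0.000120000 12.500 0.000050000 0.010000 6
58206 164.000 -0.004500000 14.000 0.000900000 0.250000 6
58206 228.000 0.002000000 13.000 0.000400000 0.120000 6
"""
```

Running `summarize(parse_loopstats(open(path)))` on one loopstats file per kernel gives a worst-case offset and a frequency spread that can be compared directly across 3.13, 4.4, and 4.13.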

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-4.13.0-37-generic 4.13.0-37.42
ProcVersionSignature: Ubuntu 4.13.0-37.42~16.04.1-generic 4.13.13
Uname: Linux 4.13.0-37-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 27 18:26 seq
 crw-rw---- 1 root audio 116, 33 Mar 27 18:26 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.27
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CurrentDmesg:
 [ 6280.259121] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
 [10463.378558] perf: interrupt took too long (3133 > 3131), lowering kernel.perf_event_max_sample_rate to 63750
 [32314.949747] perf: interrupt took too long (4000 > 3916), lowering kernel.perf_event_max_sample_rate to 50000
 [129804.100274] clocksource: Switched to clocksource hpet
 [132747.312089] perf: interrupt took too long (5004 > 5000), lowering kernel.perf_event_max_sample_rate to 39750
Date: Thu Mar 29 07:45:22 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 003 Device 002: ID 0bda:0329 Realtek Semiconductor Corp.
 Bus 003 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 002 Device 002: ID 0424:2660 Standard Microsystems Corp. Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: HPE ProLiant DL380 Gen10
PciMultimedia:

ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-37-generic root=UUID=b33fdcbd-a949-41a0-86d2-03d0c6808284 ro console=tty0 console=ttyS0,115200
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-37-generic N/A
 linux-backports-modules-4.13.0-37-generic N/A
 linux-firmware 1.127.24
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
WifiSyslog:

dmi.bios.date: 02/15/2018
dmi.bios.vendor: HPE
dmi.bios.version: U30
dmi.board.name: ProLiant DL380 Gen10
dmi.board.vendor: HPE
dmi.chassis.type: 23
dmi.chassis.vendor: HPE
dmi.modalias: dmi:bvnHPE:bvrU30:bd02/15/2018:svnHPE:pnProLiantDL380Gen10:pvr:rvnHPE:rnProLiantDL380Gen10:rvr:cvnHPE:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL380 Gen10
dmi.sys.vendor: HPE

Paul Gear (paulgear) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Paul Gear (paulgear) on 2018-03-29
tags: added: canonical-is
removed: artful

Can you see if this bug also exists in the latest Bionic kernel:
https://launchpad.net/ubuntu/+source/linux/4.15.0-13.14/+build/14466170

Changed in linux (Ubuntu Artful):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
Changed in linux (Ubuntu Artful):
importance: Undecided → High
tags: added: kernel-key
tags: added: kernel-da-key
removed: kernel-key
Paul Gear (paulgear) wrote :

I've also seen this issue on a Dell R740 server with a Xeon Gold 6132 CPU @ 2.60GHz, running kernel 4.13.0-37-generic. Switching to the hpet clocksource mitigated it successfully.

Unfortunately I still haven't had the opportunity to test this on 4.15.

summary:
- Very inaccurate TSC clocksource on HP ProLiant DL380 Gen10 with kernel 4.13
+ Very inaccurate TSC clocksource with kernel 4.13 on selected CPUs