Ubuntu
linux package

sched clock overflows in 208 days (i386 and amd64)

Bug #805341 reported by minoura on 2011-07-04

This bug affects 13 people

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	Undecided	Chris J Arges
Lucid	Fix Released	Undecided	Chris J Arges
Maverick	Fix Released	Undecided	Chris J Arges
Natty	Fix Released	Undecided	Chris J Arges
Precise	Fix Released	Undecided	Chris J Arges

Bug Description

arch/x86/include/asm/timer.h:
#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */

static inline unsigned long long __cycles_2_ns(unsigned long long cyc)
{
int cpu = smp_processor_id();
unsigned long long ns = per_cpu(cyc2ns_offset, cpu);
ns += cyc * per_cpu(cyc2ns, cpu) >> CYC2NS_SCALE_FACTOR; // !!!
return ns;
}

Apparently he or she was not very 'careful' to chose CYC2NS_SCALE_FACTOR.

cyc*cyc2ns overflows after (0xffffffffffffffff>>CYC2NS_SCALE_FACTOR)/10^9/60/60/24=208.5 days after
TSC reset. sched_clock should not wrap virtually forever (584years) and many kernel subsystems assume
it; for example the softlockup detector is confused by this;

[18446743993.431771] BUG: soft lockup - CPU#1 stuck for 17163091968s! [???:XXXXX]

Of course the time stamp (18446743993.431771) and 17163091968s (544 years!) are false.

Other problems caused by this overflow include task scheduler unfairness. softlockups ('stuck for 61s!') are
observed in 210~220 days after reboot but I've not yet succeeded to explain the logic.

I believe the problem is not solved in the latest 2.6.32-32.62 kernel packages (from code inspection).

Thanks.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-22-server 2.6.32-22.36
Regression: No
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-22.36-server 2.6.32.11+drm33.2
Uname: Linux 2.6.32-22-server x86_64
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: /dev/snd/にアクセスできません: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
Date: Mon Jul 4 15:31:06 2011
InstallationMedia: Ubuntu-Server 10.04.1 LTS "Lucid Lynx" - Release amd64 (20100816.2)
Lsusb:
Bus 002 Device 003: ID 0000:0000
Bus 002 Device 002: ID 8087:0020
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 8087:0020
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: HP ProLiant DL120 G6
PciMultimedia:

ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server root=UUID=67bbc9ef-18cb-46c3-af61-c6d2e0645c07 ro console=tty console=ttyS0,9600n8 apic=debug crashkernel=384M-2G:64M,2G-:128M quiet
ProcEnviron:
LANG=C
SHELL=/bin/bash
SourcePackage: linux
dmi.bios.date: 09/06/2010
dmi.bios.vendor: HP
dmi.bios.version: O26
dmi.board.name: ProLiant DL120 G6
dmi.board.vendor: Wistron Corporation
dmi.chassis.asset.tag: quest
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnHP:bvrO26:bd09/06/2010:svnHP:pnProLiantDL120G6:pvr:rvnWistronCorporation:rnProLiantDL120G6:rvr:cvnHP:ct23:cvrN/A:
dmi.product.name: ProLiant DL120 G6
dmi.sys.vendor: HP

Tags:

Related branches

lp:ubuntu/lucid-proposed/linux-lts-backport-natty

lp:ubuntu/lucid-proposed/linux-lts-backport-maverick

CVE References

Revision history for this message

minoura (minoura-valinux) wrote on 2011-07-04:

BootDmesg.txt Edit (62.2 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt Edit (1.2 KiB, text/plain; charset="utf-8")
Dependencies.txt Edit (1.3 KiB, text/plain; charset="utf-8")
Lspci.txt Edit (11.2 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt Edit (6.4 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt Edit (3.0 KiB, text/plain; charset="utf-8")
ProcModules.txt Edit (1.7 KiB, text/plain; charset="utf-8")
UdevDb.txt Edit (77.1 KiB, text/plain; charset="utf-8")
UdevLog.txt Edit (168.9 KiB, text/plain; charset="utf-8")

Revision history for this message

minoura (minoura-valinux) wrote on 2011-07-04:

> sched_clock should not wrap virtually forever

Actually it is not wrap but jump from 208.5 days to 584 years. cyc2ns_offset is around -100 secs.

Revision history for this message

Julian Wiedmann (jwiedmann) wrote on 2011-07-04:

This is probably related to http://thread.gmane.org/gmane.linux.kernel/1132515

Revision history for this message

minoura (minoura-valinux) wrote on 2011-07-11:

Thanks. Both troubles we are having seems mentioned.
Will checking that thread.

Revision history for this message

minoura (minoura-valinux) wrote on 2011-07-12:

Crash message Edit (97.9 KiB, image/jpeg)

Here's the crash message of 61s soft lockup case.

Brad Figg (brad-figg) on 2011-07-18

Changed in linux (Ubuntu):
status:	New → Confirmed

Revision history for this message

Tuomas Salo (naapuri) wrote on 2011-12-09:

I've confirmed this problem with several 10.04 LTS (amd64) machines VM's running on a VMware vSphere system. Each machine that had an uptime of 211 days suffered a sudden softlockup, triggered by an innocent-looking 'apt-get install libyaml-perl'. The machines were running 2.6.32-30-server.

Some consoles showed the "stuck for 17163091968s" message, others just these "stuck for 61s" and "task ... blocked for more than 120 seconds". I've also seen a variant with the value 17163091988s.

Other machines that were quite identical, but running either a bit older kernel (2.6.32-26-server, uptime 365 days) or a newer one (2.6.32-33-server or newer, uptime less than 115 days) were not affected by this issue.

Is there a way to reproduce the error? Maybe a way to set the problematic clock value to somewhere near the 210 day limit?

Revision history for this message

Tuomas Salo (naapuri) wrote on 2011-12-09:

(I'm a bit puzzled by why the day limit seems to be approximate. Could it have something to do with clock drifting and adjustments by eg. ntp?)

Revision history for this message

Tuomas Salo (naapuri) wrote on 2011-12-09:

Sorry to flood, just a minor point: the OP wrote: "Of course the time stamp (18446743993.431771) and 17163091968s (544 years!) are false." But actually the first timestamp is not false - it reveals that the lockup has occurred after 213,5 days of uptime. (18446743993/60/60/24/1000).

Revision history for this message

Fumihito YOSHIDA (hito) wrote on 2012-01-06:

This problem has been fixed at mainline/longterm-2.6.32.50 ?

https://www.kernel.org/pub/linux/kernel/v2.6/longterm/ChangeLog-2.6.32.50
| commit e52d3fb7c83d6cb569f343a3db9b45decc801f59
| Author: Salman Qazi <email address hidden>
| Date: Tue Nov 15 14:12:06 2011 -0800
|
| sched, x86: Avoid unnecessary overflow in sched_clock
|
| commit 4cecf6d401a01d054afc1e5f605bcbfe553cb9b9 upstream.
|
| In hundreds of days, the __cycles_2_ns calculation in sched_clock
| has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
| the final value to become zero. We can solve this without losing
| any precision.
|
| We can decompose TSC into quotient and remainder of division by the
| scale factor, and then use this to convert TSC into nanoseconds.

Revision history for this message

Chris J Arges (arges) wrote on 2012-01-11:

#10

Here are the branches in lucid that contain 4cecf6d401a01d054afc1e5f605bcbfe553cb9b9:

ubuntu-lucid/master / Ubuntu-2.6.32-38.83 / YES
ubuntu-natty/master / Ubuntu-2.6.38-13.54 / NO
ubuntu-maverick/master / Ubuntu-2.6.35-32.64 / NO

Looks like we need to apply the SRU to natty/maverick.
It is in oneiric.

Changed in linux (Ubuntu):
assignee:	nobody → Chris J Arges (christopherarges)
status:	Confirmed → In Progress

Chris J Arges (arges) on 2012-01-11

tags:

added: kernel-da-key maverick natty
removed: lucid needs-upstream-testing

Revision history for this message

Chris J Arges (arges) wrote on 2012-01-11:

#11

0001-sched-x86-Avoid-unnecessary-overflow-in-sched_clock.patch Edit (2.6 KiB, text/plain)

Cherry picked from commit 4cecf6d401a01d054afc1e5f605bcbfe553cb9b9:
Applies cleanly to Maverick and Natty

Ubuntu Foundations Team Bug Bot (crichton) on 2012-01-11

tags:

added: patch

Tim Gardner (timg-tpi) on 2012-01-12

Changed in linux (Ubuntu Precise):
status:	In Progress → Fix Released
Changed in linux (Ubuntu Natty):
status:	New → Fix Committed
Changed in linux (Ubuntu Maverick):
status:	New → Fix Committed

Chris J Arges (arges) on 2012-01-12

Changed in linux (Ubuntu Maverick):
assignee:	nobody → Chris J Arges (christopherarges)
Changed in linux (Ubuntu Natty):
assignee:	nobody → Chris J Arges (christopherarges)

Revision history for this message

Herton R. Krzesinski (herton) wrote on 2012-01-27:

#12

Given the time it takes to test this, and with no known or possible way to reproduce the problem faster, it isn't possible to do the verification process in 1 week. I'm marking verification done for both maverick/natty.

tags:

added: verification-done-maverick verification-done-natty

Revision history for this message

Launchpad Janitor (janitor) wrote on 2012-02-10:

#13

This bug was fixed in the package linux - 2.6.35-32.65

---------------
linux (2.6.35-32.65) maverick-proposed; urgency=low

[Brad Figg]

* Release Tracking Bug
- LP: #920677

[ Upstream Kernel Changes ]

  * fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message, CVE-2011-3353
    - LP: #905058
    - CVE-2011-3353
  * KVM: x86: Prevent starting PIT timers in the absence of irqchip support
    - LP: #911303
    - CVE-2011-4622
  * sched, x86: Avoid unnecessary overflow in sched_clock
    - LP: #805341
  * use cache type functions for arch_get_unmapped_area
    - LP: #861296
  * topdown mmap support
    - LP: #861296
  * xfs: validate acl count
    - LP: #917706
    - CVE-2012-0038
  * xfs: fix acl count validation in xfs_acl_from_disk()
    - LP: #917706
    - CVE-2012-0038
  * drm: integer overflow in drm_mode_dirtyfb_ioctl()
    - LP: #917838
    - CVE-2012-0044
  * x86/PCI: amd: factor out MMCONFIG discovery
    - LP: #647043
  * PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that breaks USB
    - LP: #647043
-- Brad Figg <email address hidden> Mon, 23 Jan 2012 13:40:13 -0800

Changed in linux (Ubuntu Maverick):
status:	Fix Committed → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2012-02-10:

#14

This bug was fixed in the package linux - 2.6.38-13.55

---------------
linux (2.6.38-13.55) natty-proposed; urgency=low

[Brad Figg]

* Release Tracking Bug
- LP: #920790

[ Upstream Kernel Changes ]

  * fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message, CVE-2011-3353
    - LP: #905058
    - CVE-2011-3353
  * KVM: x86: Prevent starting PIT timers in the absence of irqchip support
    - LP: #911303
    - CVE-2011-4622
  * sched, x86: Avoid unnecessary overflow in sched_clock
    - LP: #805341
  * use cache type functions for arch_get_unmapped_area
    - LP: #861296
  * topdown mmap support
    - LP: #861296
  * xfs: validate acl count
    - LP: #917706
    - CVE-2012-0038
  * xfs: fix acl count validation in xfs_acl_from_disk()
    - LP: #917706
    - CVE-2012-0038
  * drm: integer overflow in drm_mode_dirtyfb_ioctl()
    - LP: #917838
    - CVE-2012-0044
  * x86/PCI: amd: factor out MMCONFIG discovery
    - LP: #647043
  * PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that
    - LP: #647043
-- Brad Figg <email address hidden> Mon, 23 Jan 2012 18:31:33 -0800

Changed in linux (Ubuntu Natty):
status:	Fix Committed → Fix Released

Revision history for this message

jaseywang (jaseywang) wrote on 2012-04-29:

#15

When will Lucid 2.6.32 be fixed ?

Revision history for this message

Steve Fox (drfickle2) wrote on 2012-04-30:

#16

I have had this occur on a couple production machines. One was 208 days, the other 209 days. Please merge to Lucid.

Revision history for this message

joe williams (joetify) wrote on 2012-05-11:

#17

Same here, please patch Lucid!

Chris J Arges (arges) on 2012-05-11

Changed in linux (Ubuntu Lucid):
assignee:	nobody → Chris J Arges (christopherarges)
status:	New → In Progress

Revision history for this message

Chris J Arges (arges) wrote on 2012-05-11:

#18

@ Steve Fox / Jasey Wang

This was fixed in Lucid already in: 2.6.32.50
According to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/902317
I've also inspected the git log and see this patch is present in the tree.

Changed in linux (Ubuntu Lucid):
status:	In Progress → Fix Released

Revision history for this message

Steve Fox (drfickle2) wrote on 2012-05-11:

#19

The report Chris linked to shows the -50 update went out on 2012-01-24, but linux-image-2.6.32-41-server seems to be the latest available on the mirrors (according to aptitude search linux-image-2.6.32). Sorry to pester, but how does one find the -50 update? Thank you.

Revision history for this message

Steve Fox (drfickle2) wrote on 2012-05-11:

#20

My mistake. A closer read of Chris' link shows the 'linux-image-2.6.32-38-server' package has the fix. Sorry for the spam and thanks for all the fish.

Revision history for this message

Chris J Arges (arges) wrote on 2012-05-11:

#21

@ Steve Fox
No problem! Better safe than sorry.
--chris

Revision history for this message

jaseywang (jaseywang) wrote on 2012-05-12:

#22

@Chris J Arges
you mean both the 2.6.32-50 and the 2.6.32-38.83 has the bug fixed ?

Revision history for this message

Steve Fox (drfickle2) wrote on 2012-05-12:

#23

@Jasey The 2.6.32.50 version is the one from the upstream kernel (see https://www.kernel.org/pub/linux/kernel/v2.6/longterm/v2.6.32/ChangeLog-2.6.32.50). That kernel is the version that Ubuntu's 2.6.32-38.83 kernel is based on (everything after the dash, -, is Ubuntu's versioning). I think this type of confusion was part of the reason that Linus simplified the version numbering after 3.0.

Revision history for this message

jaseywang (jaseywang) wrote on 2012-05-13:

#24

Really thanks for your patience :-)

Revision history for this message

lijianbing (galikeleli) wrote on 2013-04-26:

#25

When will Oneiric 3.2.0-030200rc4 (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2-rc4-oneiric/) be fixed ?

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #914784

You are

Subscribing...

Edit bug mail

Other bug subscribers

Patches

0001-sched-x86-Avoid-unnecessary-overflow-in-sched_clock.patch Edit

Add patch

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntulinux package

sched clock overflows in 208 days (i386 and amd64)

Bug Description

Related branches

CVE References

Duplicates of this bug

Other bug subscribers

Patches

Bug attachments

Remote bug watches

Ubuntu
linux package