TSC Clocksource Unstable Switches To acpi_pm But Server clock freezes/becomes unusable
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | New | Undecided | Unassigned |
Bug Description
[Note: I had updated bug 190414, but it was closed]
Hi There,
I'm having this exact problem on Hardy Server, but this is fairly critical for me. The same error is logged:
Sep 20 10:22:25 host-01 kernel: [51281.289424] Clocksource tsc unstable (delta = 3323063740502 ns)
Sep 20 10:22:25 host-01 kernel: [51281.299403] Time: acpi_pm clocksource has been installed.
Sep 20 10:22:26 host-01 kernel: [51282.778316] NET: Registered protocol family 17
The problem here is that once acpi_pm is installed the clock stays at that time: ntp seems either to be unable to update the time, or the clock constantly loses time and ntp manages to "hold" it at the current time (Sep 20 10:22:26 in this case). This is a real showstopper as it causes all sorts of problems: Cacti graphs stop logging, as time is effectively frozen as far as Cacti is concerned, and nagios/cron have problems scheduling things. The servers (I have two that do this), both running 2.6.24-19-server, become highly unstable: ssh logins fail and the system becomes highly unresponsive. I haven't pinned down exactly when this first occurred, but it has basically rendered Hardy Server useless. It seems to have happened in the last couple of weeks, but I'll be digging through the old logs to see... the problem being that the logs are pretty useless with the time being completely off!
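For scale, the delta in the kernel message above is in nanoseconds and works out to roughly 55 minutes. The following is only a rough sketch for pulling that figure out of a syslog line; the regular expression is written against the message as quoted in this report, and the function names are made up for illustration.

```python
import re

# Pattern for the kernel message quoted in the report, e.g.
# "Clocksource tsc unstable (delta = 3323063740502 ns)"
UNSTABLE_RE = re.compile(r"Clocksource (\w+) unstable \(delta = (-?\d+) ns\)")


def parse_unstable(line):
    """Return (clocksource, delta_seconds), or None if the line does not match."""
    match = UNSTABLE_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)) / 1e9


def describe(seconds):
    """Format a duration in seconds as whole minutes and seconds."""
    minutes, secs = divmod(int(abs(seconds)), 60)
    return f"{minutes} min {secs} s"


if __name__ == "__main__":
    line = ("Sep 20 10:22:25 host-01 kernel: [51281.289424] "
            "Clocksource tsc unstable (delta = 3323063740502 ns)")
    name, delta = parse_unstable(line)
    print(name, describe(delta))  # prints "tsc 55 min 23 s"
```

Running the same helper over a whole kern.log would show whether the deltas grow over time or jump all at once.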
As regards changing clocksource the ones available are:
sudo cat /sys/devices/
acpi_pm jiffies tsc
I'm trying jiffies as acpi_pm and tsc appear useless. The processor is an Intel(R) Xeon(TM) CPU 2.40GHz and it's a Dell PowerEdge 2800 server. A 3rd server, which is an HP ProLiant, hasn't shown the same error, and it does have the hpet clocksource.
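For anyone comparing machines, here is a small sketch that reads the current and available clocksources and reports whether hpet is offered. The path in the command above is cut off, so the directory used here is the usual kernel sysfs location and is an assumption about what was actually run; the function names are also just for illustration.

```python
from pathlib import Path

# Assumed location: the path in the report is truncated, so this is the usual
# kernel sysfs directory, not a confirmed copy of the command that was run.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from base_dir."""
    base = Path(base_dir)
    available = (base / "available_clocksource").read_text().split()
    current = (base / "current_clocksource").read_text().strip()
    return current, available


def has_hpet(available):
    """True if hpet is among the offered clocksources."""
    return "hpet" in available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", " ".join(available))
    print("hpet offered:", has_hpet(available))
```

On the Dell described here the available list would be the three names shown above, so the hpet check would come back false.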
OK, well, jiffies did not work. In fact, when I switched ntp back on and ran date every 5 secs or so I got this:
user@host-01:~$ date
Sat Sep 20 10:22:51 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
where the actual time should have been 06:00 Hrs +
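The stall in the date output above can also be picked out mechanically. This is a minimal sketch, assuming the samples were taken a few seconds apart as described, so that two identical timestamps in a row mean the clock did not advance. The helper names are hypothetical, and the time zone token is dropped before parsing because Python's `%Z` only accepts a limited set of zone names.

```python
from datetime import datetime

# Layout of `date` output once the time zone token has been removed.
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_date_output(text):
    """Parse `date` output such as 'Mon Sep 22 05:40:35 BST 2008'."""
    parts = text.split()
    # Drop the fifth token (the zone name); all samples share the same zone.
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, DATE_FORMAT)


def find_stalls(samples):
    """Return indexes of samples whose time did not advance past the previous one."""
    times = [parse_date_output(s) for s in samples]
    return [i for i in range(1, len(times)) if times[i] <= times[i - 1]]


if __name__ == "__main__":
    samples = [
        "Sat Sep 20 10:22:51 BST 2008",
        "Mon Sep 22 05:40:35 BST 2008",
        "Mon Sep 22 05:40:35 BST 2008",
        "Mon Sep 22 05:40:35 BST 2008",
        "Mon Sep 22 05:40:35 BST 2008",
    ]
    print(find_stalls(samples))  # prints [2, 3, 4]
```

The first jump (Sep 20 to Sep 22) is ntp stepping the clock forward; the flagged samples are the ones where the clock then sat still.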
This also seems to lock up the machine. I've tried a remote reboot, and while the ssh terminal was fairly responsive, some commands didn't seem to complete and had to be killed with Ctrl-C. This is exactly what happened on the tsc/acpi_pm clocksource as well.
I've also read that forcing the CPU to stay at full speed can stop it, which would be an OK temporary solution for me as it's more important that the server works, but it seems that scaling isn't available in Server? Either that or it's handled differently from the scaling governors like it was before:
user@host-01:~$ sudo ls -l /sys/devices/
total 0
-r-------- 1 root root 4096 2008-09-22 11:43 crash_notes
drwxr-xr-x 2 root root 0 2008-09-22 11:42 topology
user@host-01:~$ sudo ls -l /sys/devices/
total 0
-r--r--r-- 1 root root 4096 2008-09-22 11:43 core_id
-r--r--r-- 1 root root 4096 2008-09-22 11:42 core_siblings
-r--r--r-- 1 root root 4096 2008-09-22 11:43 physical_package_id
-r--r--r-- 1 root root 4096 2008-09-22 11:43 thread_siblings
I've set clocksource=acpi_pm at boot to see if starting out on it rather than switching from TSC solves the issue, I'll update as soon as I have info. Just to add this never occurs right after or during boot, it can take several hours to occur. I haven't spotted a pattern yet but I'll keep my eyes open.
Attached Dmesg with acpi_pm enabled in grub.
Despite acpi_pm enabled at boot it still seems to use TSC (Have to pardon my not understanding the inner workings of this) as I get the following:
Sep 22 12:03:57 host-01 kernel: [ 1007.064201] Clocksource tsc unstable (delta = 140599784626 ns)
It doesn't give me the switching to acpi_pm but the clock has already started to wander after being up for only a short while.
PS: Apologies I know I just posted what was in my last comments to 190414 but I didn't want to leave out anything and might help following it through....
Well, I've found the update that caused it: I believe it's the last kernel update that went on. Ever since then I've had this issue; I traced it back through my emails to some of the folk onsite asking for a restart after this date.
Log complete.
Aptitude 0.4.9: log report
Tue, Aug 26 2008 14:33:54 +0100
IMPORTANT: this log only lists intended actions; actions which fail due to
dpkg problems may not be completed.
Will install 14 packages, and remove 0 packages.
57.3kB of disk space will be used
=======
[UPGRADE] initramfs-tools 0.85eubuntu39.1 -> 0.85eubuntu39.2
[UPGRADE] iproute 20071016-2ubuntu1 -> 20071016-2ubuntu2
[UPGRADE] libglib2.0-0 2.16.3-1ubuntu3 -> 2.16.4-0ubuntu2
[UPGRADE] libglib2.0-data 2.16.3-1ubuntu3 -> 2.16.4-0ubuntu2
[UPGRADE] libldap-2.4-2 2.4.9-0ubuntu0.8.04 -> 2.4.9-0ubuntu0.8.04.1
[UPGRADE] linux-image-2.6.24-19-server 2.6.24-19.36 -> 2.6.24-19.41
[UPGRADE] linux-libc-dev 2.6.24-19.36 -> 2.6.24-19.41
[UPGRADE] pciutils 1:2.2.4-1.1ubuntu4 -> 1:2.2.4-1.1ubuntu5
[UPGRADE] procps 1:3.2.7-5ubuntu2 -> 1:3.2.7-5ubuntu3
[UPGRADE] python2.5 2.5.2-2ubuntu4 -> 2.5.2-2ubuntu4.1
[UPGRADE] python2.5-minimal 2.5.2-2ubuntu4 -> 2.5.2-2ubuntu4.1
[UPGRADE] tzdata 2008c-1ubuntu0.8.04 -> 2008e-1ubuntu0.8.04
[UPGRADE] ufw 0.16.2.2 -> 0.16.2.3
[UPGRADE] update-manager-core 1:0.87.27 -> 1:0.87.30
=======