TSC Clocksource Unstable Switches To acpi_pm But Server clock freezes/becomes unusable

Bug #273313 reported by Félim Whiteley
Affects: linux (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

[Note: I had updated bug 190414, but it was closed.]

Hi There,

I'm having this exact problem on Hardy Server, but this is fairly critical for me. The same error is logged:

Sep 20 10:22:25 host-01 kernel: [51281.289424] Clocksource tsc unstable (delta = 3323063740502 ns)
Sep 20 10:22:25 host-01 kernel: [51281.299403] Time: acpi_pm clocksource has been installed.
Sep 20 10:22:26 host-01 kernel: [51282.778316] NET: Registered protocol family 17
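For scale, the delta in the first log line works out to a little over 55 minutes. A minimal sketch of that conversion in shell arithmetic follows; the function name ns_to_hms is hypothetical, made up here for illustration, and it assumes a shell with 64-bit arithmetic:

```shell
#!/bin/sh
# Hypothetical helper: convert a nanosecond delta, as printed in the
# "Clocksource tsc unstable" kernel message, to hours/minutes/seconds.
ns_to_hms() {
    secs=$(( $1 / 1000000000 ))   # whole seconds; the fraction is dropped
    printf '%dh %dm %ds\n' \
        $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

ns_to_hms 3323063740502   # delta from the log above; prints "0h 55m 23s"
```

So the kernel saw the TSC disagree with its reference clock by nearly an hour, not by a few milliseconds of drift.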

The problem is that once acpi_pm is installed, the clock stays at that time: either ntp is unable to update the time, or the clock constantly loses time and ntp manages to "hold" it at the current time (Sep 20 10:22:26 in this case). This is a real showstopper, as it causes all sorts of problems: Cacti graphs stop logging because time is effectively frozen as far as Cacti is concerned, and Nagios and cron have problems scheduling things. The servers (I have two that do this), both running 2.6.24-19-server, become highly unstable: ssh logins fail and the system becomes highly unresponsive. I haven't pinned down exactly when this first occurred, but it has basically rendered Hardy Server useless. It seems to have started in the last couple of weeks, and I'll be digging through the old logs to see, the problem being that the logs are pretty useless with the time being completely off!

As regards changing clocksource the ones available are:
sudo cat /sys/devices/system/clocksource/clocksource0/available_clocksource
acpi_pm jiffies tsc
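The same sysfs directory also exposes current_clocksource, which shows the active source and accepts a new name written by root. A rough sketch, assuming the standard sysfs layout; the CS_DIR variable and the clocksource_available function are names invented for this example:

```shell
#!/bin/sh
# Directory holding the clocksource controls (same one queried above).
# CS_DIR is a made-up variable name so the path can be overridden.
CS_DIR=${CS_DIR:-/sys/devices/system/clocksource/clocksource0}

# Hypothetical helper: succeed if the named clocksource is listed
# as available. Padding with spaces makes the match whole-word.
clocksource_available() {
    case " $(cat "$CS_DIR/available_clocksource") " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Show what the kernel is using right now, if the file is readable.
if [ -r "$CS_DIR/current_clocksource" ]; then
    echo "current: $(cat "$CS_DIR/current_clocksource")"
fi

# To switch at runtime (as root), write the name to current_clocksource:
#   echo jiffies | sudo tee "$CS_DIR/current_clocksource"
```

A runtime switch like this does not survive a reboot, so it is only useful for quick experiments.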

I'm trying jiffies, as acpi_pm and tsc appear useless. The processor is an Intel(R) Xeon(TM) CPU 2.40GHz in a Dell PowerEdge 2800 server. A third server, an HP ProLiant, hasn't shown the same error, and it does have the hpet clocksource.

OK, jiffies did not work. In fact, when I switched ntp back on and ran date every 5 seconds or so, I got this:

user@host-01:~$ date
Sat Sep 20 10:22:51 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008
user@host-01:~$ date
Mon Sep 22 05:40:35 BST 2008

The actual time should have been past 06:00.
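The repeated date calls above can be automated with a small check that samples the clock twice and compares the readings. This is only a sketch: clock_advanced is a hypothetical name, and on a machine whose timekeeping is already broken the sleep itself may not return as expected.

```shell
#!/bin/sh
# Hypothetical check: given two epoch-second readings taken a few
# seconds apart, succeed only if the wall clock moved forward.
clock_advanced() {
    [ "$2" -gt "$1" ]
}

first=$(date +%s)
sleep 2
second=$(date +%s)

if clock_advanced "$first" "$second"; then
    echo "clock advanced by $(( second - first ))s"
else
    echo "clock did not advance: $first -> $second"
fi
```

Run from cron or a monitoring check, something like this could flag the frozen state before the logs become unreliable.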

This also seems to lock up the machine. I tried a remote reboot, and while the ssh terminal was fairly responsive, some commands didn't seem to complete and had to be interrupted with Ctrl-C. This is exactly what happened with the tsc/acpi_pm clocksource as well.

I've also read that forcing the CPU to stay at full speed can stop it, which would be an acceptable temporary solution for me, as it's more important that the server works. But it seems that scaling isn't available in Server? Either that, or it's handled differently from the scaling governors used before:

user@host-01:~$ sudo ls -l /sys/devices/system/cpu/cpu0/
total 0
-r-------- 1 root root 4096 2008-09-22 11:43 crash_notes
drwxr-xr-x 2 root root 0 2008-09-22 11:42 topology
user@host-01:~$ sudo ls -l /sys/devices/system/cpu/cpu0/topology/
total 0
-r--r--r-- 1 root root 4096 2008-09-22 11:43 core_id
-r--r--r-- 1 root root 4096 2008-09-22 11:42 core_siblings
-r--r--r-- 1 root root 4096 2008-09-22 11:43 physical_package_id
-r--r--r-- 1 root root 4096 2008-09-22 11:43 thread_siblings

I've set clocksource=acpi_pm at boot to see whether starting out on it, rather than switching from TSC, solves the issue; I'll update as soon as I have info. Just to add: this never occurs during or right after boot; it can take several hours to occur. I haven't spotted a pattern yet, but I'll keep my eyes open.

Attached Dmesg with acpi_pm enabled in grub.

Despite acpi_pm being enabled at boot, it still seems to use TSC (pardon my not understanding the inner workings of this), as I get the following:

Sep 22 12:03:57 host-01 kernel: [ 1007.064201] Clocksource tsc unstable (delta = 140599784626 ns)

It doesn't log the switch to acpi_pm, but the clock has already started to wander after being up for only a short while.
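One way to check whether the boot parameter actually reached the kernel is to read it back from /proc/cmdline and compare it with the active clocksource in sysfs. A minimal sketch; cmdline_clocksource is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print the value of clocksource= from a kernel
# command line string, or nothing if the parameter is absent.
cmdline_clocksource() {
    for word in $1; do
        case $word in
            clocksource=*) echo "${word#clocksource=}" ;;
        esac
    done
}

# Compare what was requested at boot with what is active now.
if [ -r /proc/cmdline ]; then
    requested=$(cmdline_clocksource "$(cat /proc/cmdline)")
    echo "requested at boot: ${requested:-none}"
fi
cur=/sys/devices/system/clocksource/clocksource0/current_clocksource
if [ -r "$cur" ]; then
    echo "active now: $(cat "$cur")"
fi
```

If the two lines disagree, the parameter was either not passed by the bootloader or the kernel chose a different source afterwards.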

PS: Apologies, I know I've just reposted my last comments from 190414, but I didn't want to leave anything out, and it might help in following it through.

Revision history for this message
Félim Whiteley (felimwhiteley) wrote :

Well, I've found the update that caused it. I believe it's the last kernel update that went on; ever since then I've had this issue. I traced back my emails to some of the folk on site asking for a restart after this date.

Log complete.
Aptitude 0.4.9: log report
Tue, Aug 26 2008 14:33:54 +0100

IMPORTANT: this log only lists intended actions; actions which fail due to
dpkg problems may not be completed.

Will install 14 packages, and remove 0 packages.
57.3kB of disk space will be used
===============================================================================
[UPGRADE] initramfs-tools 0.85eubuntu39.1 -> 0.85eubuntu39.2
[UPGRADE] iproute 20071016-2ubuntu1 -> 20071016-2ubuntu2
[UPGRADE] libglib2.0-0 2.16.3-1ubuntu3 -> 2.16.4-0ubuntu2
[UPGRADE] libglib2.0-data 2.16.3-1ubuntu3 -> 2.16.4-0ubuntu2
[UPGRADE] libldap-2.4-2 2.4.9-0ubuntu0.8.04 -> 2.4.9-0ubuntu0.8.04.1
[UPGRADE] linux-image-2.6.24-19-server 2.6.24-19.36 -> 2.6.24-19.41
[UPGRADE] linux-libc-dev 2.6.24-19.36 -> 2.6.24-19.41
[UPGRADE] pciutils 1:2.2.4-1.1ubuntu4 -> 1:2.2.4-1.1ubuntu5
[UPGRADE] procps 1:3.2.7-5ubuntu2 -> 1:3.2.7-5ubuntu3
[UPGRADE] python2.5 2.5.2-2ubuntu4 -> 2.5.2-2ubuntu4.1
[UPGRADE] python2.5-minimal 2.5.2-2ubuntu4 -> 2.5.2-2ubuntu4.1
[UPGRADE] tzdata 2008c-1ubuntu0.8.04 -> 2008e-1ubuntu0.8.04
[UPGRADE] ufw 0.16.2.2 -> 0.16.2.3
[UPGRADE] update-manager-core 1:0.87.27 -> 1:0.87.30
===============================================================================

Revision history for this message
Félim Whiteley (felimwhiteley) wrote :

Correction: it appears there was no reboot when I upgraded from 2.6.24.16.18, so the first use of the 2.6.24-19.36 kernel actually appeared to cause an error. I skipped the first -19 upgrade.

Aug 24 11:38:54 host-01 kernel: [ 0.000000] Linux version 2.6.24-19-server (buildd@terranova) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) #1 SMP Sat Jul 12 00:40:01 UTC 2008 (Ubuntu 2.6.24-19.36-server)

Revision history for this message
Félim Whiteley (felimwhiteley) wrote :

OK, I tried switching back to the -16 kernel, and the kernel update is not the cause:

root@host-01:/home/user# uname -a
Linux host-01 2.6.24-16-server #1 SMP Thu Apr 10 13:58:00 UTC 2008 i686 GNU/Linux
root@host-01:/home/user# date
Thu Sep 25 13:09:07 BST 2008

This was run about 5 minutes after entering this bug; it went from tsc to acpi_pm almost immediately after boot. Something else must be causing it, or it's a bug in the whole 2.6.24 kernel; I'm not sure. The time has frozen on that date and time.

Revision history for this message
Félim Whiteley (felimwhiteley) wrote :

Well I managed to run the server for a good week until I got the following:

[1110565.137933] Clocksource tsc unstable (delta = 3271800672992 ns)
[1110565.147908] Time: acpi_pm clocksource has been installed.
[1185505.906355] set_rtc_mmss: can't update from 59 to 0

I thought I'd found the initial problem: this server and one like it run checks against various hosts, and due to a bug in some checking code they had been fighting over hosts, producing lots of failed processes and sending utilisation through the roof. With the other server switched off this box lasted a week, but alas it still failed.

Any ideas? The time on the box is now stuck at 14:40 yesterday (Sunday the 12th of Oct).

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

[This is an automated message. Apologies if it has reached you inappropriately.]

This bug was reported against the linux-meta package when it likely should have been reported against the linux package instead. We are automatically transitioning this to the linux kernel package so that the appropriate teams are notified and made aware of this issue. Thanks.

affects: linux-meta (Ubuntu) → linux (Ubuntu)