odd timing behaviour on panda

Bug #873453 reported by Dr. David Alan Gilbert
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro Ubuntu
Fix Released
Undecided
Unassigned
linaro-landing-team-ti
Fix Released
Low
David Long

Bug Description

I've got a set of benchmarks that use clock_gettime like:

    clock_gettime(CLOCK_REALTIME, &tbefore);

    for(l=0;l<numloops;l++) {
       dostuff
    }

    clock_gettime(CLOCK_REALTIME, &tafter);

    nsdiff=(double)(tafter.tv_nsec - tbefore.tv_nsec);
    nsdiff+=1000000000.0 *(tafter.tv_sec - tbefore.tv_sec);

and I've just reinstalled our local panda to using Linaro 11.09 (kernel 3.0.0-1404-linaro-lt-omap)
and it's starting to get weird timing artifacts.

For example:

smarter_strlen_ldrd: ,102400, loops of ,62, bytes=6.054688 MB, transferred in ,3936768.000000 ns, giving, 1537.984331 MB/s
smarter_strlen_ldrd: ,102400, loops of ,32, bytes=3.125000 MB, transferred in ,0.000000 ns, giving, inf MB/s
smarter_strlen_ldrd: ,102400, loops of ,16, bytes=1.562500 MB, transferred in ,4180909.000000 ns, giving, 373.722557 MB/s

Now there is no way that loop took 0.000000 ns! Running the same binary on an older installation runs fine.

Dave

Tags: panda
Fathi Boudra (fboudra)
affects: linux-linaro-omap (Ubuntu) → ubuntu-leb
affects: ubuntu-leb → linaro-ubuntu
tags: added: panda
Anmar Oueja (anmar)
Changed in linaro-landing-team-ti:
importance: Undecided → Medium
assignee: nobody → warmcat (andy-warmcat)
milestone: none → 2011.11
Revision history for this message
warmcat (andy-warmcat) wrote :

If there is some granularity to the timing it doesn't sound crazy you read the same time twice sometimes. There's a 32kHz clock I know is used for some things, that would have high granularity.

Is this gettimeofday() doing this too?

Changed in linaro-landing-team-ti:
milestone: 2011.11 → none
importance: Medium → Undecided
Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

Hi,
  It's more than you would expect from granularity; I've attached a test (needs -lrt -lm); here's what I'm seeing (varies heavily dependent on run, so try a few times):

Linux panda-01 3.0.0-1404-linaro-lt-omap #8~ppa~natty-Ubuntu SMP PREEMPT Wed Sep 28 17:16:15 UTC 2011 armv7l armv7l armv7l GNU/Linux

daveg@panda-01:~$ ./armtimetest
 CLOCK_REALTIME tsdiff=186492933.000000 ns (1319800490/839110422 -> 1319800490/652617489
 gettimeofday tvdiff=183045.000000 us (1319800490/839110 -> 1319800490/656065
result=0.225014
 CLOCK_REALTIME tsdiff=189025879.000000 ns (1319800491/34697580 -> 1319800490/845671701
 gettimeofday tvdiff=189026.000000 us (1319800491/34697 -> 1319800490/845671
result=0.225014
 CLOCK_REALTIME tsdiff=178375245.000000 ns (1319800491/213378000 -> 1319800491/35002755
 gettimeofday tvdiff=178376.000000 us (1319800491/213378 -> 1319800491/35002
result=0.225014
 CLOCK_REALTIME tsdiff=161682129.000000 ns (1319800491/385497140 -> 1319800491/223815011
 gettimeofday tvdiff=161682.000000 us (1319800491/385497 -> 1319800491/223815
result=0.225014
 CLOCK_REALTIME tsdiff=185974122.000000 ns (1319800491/599364328 -> 1319800491/413390206
 gettimeofday tvdiff=185974.000000 us (1319800491/599364 -> 1319800491/413390
result=0.225014
 CLOCK_REALTIME tsdiff=181945801.000000 ns (1319800491/784453439 -> 1319800491/602507638
 gettimeofday tvdiff=181946.000000 us (1319800491/784453 -> 1319800491/602507
result=0.225014
 CLOCK_REALTIME tsdiff=182708740.000000 ns (1319800491/974272775 -> 1319800491/791564035
 gettimeofday tvdiff=182708.000000 us (1319800491/974272 -> 1319800491/791564
result=0.225014
 CLOCK_REALTIME tsdiff=84564210.000000 ns (1319800492/65154124 -> 1319800491/980589914
 gettimeofday tvdiff=84565.000000 us (1319800492/65154 -> 1319800491/980589
result=0.225014
 CLOCK_REALTIME tsdiff=94512947.000000 ns (1319800492/268279124 -> 1319800492/173766177
 gettimeofday tvdiff=94513.000000 us (1319800492/268279 -> 1319800492/173766
result=0.225014
 CLOCK_REALTIME tsdiff=172271735.000000 ns (1319800492/534972239 -> 1319800492/362700504
 gettimeofday tvdiff=172272.000000 us (1319800492/534972 -> 1319800492/362700

So varying mostly somewhere around 180ms; but look at those two tests near the end that apparently ran in 85ms and 95ms - no way should it vary by a factor of 2, and by ~100ms - that's way outside the timer randomness from either 32khz or the 128Hz interrupt rate; and the fact it's apparently much faster rules out a process suddenly appearing and hogging CPU.

For reference I've also just run this on 'habitat' (one of the porter Panda boards) and it's giving consistent times - albeit it in the range of ~150ms (it's bogomips figure is ~2000 while mine is currently ~1500 which is also odd).

Dave

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

This bug doens't happen on the older 2.6.38-1400-linaro-omap

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :
Download full text (3.9 KiB)

Still does it on 3.1.0-1401 - although it seems to have gained a new oddity; first one or two loops are a lot slower.

daveg@panda-01:~$ uname -a
Linux panda-01 3.1.0-1401-linaro-lt-omap #3~ppa~oneiric-Ubuntu SMP PREEMPT Wed Oct 26 04:41:18 UTC 2011 armv7l armv7l armv7l GNU/Linux

daveg@panda-01:~$ ./armtimetest
 CLOCK_REALTIME tsdiff=321655295.000000 ns (1317900646/241210599 -> 1317900645/919555304
 gettimeofday tvdiff=321655.000000 us (1317900646/241210 -> 1317900645/919555
result=0.225014
 CLOCK_REALTIME tsdiff=132324229.000000 ns (1317900646/412841459 -> 1317900646/280517230
 gettimeofday tvdiff=148590.000000 us (1317900646/429107 -> 1317900646/280517
result=0.225014
 CLOCK_REALTIME tsdiff=148376464.000000 ns (1317900646/577850003 -> 1317900646/429473539
 gettimeofday tvdiff=148377.000000 us (1317900646/577850 -> 1317900646/429473
result=0.225014
 CLOCK_REALTIME tsdiff=147369385.000000 ns (1317900646/725585599 -> 1317900646/578216214
 gettimeofday tvdiff=147369.000000 us (1317900646/725585 -> 1317900646/578216
result=0.225014
 CLOCK_REALTIME tsdiff=148406981.000000 ns (1317900646/875365872 -> 1317900646/726958891
 gettimeofday tvdiff=148407.000000 us (1317900646/875365 -> 1317900646/726958
result=0.225014
 CLOCK_REALTIME tsdiff=144165039.000000 ns (1317900647/19897122 -> 1317900646/875732083
 gettimeofday tvdiff=144165.000000 us (1317900647/19897 -> 1317900646/875732
result=0.225014
 CLOCK_REALTIME tsdiff=120361328.000000 ns (1317900647/144744534 -> 1317900647/24383206
 gettimeofday tvdiff=120361.000000 us (1317900647/144744 -> 1317900647/24383
result=0.225014
 CLOCK_REALTIME tsdiff=144836427.000000 ns (1317900647/318175932 -> 1317900647/173339505
 gettimeofday tvdiff=144836.000000 us (1317900647/318175 -> 1317900647/173339
result=0.225014
 CLOCK_REALTIME tsdiff=113372803.000000 ns (1317900647/435607573 -> 1317900647/322234770
 gettimeofday tvdiff=113373.000000 us (1317900647/435607 -> 1317900647/322234
result=0.225014
 CLOCK_REALTIME tsdiff=138397218.000000 ns (1317900647/609313627 -> 1317900647/470916409
 gettimeofday tvdiff=138397.000000 us (1317900647/609313 -> 1317900647/470916
result=0.225014
daveg@panda-01:~$ ./armtimetest
 CLOCK_REALTIME tsdiff=450469963.000000 ns (1317900650/737151764 -> 1317900650/286681801
 gettimeofday tvdiff=501129.000000 us (1317900650/793548 -> 1317900650/292419
result=0.225014
 CLOCK_REALTIME tsdiff=148040774.000000 ns (1317900650/944976468 -> 1317900650/796935694
 gettimeofday tvdiff=148041.000000 us (1317900650/944976 -> 1317900650/796935
result=0.225014
 CLOCK_REALTIME tsdiff=148437499.000000 ns (1317900651/94115872 -> 1317900650/945678373
 gettimeofday tvdiff=148437.000000 us (1317900651/94115 -> 1317900650/945678
result=0.225014
 CLOCK_REALTIME tsdiff=148345947.000000 ns (1317900651/242828031 -> 1317900651/94482084
 gettimeofday tvdiff=148346.000000 us (1317900651/242828 -> 1317900651/94482
result=0.225014
 CLOCK_REALTIME tsdiff=145599365.000000 ns (1317900651/388793607 -> 1317900651/243194242
 gettimeofday tvdiff=145599.000000 us (1317900651/388793 -> 1317900651/243194
result=0.225014
 CLOCK_REALTIME tsdiff=130615235.000000 ns (1317900651/522582670 -> 1317900651/391967435
 gettimeofday tvdiff=13061...

Read more...

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

To add another snippet of information:

1) Running the test with the same kernel as above but keeping the CPUs busy with a low priority task eliminates the anomalous 1st run time - e.g.

taskset -c 0 nice -n 19 md5sum /dev/zero &
taskset -c 1 nice -n 19 md5sum /dev/zero &
./armtimetest

1b) Running those immediately prior to the test doesn't help - e.g.:

(taskset -c 0 md5sum /proc/cpuinfo & taskset -c 1 md5sum /proc/cpuinfo); wait; ./armtimetest

2) The issue about getting times that are unusually fast still happens occasionally

e.g.
 CLOCK_REALTIME tsdiff=146453858.000000 ns (1322760968/103338104 -> 1322760967/956884246
 gettimeofday tvdiff=146454.000000 us (1322760968/103338 -> 1322760967/956884
result=0.225014
 CLOCK_REALTIME tsdiff=54931640.000000 ns (1322760968/162847382 -> 1322760968/107915742
 gettimeofday tvdiff=54932.000000 us (1322760968/162847 -> 1322760968/107915
result=0.225014
 CLOCK_REALTIME tsdiff=161773686.000000 ns (1322760968/418157439 -> 1322760968/256383753
 gettimeofday tvdiff=161652.000000 us (1322760968/418157 -> 1322760968/256505
result=0.225014

(see the middle reading) It took 5 or 6 runs toget something that noticeable.

Revision history for this message
David Long (dave-long) wrote :

I know it's unlikely, but if you could just verify that you are not running ntp.

Revision history for this message
warmcat (andy-warmcat) wrote :

I noticed while uplevelling our cpuidle stuff for 3.2-rc changes that it has to fix up elapsed time while it was idle... is CPU_FREQ enabled in your config?

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

He's using the kernel package we provide, that has CPU_FREQ enabled by default:

linaro@linaro-ubuntu-desktop:~$ grep CPU_FREQ= /boot/config-3.1.0-140*
/boot/config-3.1.0-1400-linaro-lt-omap:CONFIG_CPU_FREQ=y
/boot/config-3.1.0-1401-linaro-lt-omap:CONFIG_CPU_FREQ=y
/boot/config-3.1.0-1402-linaro-lt-omap:CONFIG_CPU_FREQ=y

Revision history for this message
warmcat (andy-warmcat) wrote :

I'm sorry Ricardo I meant CPU_IDLE. I think this bug is old enough originally to predate us enabling CPU_FREQ. But we had CPU_IDLE on for a while before that.

We are anyway looking at CPU_IDLE implementation we have because we found it never moves away from the first C-state even though it is definitely getting called and acting otherwise reasonable. In addition there were big changes in 3.2-rc1 / rc2 about cpuidle api which I updated in our tracking. They did change how they accounted for time while idle.

So it'd be interesting to see if the issue can be reproduced with CPU_IDLE disabled if it is currently enabled. If it's not currently enabled, well, I dunno.

Revision history for this message
Ricardo Salveti (rsalveti) wrote : Re: [Bug 873453] Re: odd timing behaviour on panda

On Thu, Dec 1, 2011 at 11:59 PM, warmcat <email address hidden> wrote:
> I'm sorry Ricardo I meant CPU_IDLE.  I think this bug is old enough
> originally to predate us enabling CPU_FREQ.  But we had CPU_IDLE on for
> a while before that.
>
> We are anyway looking at CPU_IDLE implementation we have because we
> found it never moves away from the first C-state even though it is
> definitely getting called and acting otherwise reasonable.  In addition
> there were big changes in 3.2-rc1 / rc2 about cpuidle api which I
> updated in our tracking.  They did change how they accounted for time
> while idle.
>
> So it'd be interesting to see if the issue can be reproduced with
> CPU_IDLE disabled if it is currently enabled.  If it's not currently
> enabled, well, I dunno.

It's also enabled with current kernel:
linaro@linaro-ubuntu-desktop:~$ grep CPU_IDLE= /boot/config-3.1.0-140*
/boot/config-3.1.0-1400-linaro-lt-omap:CONFIG_CPU_IDLE=y
/boot/config-3.1.0-1401-linaro-lt-omap:CONFIG_CPU_IDLE=y
/boot/config-3.1.0-1402-linaro-lt-omap:CONFIG_CPU_IDLE=y

Will recompile the kernel without CPU_IDLE and post the link to the
bug in a few minutes.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

Here's the kernel with CPU_IDLE disabled: http://people.linaro.org/~rsalveti/linux-image-3.1.0-1402-linaro-lt-omap_3.1.0-1402.5_armel.deb

But I was also able to get some fast timings with it sometimes:
 CLOCK_REALTIME tsdiff=166351318.000000 ns (1322121994/667296966 -> 1322121994/500945648
 gettimeofday tvdiff=166351.000000 us (1322121994/667296 -> 1322121994/500945
result=0.225014
 CLOCK_REALTIME tsdiff=72937012.000000 ns (1322121994/744903167 -> 1322121994/671966155
 gettimeofday tvdiff=72937.000000 us (1322121994/744903 -> 1322121994/671966
result=0.225014
 CLOCK_REALTIME tsdiff=169189457.000000 ns (1322121995/11504731 -> 1322121994/842315274
 gettimeofday tvdiff=169189.000000 us (1322121995/11504 -> 1322121994/842315
result=0.225014

But please check and see if you're able to reproduce it at your enviroment.

Revision history for this message
warmcat (andy-warmcat) wrote :

Oh well thanks for checking. Dave Long's going to take a look at it.

Revision history for this message
David Long (dave-long) wrote :

Looking at the getnstimeofday() kernel code and instrumenting the kernel shows me the Posix realtime clock function apparently only uses the 32KHz hardware clock. So I don't know how we'd ever meet the resolution claim of 1ns returned by clock_getres(), but on the plus side I would think the time would be stable despite going into low-power states. I am looking to see where things may be going wrong.

If someone has a kernel that actually avoids this problem I would be very interested in the contents of all the (short) files under

/sys/devices/system/clocksource

David Long (dave-long)
Changed in linaro-landing-team-ti:
assignee: warmcat (andy-warmcat) → David Long (dave-long)
Revision history for this message
warmcat (andy-warmcat) wrote :

Dave according to #3 you might want to check out and try branch "tilt-linux-linaro-2.6.38" from our repo, it looks like this might not have the issue.

Reading #2 closer it seems like the periods between sampling are too big for it to be a simple granularity issue. One of the issues is that he knows it's doing 180ms of work but sometimes it says only 85ms has elapsed.

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

Hi DaveL,
  daveg@panda-01:~$ cd /sys/devices/system/clocksource/
daveg@panda-01:/sys/devices/system/clocksource$ ls
clocksource0
daveg@panda-01:/sys/devices/system/clocksource$ cat clocksource0/
cat: clocksource0/: Is a directory
daveg@panda-01:/sys/devices/system/clocksource$ cd clocksource0/
daveg@panda-01:/sys/devices/system/clocksource/clocksource0$ ls
available_clocksource current_clocksource
daveg@panda-01:/sys/devices/system/clocksource/clocksource0$ cat available_clocksource
32k_counter
daveg@panda-01:/sys/devices/system/clocksource/clocksource0$ cat current_clocksource
32k_counter

ps -eaf|grep ntp shows nothing (the ntpdate package is installed, initlist list | grep time or ntp shows nothing).

Dave

Revision history for this message
David Long (dave-long) wrote :

I know the 32KHz clock does not explain the granularity, but I think it's still a lie for clock_getres() to report 1ns with a 32KHz clock. Not relevant to this bug, but possibly a new bug.

Although I can duplicate significant jitter with 3.0, the jitter with 3.1 is much more believable. I don't yet have an explanation for why.

Revision history for this message
David Long (dave-long) wrote :

I am still not able to reproduce this problem from our 3.1 tracking tree. I verified correct behavior (so far) in 2.38, and failure in 3.0.

Revision history for this message
warmcat (andy-warmcat) wrote :

Just to be clear there's no under-reporting of time in tilt-3.1 or tilt-tracking (which is 3.2-rc4 at the moment)? Or either?

Revision history for this message
David Long (dave-long) wrote :

Correct, it stays close to and never dips below a baseline runtime. I added code in the test to detect that and I am soaking it some more now on tilt-tracking, but in 20 minutes of continuous running I've yet to see the reported problem.

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

Feel free to throw me a kernel .deb set and I'll give it a go.

Dave

Changed in linaro-ubuntu:
assignee: nobody → Ricardo Salveti (rsalveti)
Revision history for this message
Ricardo Salveti (rsalveti) wrote :

The package from current 3.1 tilt-linux-linaro: http://people.linaro.org/~rsalveti/873453/

Still no luck with it:

 CLOCK_REALTIME tsdiff=126373291.000000 ns (1322122264/661772965 -> 1322122264/535399674
 gettimeofday tvdiff=126373.000000 us (1322122264/661772 -> 1322122264/535399
result=0.225014
 CLOCK_REALTIME tsdiff=54077149.000000 ns (1322122264/720031022 -> 1322122264/665953873
 gettimeofday tvdiff=54078.000000 us (1322122264/720031 -> 1322122264/665953
result=0.225014
 CLOCK_REALTIME tsdiff=130584722.000000 ns (1322122264/927123307 -> 1322122264/796538585
 gettimeofday tvdiff=130585.000000 us (1322122264/927123 -> 1322122264/796538
result=0.225014

Time to test the current tilt-tracking one.

Revision history for this message
warmcat (andy-warmcat) wrote :

Dave can you confirm you're not seeing it on tilt-3.1? Because then we're looking at linux-linaro-3.1, config changes, bootloader, rootfs content causing it.

If you just meant you're not seeing it on tracking, that's a different story we need to get a packaged tracking to try.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

Also pushed a kernel package based on current tracking tree: http://people.linaro.org/~rsalveti/873453/

Same issue:

Linux linaro-nano 3.2.0-rc4+ #2 SMP PREEMPT Wed Dec 7 00:05:39 BRST 2011 armv7l armv7l armv7l GNU/Linux

 CLOCK_REALTIME tsdiff=127746579.000000 ns (1322121872/483062069 -> 1322121872/355315490
 gettimeofday tvdiff=127747.000000 us (1322121872/483062 -> 1322121872/355315
result=0.225014
 CLOCK_REALTIME tsdiff=126800540.000000 ns (1322121872/613036437 -> 1322121872/486235897
 gettimeofday tvdiff=126801.000000 us (1322121872/613036 -> 1322121872/486235
result=0.225014
 CLOCK_REALTIME tsdiff=38024902.000000 ns (1322121872/655272764 -> 1322121872/617247862
 gettimeofday tvdiff=38025.000000 us (1322121872/655272 -> 1322121872/617247
result=0.225014
 CLOCK_REALTIME tsdiff=130035409.000000 ns (1322121872/878112120 -> 1322121872/748076711
 gettimeofday tvdiff=130036.000000 us (1322121872/878112 -> 1322121872/748076
result=0.225014

Changed in linaro-landing-team-ti:
status: New → Confirmed
Changed in linaro-ubuntu:
status: New → Confirmed
Revision history for this message
warmcat (andy-warmcat) wrote :

Dave can you install Ricardo's kernel package and see if it's broken the same way for you.

If it is, we need to diff the config from the two kernels, checking what modules are installed.

If it isn't, we need to look at bootloader / rootfs differences.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

On Wed, Dec 7, 2011 at 12:44 AM, warmcat <email address hidden> wrote:
> Dave can you install Ricardo's kernel package and see if it's broken the
> same way for you.
>
> If it is, we need to diff the config from the two kernels, checking what
> modules are installed.

You can easily extract both the module list and config from my
package, but I just used the omap4_defconfig one (that I know it's
supported by you).

> If it isn't, we need to look at bootloader / rootfs differences.

Yeah, information about the image, u-boot version and so on might also help.

Revision history for this message
warmcat (andy-warmcat) wrote :

Just to be super paranoid, Dave is using 4460 Panda I believe, is everyone with the problem using 4430?

What is plugged into your board in terms of USB peripherals, HDMI or DVI? Does everyone have wireless lan connected to something or not? Onboard Ethernet?

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

On Wed, Dec 7, 2011 at 1:14 AM, warmcat <email address hidden> wrote:
> Just to be super paranoid, Dave is using 4460 Panda I believe, is
> everyone with the problem using 4430?

I'm also testing it on a 4460. Last I check I was also able to
reproduce this on my 4430, but will double check again.

> What is plugged into your board in terms of USB peripherals, HDMI or
> DVI?  Does everyone have wireless lan connected to something or not?
> Onboard Ethernet?

Nothing is connected at the board besides the power supply and the
ethernet cable.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

On Wed, Dec 7, 2011 at 2:03 AM, Ricardo Salveti <email address hidden> wrote:
> On Wed, Dec 7, 2011 at 1:14 AM, warmcat <email address hidden> wrote:
>> Just to be super paranoid, Dave is using 4460 Panda I believe, is
>> everyone with the problem using 4430?
>
> I'm also testing it on a 4460. Last I check I was also able to
> reproduce this on my 4430, but will double check again.

That's interesting, I'm unable to reproduce the issue on my 4430 when
using the tilt-tracking kernel. Just used the same SD card I'm able to
reproduce the issue on my 4460.

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

Hi,
  The board identifies as:
OMAP4430 ES2.1

(It's one of the moderately early pandas - A1 rev I think).

With Ricardo's 3.2.0-rc4+ #2 kernel it seems a lot better; I still get the sluggish first loop, but I've not
had a loop show as bizarrely fast.

The ethernet is connected, there is a USB disk attached (and mounted, but not as /), and the serial is connected to a host running minicom (via a USB-serial adapter on the host). No display is connected, I've not done anything to the wireless settings (ifconfig shows 0 packets on wlan0. / is on an SD card.

Revision history for this message
warmcat (andy-warmcat) wrote :

Ricardo I think you might find if you do

# echo performance > /sys/devices/system/cpu/cpu0/cpu0/cpufreq/scaling_governor

in 4460 case, after the rootfs forced it to ondemand, you will only see problems periodically when the thermal_framework stuff kicks in, a different pattern of behaviour.

David, I don't know what caused these problems historically, but I think what's left with tracking is that the issue is coming from 4460-specific cpufreq change action. I think on tracking, for 4430 it's OK.

Dave, I think it's worth reviewing that code you already know from the 4460 clock rounding / changing patch you did (I don't mean your patch, I mean the flow around that code) looking for ways it could screw with timekeeping. We already know at least the version of that code we had didn't have all the critical bugs shaken out of it by the time we got it. Maybe also review Keerthy's cpu temp sensor driver you also know well now since that's also 4460 specific.

Revision history for this message
Ricardo Salveti (rsalveti) wrote :

Andy, with 4460 I'm always able to reproduce the issue, with both ondemand and performance.

Any other change that might help tracking down the issue?

Revision history for this message
David Long (dave-long) wrote :

1) I've deliberately done all my testing so far on 4430
2) I've been using omap4_defconfig for all test builds
3) I've tried tilt-3.1, tilt-linux-linaro-3.1, and tilt-linux-tinaro-3.0. Tilt-3.1 was the only one without the problem.

Revision history for this message
warmcat (andy-warmcat) wrote :

It seems there's general agreement that this is gone on 4430 and tracking kernel at least.

On 4460 and tilt-tracking I saw it work stably for ~ 200 loops until thermal stuff kicks in, at which point it starts showing damage (I modified the script a bit)

...
130616 vs 130616
131775 vs 131775
29083 vs 29083
130707 vs 130707
130463 vs 130463
87371 vs 87371
130859 vs 130859
124969 vs 130523
130584 vs 130584
129517 vs 129517
128663 vs 128663
130188 vs 130188
121307 vs 121277
126495 vs 126495
31769 vs 31769
130799 vs 130799
128540 vs 128540
130615 vs 130615
130493 vs 130493
128234 vs 128234
130371 vs 130371
129975 vs 129975
130676 vs 130676
129700 vs 129700
130523 vs 130523
130463 vs 130463
122772 vs 122772
124970 vs 124970
70740 vs 70740
127411 vs 133972
118714 vs 118714
130157 vs 130157
129028 vs 129028
69244 vs 69244
127685 vs 127685
91736 vs 91736
130615 vs 130615
121276 vs 121276
123810 vs 123810
30273 vs 30273
130707 vs 130707
122436 vs 122436
52338 vs 52338
130523 vs 130523
125641 vs 125641
130066 vs 130066
102967 vs 102967
115753 vs 115753

in dmesg contemporary with that is

[ 95.840850] omap_safe_zone:hot spot temp 82135
[ 100.841094] omap_monitor_zone:hot spot temp 85689
[ 104.592437] omap_safe_zone:hot spot temp 81542
[ 105.591278] omap_monitor_zone:hot spot temp 86282
[ 115.841979] omap_safe_zone:hot spot temp 82135

Because of thermal, it's not going to give stable results on 4460 like 4430 since it's switching between 1.2GHz and 920MHz all the time. But that should always give results at or inbetween two times matching those speeds, nothing abnormally low.

Notice that the default on tracking is performance cpu_freq governor now, but ~1 min after boot Ubuntu resets it to ondemand via a sleepy boot script IIRC. So you have to be careful the governor is what you think it is because it will change it behind your back.

Revision history for this message
David Long (dave-long) wrote :

If you do an:

echo 350000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

you can lock the cpu at the lowest supported frequency. I verified this with a printk in:

arch/arm/mach-omap2/omap2plus-cpufreq.c.

using tilit-3.1 and tilit-tracking. However, the 4460 continues to display jitter in the timing info (with some short intervals), so I don't see how this can be related to frequency scaling. It feels more like it's delaying processing of the wall clock update (from timer_tick). It's not obvious to me why this would be 4460-specific though. Removing NO_HZ from the config might reduce the frequency of occurrence, but clearly does not fix the problem.

Revision history for this message
warmcat (andy-warmcat) wrote :

Thermal is 4460-only, we know it has direct path to omap-specific cpufreq backend, might try killing that and dvfs in config. I didn't try it for a while but you could also turn off PM wholesale in config. See if either of those are implicated.

Revision history for this message
David Long (dave-long) wrote :

I turned off THERMAL_FRAMEWORK, CPU_IDLE, and CPU_FREQ (and their assorted related configs). It made no difference, 4460 still fails and 4430 passes. I also tried turning off OMAP4460PLUS_SCM (which turns off the temp sensor config), but no luck there either.

 I'm going to try and work backwards from the wall clock time code.

Revision history for this message
Dr. David Alan Gilbert (davidgil-uk) wrote :

As an aside, on the 3.2.0-rc4+ that I got from Ricardo above, it seems to spit out a :

[416071.785369] omapdss DISPC error: timeout waiting for digit out to stop

every few minutes.

Dave

Revision history for this message
David Long (dave-long) wrote :

I did quite a bit of experimentation and determined the problem goes away if I inhibit the 32k_timer from having its clock gated at idle. I don't know why we should have to do this but it's the only fix I've been able to come up with. My theory is that a stale value is being retrieved when the interface clock is coming out of idle. I have provided the patch to my tech lead.

Anmar Oueja (anmar)
Changed in linaro-landing-team-ti:
importance: Undecided → Low
Revision history for this message
warmcat (andy-warmcat) wrote :

Guys... since January... does this problem still exist on current tilt3.4?

warmcat (andy-warmcat)
Changed in linaro-landing-team-ti:
status: Confirmed → Fix Released
Fathi Boudra (fboudra)
Changed in linaro-ubuntu:
assignee: Ricardo Salveti (rsalveti) → nobody
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.