cache-coherency test failed. Abort!! while bigLITTLE test execution on Fastmodel

Bug #1027203 reported by Naresh Kamboju on 2012-07-20
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linaro Android
Fix Released
High
Amit Pundir
Linaro big.LITTLE
Fix Released
Undecided
Nicolas Pitre

Bug Description

Android b.L switcher integrated image build 85 on July 20 is booted on Fastmodels and boot success and
bigLITTLE testsuite has been given for test execution and following issues
is reported.

coretile.cache_state_modelled=0 disabled.
and fast model used is RTSM_VE_Cortex-A15x4-A7x4

#run_stress_switcher_tests.sh -a
.
.
.
Running stressapptest -M 16 --cc_test -s 10
***bl-agitator***
CPU count: 4
CPU0: big freq 1000000 LITTLE freq 100000
CPU1: big freq 1000000 LITTLE freq 100000
CPU2: big freq 1000000 LITTLE freq 100000
CPU3: big freq 1000000 LITTLE freq 100000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
cache-coherency test failed. Abort!!

Kill bigLITTLE switcher
Time elapsed: 0:00:10.1000
Terminated because of SIG 15
Test failed. Abort!!
cache-coherency-switching : FAIL

I have download pre-built images from below link to test BL test suite with this image on Fastmodels

https://snapshots.linaro.org/android/~linaro-android-restricted/vexpress-rtsm-isw-ics-gcc47-armlt-stable-open/85/target/product/vexpress_rtsm/

please find attached logs for more information.

Changed in linaro-big-little-system:
assignee: nobody → Nicolas Pitre (npitre)

Here's what I found so far:

- The bug apparently doesn't affect the Ubuntu build. So I tried
  a kernel without the Android patches (the Android user space isn't
  all happy about it but the shell is working) and the test still fails.
  At least this rules out Android kernel changes.

- Then I hacked the cpufreq driver to let user space believe switch
  requests were honored but without doing any switching using this
  patch:

diff --git a/drivers/cpufreq/arm-bl-cpufreq.c b/drivers/cpufreq/arm-bl-cpufreq.c
index 85e1bb0519..14e62d9866 100644
--- a/drivers/cpufreq/arm-bl-cpufreq.c
+++ b/drivers/cpufreq/arm-bl-cpufreq.c
@@ -106,9 +106,13 @@ static void __get_current_cluster(void *_data)

 static int get_current_cluster(unsigned int cpu)
 {
+#if 0
  unsigned int cluster = 0;
  smp_call_function_single(cpu, __get_current_cluster, &cluster, 1);
  return cluster;
+#else
+ return per_cpu(cpu_cur_cluster, cpu);
+#endif
 }

 static int get_current_cached_cluster(unsigned int cpu)
@@ -143,7 +147,7 @@ static void switch_to_entry(unsigned int cpu,
  freqs.new = entry_to_freq(target);

  cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- bL_switch_request(cpu, new_cluster);
+ //bL_switch_request(cpu, new_cluster);
  per_cpu(cpu_cur_cluster, cpu) = new_cluster;
  cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
 }

  And then the test appears to work OK.

So... it seems that the difference between the Android user libraries
and the Ubuntu user libraries is triggering the bug. However the bug
appears only when real switching is taking place, indicating some
remaining flaw with the switcher.

Would it be possible to get a strace dump of the running test when
running on Ubuntu vs Android? Extracting syscalls differences between
the two could help identify the problem. No need to switch at the same
time therefore no need to use Fast Models for this.

Naresh Kamboju (naresh-kamboju) wrote :
Download full text (3.9 KiB)

Hi Nicolas,

On 4 August 2012 09:05, Nicolas Pitre <email address hidden> wrote:

> Here's what I found so far:
>
> - The bug apparently doesn't affect the Ubuntu build. So I tried
> a kernel without the Android patches (the Android user space isn't
> all happy about it but the shell is working) and the test still fails.
> At least this rules out Android kernel changes.
>
> - Then I hacked the cpufreq driver to let user space believe switch
> requests were honored but without doing any switching using this
> patch:
>
> diff --git a/drivers/cpufreq/arm-bl-cpufreq.c
> b/drivers/cpufreq/arm-bl-cpufreq.c
> index 85e1bb0519..14e62d9866 100644
> --- a/drivers/cpufreq/arm-bl-cpufreq.c
> +++ b/drivers/cpufreq/arm-bl-cpufreq.c
> @@ -106,9 +106,13 @@ static void __get_current_cluster(void *_data)
>
> static int get_current_cluster(unsigned int cpu)
> {
> +#if 0
> unsigned int cluster = 0;
> smp_call_function_single(cpu, __get_current_cluster, &cluster, 1);
> return cluster;
> +#else
> + return per_cpu(cpu_cur_cluster, cpu);
> +#endif
> }
>
> static int get_current_cached_cluster(unsigned int cpu)
> @@ -143,7 +147,7 @@ static void switch_to_entry(unsigned int cpu,
> freqs.new = entry_to_freq(target);
>
> cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
> - bL_switch_request(cpu, new_cluster);
> + //bL_switch_request(cpu, new_cluster);
> per_cpu(cpu_cur_cluster, cpu) = new_cluster;
> cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
> }
>
> And then the test appears to work OK.
>
> OK.

> So... it seems that the difference between the Android user libraries
> and the Ubuntu user libraries is triggering the bug. However the bug
> appears only when real switching is taking place, indicating some
> remaining flaw with the switcher.
>
> Would it be possible to get a strace dump of the running test when
> running on Ubuntu vs Android?

please find attached strace log for Android and execution still in progress
for ubuntu.
I will share strace log on ubuntu soon.

Best regards
Naresh Kamboju

> Extracting syscalls differences between
> the two could help identify the problem. No need to switch at the same
> time therefore no need to use Fast Models for this.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1027203
>
> Title:
> cache-coherency test failed. Abort!! while bigLITTLE test execution on
> Fastmodel
>
> Status in Linaro big.LITTLE System:
> New
>
> Bug description:
> Android b.L switcher integrated image build 85 on July 20 is booted on
> Fastmodels and boot success and
> bigLITTLE testsuite has been given for test execution and following
> issues
> is reported.
>
> coretile.cache_state_modelled=0 disabled.
> and fast model used is RTSM_VE_Cortex-A15x4-A7x4
>
> #run_stress_switcher_tests.sh -a
> .
> .
> .
> Running stressapptest -M 16 --cc_test -s 10
> ***bl-agitator***
> CPU count: 4
> CPU0: big freq 1000000 LITTLE freq 100000
> CPU1: big freq 1000000 LITTLE freq 100000
> CPU2: big freq 1000000 LITTLE freq 100000
>...

Read more...

Naresh Kamboju (naresh-kamboju) wrote :

please find attached strace log for Android and execution still in progress for ubuntu.
I will share strace log on ubuntu soon.

since strace was not accepting command line args from scripts
so put in a one more script.

echo "#! /system/bin/sh" >> strace-bug-1027203.sh
echo "run_stress_switcher_tests.sh -t 3" >> strace-bug-1027203.sh
chmod 777 strace-bug-1027203.sh

Naresh Kamboju (naresh-kamboju) wrote :

please find attached strace log for ubuntu.

since strace was not accepting command line args from scripts
so i put it in a one more script.

root@linaro-developer:~# cat strace-bug-1027203.sh
#! /bin/sh
run_stress_switcher_tests.sh -t 3
root@linaro-developer:~#
root@linaro-developer:~# strace -f ./strace-bug-1027203.sh

Best regards
Naresh Kamboju

Tixy (Jon Medhurst) (tixy) wrote :

This doesn't look like the Ubuntu logs refer to a successful test run. They contain lots of entries like

[pid 1328] write(2, "/usr/bin/cache-coherency-switche"..., 81/usr/bin/cache-coherency-switcher.sh: 362: /usr/bin/cache-coherency-switcher.sh: ) = 81
[pid 1328] write(2, "[[: not found", 13[[: not found) = 13
[pid 1328] write(2, "\n", 1

which the Android one doesn't.

And the Android log has entries which look like it's doing something useful, which are absent from the Ubuntu log.
E.g....

[pid 922] write(1, "Starting bigLITTLE random switch"..., 53Starting bigLITTLE random switcher in the background
)
[pid 1430] open("/sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq", O_RDONLY|O_LARGEFILE <unfinished ...>
[pid 1404] open("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq", O_RDONLY|O_LARGEFILE <unfinished ...>

Tixy (Jon Medhurst) (tixy) wrote :

I also note that the Ubuntu test logs have:

[pid 1268] write(1, "cache-coherency-switching : PASS"..., 33cache-coherency-switching : PASS

which is a bit worrying considering it doesn't seem to be running the tests properly.

Nicolas Pitre (npitre) wrote :

Indeed! Good catch. I was too focused on stressapptest.
This test would have to be fixed and run again on Ubuntu.

Thanks for your comments.
I understood the problem.
today i have been digging this problem on ubuntu and found that "Kernel OOPS " could be the problem for file not found error.
because after detecting "Kernel OOPS" exit 1 will be called and i think strace lost its flow.
/usr/bin/cache-coherency-switcher.sh file always present on the same location but our test not able to found this file after "Kernel OOPS". The reason for "Kernel OOPS" is that, in the script we are doing dmesg | grep "unable to handle" is getting pass here.

the execution flow is taking place as
run_stress_switcher_tests.sh with -t 3 options invokes -> cache-coherency-switcher.sh -f big -c 0 -n 0 1 2 3 -r 100 -l 1000 this will invoke first step as switch() function will be called and this will internally invokes boot_a15 will jump execution to an other file called boot-a15.sh and inside this file the first and foremost job is to check for the "check_kernel_oops" where script detecting and below and script doing exit 1.

 check_kernel_oops()
{
        KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
        if ! ([[ -z "$KERNEL_ERR" ]]); then
                echo "Kernel OOPS. Abort!!"
                exit 1
        fi
}

from this step onwards script not able to find file (/usr/bin/cache-coherency-switcher.sh ) and getting file not found error several times.

I would like to give two runs to isolate the problem.
1. disable check_kernel_oops
2. enable check_kernel_oops and run it on stable ubuntu. (where dmesg should not have any error logs)
3. is "grep" doing right job. let me run manually dmesg | grep "Unable to handle kernel "

I would to update this bug with latest information,
I have found that the problem of "file not found" issue is not only with cache-coherence-switcher test. It is an issue the way we running the strace on ubuntu, same commands did worked on android but not working on ubuntu.
Today I have made run in basic test on ubuntu as blow and found same kind of file not found error and did not see the expected test (/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq) happening.
I need to dig this issue more w.r.t ubuntu env and the strace outcome.

please find attached file. which i ran today for bl-basic-tests.
root@linaro-developer:~# cat strace-run.sh
#! /usr/bin/sh
run_stress_switcher_tests.sh -t 13

I have suspected strace output problem on ubuntu and carried out few experiments to confirm the behavior.
expected sysfs output is coming on EXP-1 and EXP-2. but not incase of EXP-3, EXP-4 and EXP-5

Regarding the EXP-1 and EXP-2 test
        with strace switching is happening.
        without strace switching is happening.

Regarding the EXP-3, EXP-4 and EXP-5 test
        with strace switching is not happening.
        without strace switching is happening.
The above behavior is confirmed by amit pundir, he has GUI FM where we can see cluster blinks.

EXP 1:
#cd /usr/bin
#strace -f /usr/bin/sh governor
open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
write(1, "userspace\n", 10) = 10
[pid 1970] open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", O_RDONLY|O_LARGEFILE) = 3
[pid 1970] read(3, "userspace\n", 32768) = 10

File not found error we can ignore from the strace output.
After above open() read() and write() operations, still strace reported governor file not found several times.

[pid 1968] write(2, "[[: not found", 13[[: not found) = 13
[pid 1968] write(2, "\n", 1

EXP 2:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -b (run-bl-basic-tests.sh and switcher-tests.sh )
rmmod arm-bl-cpufreq
[pid 1427] open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = -1 ENOENT (No such file or directory)
...
insmod arm-bl-cpufreq
[pid 1503] open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 3
[pid 1503] write(1, "1000000\n", 8) = 8

EXP 3:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -t 1 (test name cache-coherency-a7 )

EXP 4:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -t 2 (test name cache-coherency-a15 )

EXP 5:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -t 3 (test name cache-coherency-switching )

below test case also not able to produce expected output with strace.
EXP 6:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -t 10 ( test name mem-stress-a7)

EXP 7:
#strace -f /usr/bin/sh /usr/bin/run_stress_switcher_test.sh -t 12 ( test name mem-stress-switching )

As I suspected earlier,
a. The "strace" on ubuntu is the problem for not executing the test scripts.
b. "strace" is not able to parse the shell script as expected.
c. In our current test script, the strace is not able to parse line number 4 in the below example code snippet.
d. line 3, from dmesg do not notice "Unable to handle kernel" string. still line 4 getting passed and test case is doing exit 1.
e. strace is not able to parse the string length check as expected.
f. in our test script, string length check is happening several time and strace on ubuntu is not giving expected results.
g. line 10, will not reach on ubuntu with strace.
h. line 10, will reach on ubuntu without strace.

Code:
--------
1 check_kernel_oops()
2 {
3 KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
4 if ! ([[ -z "$KERNEL_ERR" ]]); then
5 echo "Kernel OOPS. Abort!!"
6 exit 1
7 fi
8 }
9 check_kernel_oops
10 echo " with strace on ubuntu you never reach me"

Output:
-----------
root@linaro-developer:~# strace -o output-kernel-oops.log -f /usr/bin/sh kernel-oops.sh
kernel-oops.sh: 4: kernel-oops.sh: [[: not found
Kernel OOPS. Abort!!
root@linaro-developer:~# ./kernel-oops.sh
 with strace on ubuntu you never reach me
root@linaro-developer:~# uname -a
Linux linaro-developer 3.4-0-linaro-lt-vexpressdt-rtsm #0~RC1linaro14-Ubuntu SMP Sat Jul 21 20:17:17 UTC 2012 armv7l armv7l armv7l GNU/Linux

Comments:
----------------
->No need to change the test case implementation.
->Platform Team can continue investigation on Android Failure as reported.
->I think we need to report this strace issue to ubuntu package builders.

As you know on Android, strace is working perfect.

Ricardo Salveti (rsalveti) wrote :

This is not a problem with strace itself, but with the default shell used by Ubuntu.

While when running by hand it worked fine, it probably just worked because it was using bash, which is not the default at ubuntu (dash is, check /bin/sh).

The results:
linaro@linaro-nano:~$ cat foo.sh
#!/bin/sh

check_kernel_oops()
{
 KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
 if ! ([[ -z "$KERNEL_ERR" ]]); then
  echo "Kernel OOPS. Abort!!"
  exit 1
 fi
}

check_kernel_oops
echo " with strace on ubuntu you never reach me"
linaro@linaro-nano:~$ sh foo.sh
foo.sh: 6: foo.sh: [[: not found
Kernel OOPS. Abort!!
linaro@linaro-nano:~$ bash foo.sh
 with strace on ubuntu you never reach me

Now with a modified script, which is also compatible with minimal shells like dash:
linaro@linaro-nano:~$ cat foo.sh
#!/bin/sh

check_kernel_oops()
{
 KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
 if [ -n "$KERNEL_ERR" ]; then
  echo "Kernel OOPS. Abort!!"
  exit 1
 fi
}

check_kernel_oops
echo " with strace on ubuntu you never reach me"
linaro@linaro-nano:~$ sh foo.sh
 with strace on ubuntu you never reach me
linaro@linaro-nano:~$ bash foo.sh
 with strace on ubuntu you never reach me
linaro@linaro-nano:~$ strace -o output-kernel-oops.log -f /bin/sh foo.sh
 with strace on ubuntu you never reach me

Thanks for your comments Ricardo Salveti (rsalveti).
I have gone through page http://blogs.plexibus.com/2010/01/04/bashisms-and-ubuntu/
I ran checkbashisms on the existing BL switcher core test suite and found 130+ changes needs to be done.
if i have to run test suite with strace then need to modify 130+ lines of script. although it is working well without strace.

core$ ls */*.sh *.sh | xargs checkbashisms

<nkambo> rsalveti, is there any possibility of using bash instead of sh by strace ?
<nkambo> rsalveti, I ran checkbashisms on the existing BL switcher core test suite and found 130+ changes needs to be done.

<rsalveti> nkambo: ouch
<rsalveti> nkambo: yup, you can force bash, but removing bashisms would also be a good thing to do
<nkambo> rsalveti, " removing bashisms would also be a good thing to do" how it can be done...is it while building rootfs ?
<rsalveti> nkambo: changing the test suite code
<nkambo> rsalveti, ooh ok

<rsalveti> nkambo: if you try strace -o output-kernel-oops.log -f /bin/bash script.sh it might work
<nkambo> rsalveti, doing the same.
<nkambo> rsalveti, let me verify can we force /bin/bash instead of /bin/sh

execution started on target, still not come out of execution.
i will publish results after complete execution of test case.

Nicolas Pitre (npitre) wrote :

On Tue, 28 Aug 2012, Naresh Kamboju wrote:

> I have gone through page http://blogs.plexibus.com/2010/01/04/bashisms-and-ubuntu/
> I ran checkbashisms on the existing BL switcher core test suite and found 130+ changes needs to be done.
> if i have to run test suite with strace then need to modify 130+ lines of script. although it is working well without strace.

Is it _always_ working well even without strace for sure?

The safest option might simply be to force bash as the shell to use when
running tests on Ubuntu.

test execution completed with strace after forcing shell to bash instead of sh.
test summary is obtained and strace log has been saved.

>Is it _always_ working well even without strace for sure?
The answer is yes.
But some times the test case will fail if governors not been set to userspace.
I am working on modification of test suite when we run each test independently (not -a).
but this will not stop on investigating this bug on android.

>The safest option might simply be to force bash as the shell to use when running tests on Ubuntu.
yes. this would be a good idea.

Zach Pfeffer (pfefferz) on 2012-09-04
Changed in linaro-android:
importance: Undecided → High
assignee: nobody → Amit Pundir (pundiramit)
milestone: none → 12.09

With the latest Android build=31 this test case is passed. which concludes All the BL core tests on Android fully PASSED.
https://android-build.linaro.org/builds/~linaro-android-restricted/vexpress-rtsm-isw-jb-gcc47-armlt-stable-open/#build=31

kernel commit id:
4c681f56be2d07b0326847d49523805c4ab9ac8f

Results log:
---------------
Running cache-coherency-switching ..

Switching to big mode if not already in.

Number of CPUs successfully brought up during boot = 4

Running /system/bin/boot-a15.sh -c 0

cpu0 is big

Starting bigLITTLE random switcher in the background
spawning thread(s) on specified cpu(s)

Running stressapptest -M 16 --cc_test -s 10
***bl-agitator***
CPU count: 4
CPU0: big freq 1000000 LITTLE freq 100000
CPU1: big freq 1000000 LITTLE freq 100000
CPU2: big freq 1000000 LITTLE freq 100000
CPU3: big freq 1000000 LITTLE freq 100000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
cache-coherency test finished successfully

Kill bigLITTLE switcher
Time elapsed: 0:00:10.349
Terminated because of SIG 15
cache-coherency-switching : PASS

Results-summary:
------------------------
Summary ..
Total Tests = 15
Tests Passed = 15
Tests Failed = 0

Linux kernel version:
-----------------------------
root@android:/ # uname -a
Linux localhost 3.6.0-rc4+ #1 SMP Wed Sep 5 16:34:38 UTC 2012 armv7l GNU/Linux

Complete test execution log attached.
Which has all BL core test cases PASSED.

Nicolas Pitre (npitre) wrote :

It turns out that this was a generic kernel bug that both the switcher
and probably some of the Android user space libraries were exposing.

The commit responsible for making the test pass is commit fae218a1ff
from v3.5.2, or commit a84b895a23 from v3.6-rc1.

Therefore marking as "fix released" as this is available in the above
mentioned kernel versions.

Changed in linaro-big-little-system:
status: New → Fix Released
Amit Pundir (pundiramit) wrote :

No longer affecting bL Android builds. Verified by naresh-kamboju on https://android-build.linaro.org/builds/~linaro-android-restricted/vexpress-rtsm-isw-jb-gcc47-armlt-stable-open/#build=31, #build=40 and #build=45.

Changed in linaro-android:
status: New → Fix Released
status: Fix Released → Fix Committed
Amit Pundir (pundiramit) on 2012-10-01
Changed in linaro-android:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers