cache-coherency test failed. Abort!! while bigLITTLE test execution on Fastmodel
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Linaro Android | Fix Released | High | Amit Pundir | |
Linaro big.LITTLE | Fix Released | Undecided | Nicolas Pitre | |
Bug Description
The Android b.L switcher integrated image, build 85 from July 20, boots successfully on Fast Models.
The bigLITTLE test suite was then run on it, and the following issue was reported.
coretile.
and fast model used is RTSM_VE_
#run_stress_
.
.
.
Running stressapptest -M 16 --cc_test -s 10
***bl-agitator***
CPU count: 4
CPU0: big freq 1000000 LITTLE freq 100000
CPU1: big freq 1000000 LITTLE freq 100000
CPU2: big freq 1000000 LITTLE freq 100000
CPU3: big freq 1000000 LITTLE freq 100000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
cache-coherency test failed. Abort!!
Kill bigLITTLE switcher
Time elapsed: 0:00:10.1000
Terminated because of SIG 15
Test failed. Abort!!
cache-coherency
I downloaded the pre-built images from the link below to run the BL test suite on Fast Models.
Please find the attached logs for more information.
Naresh Kamboju (naresh-kamboju) wrote : | #1 |
Changed in linaro-big-little-system: | |
assignee: | nobody → Nicolas Pitre (npitre) |
Naresh Kamboju (naresh-kamboju) wrote : | #2 |
Nicolas Pitre (npitre) wrote : Re: [Bug 1027203] Re: cache-coherency test failed. Abort!! while bigLITTLE test execution on Fastmodel | #3 |
Here's what I found so far:
- The bug apparently doesn't affect the Ubuntu build. So I tried
a kernel without the Android patches (the Android user space isn't
all happy about it but the shell is working) and the test still fails.
At least this rules out Android kernel changes.
- Then I hacked the cpufreq driver to let user space believe switch
requests were honored but without doing any switching using this
patch:
diff --git a/drivers/
index 85e1bb0519.
--- a/drivers/
+++ b/drivers/
@@ -106,9 +106,13 @@ static void __get_current_
static int get_current_
{
+#if 0
unsigned int cluster = 0;
smp_call_
return cluster;
+#else
+ return per_cpu(
+#endif
}
static int get_current_
@@ -143,7 +147,7 @@ static void switch_
freqs.new = entry_to_
cpufreq_
- bL_switch_
+ //bL_switch_
per_cpu(
cpufreq_
}
And then the test appears to work OK.
So... it seems that the difference between the Android user libraries
and the Ubuntu user libraries is triggering the bug. However the bug
appears only when real switching is taking place, indicating some
remaining flaw with the switcher.
Would it be possible to get a strace dump of the running test when
running on Ubuntu vs Android? Extracting syscalls differences between
the two could help identify the problem. No need to switch at the same
time therefore no need to use Fast Models for this.
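The syscall comparison requested above could be started with a small script. This is a minimal sketch, not part of the original test suite: the helper names syscall_names and syscall_diff are hypothetical, and it assumes both logs were captured with strace -f, so that lines look like `[pid 1328] write(...)` or plain `write(...)`. Lines that do not start with a syscall name, such as `<... resumed>` continuations, are simply skipped.

```shell
# Print the sorted, unique syscall names found in one strace log.
# Accepts lines with or without the "[pid N] " prefix that strace -f adds.
syscall_names() {
    sed -n 's/^\(\[pid *[0-9]*] \)\{0,1\}\([a-z_][a-z_0-9]*\)(.*/\2/p' "$1" | sort -u
}

# Show syscalls that appear in only one of the two logs.
# Column 1: only in the first log. Column 2 (tab-indented): only in the second.
syscall_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    syscall_names "$1" >"$a"
    syscall_names "$2" >"$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

It would be called as `syscall_diff android.log ubuntu.log`, where both file names are placeholders. This only compares which syscalls occur, not their arguments or order, so it is a first filter rather than a full analysis.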
Naresh Kamboju (naresh-kamboju) wrote : | #4 |
Hi Nicolas,
On 4 August 2012 09:05, Nicolas Pitre <email address hidden> wrote:
> Would it be possible to get a strace dump of the running test when
> running on Ubuntu vs Android?
Please find attached the strace log for Android. The run on Ubuntu is still in progress;
I will share that strace log soon.
Best regards
Naresh Kamboju
Naresh Kamboju (naresh-kamboju) wrote : | #5 |
Please find attached the strace log for Android. The run on Ubuntu is still in progress;
I will share that strace log soon.
Since strace was not accepting command-line arguments from scripts,
I put the command in one more script:
echo "#! /system/bin/sh" >> strace-bug-1027203.sh
echo "run_stress_
chmod 777 strace-bug-1027203.sh
Naresh Kamboju (naresh-kamboju) wrote : | #6 |
Please find attached the strace log for Ubuntu.
Since strace was not accepting command-line arguments from scripts,
I put the command in one more script:
root@linaro-
#! /bin/sh
run_stress_
root@linaro-
root@linaro-
Best regards
Naresh Kamboju
Tixy (Jon Medhurst) (tixy) wrote : | #7 |
The Ubuntu logs don't look like they come from a successful test run. They contain lots of entries like
[pid 1328] write(2, "/usr/bin/
[pid 1328] write(2, "[[: not found", 13[[: not found) = 13
[pid 1328] write(2, "\n", 1
which the Android one doesn't.
And the Android log has entries which look like it's doing something useful, which are absent from the Ubuntu log.
E.g....
[pid 922] write(1, "Starting bigLITTLE random switch"..., 53Starting bigLITTLE random switcher in the background
)
[pid 1430] open("/
[pid 1404] open("/
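A quick way to check a log for the symptom described above is to count the stderr writes that carry a "not found" message. This is a minimal sketch; the function name count_not_found is hypothetical, and the pattern assumes the `write(2, "...not found"...` form shown in the excerpts above.

```shell
# Count strace lines where a process wrote a "not found" message to stderr (fd 2).
count_not_found() {
    grep -c 'write(2, .*not found' "$1"
}
```

A non-zero count suggests the shell could not run part of the script, so the log probably does not reflect a valid test run.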
Tixy (Jon Medhurst) (tixy) wrote : | #8 |
I also note that the Ubuntu test logs have:
[pid 1268] write(1, "cache-
which is a bit worrying considering it doesn't seem to be running the tests properly.
Nicolas Pitre (npitre) wrote : | #9 |
Indeed! Good catch. I was too focused on stressapptest.
This test would have to be fixed and run again on Ubuntu.
Naresh Kamboju (naresh-kamboju) wrote : | #10 |
Thanks for your comments.
I understand the problem.
Today I dug into this on Ubuntu and found that the "Kernel OOPS" check could be the cause of the "file not found" error:
after "Kernel OOPS" is detected, exit 1 is called, and I think strace loses track of the flow.
/usr/bin/
The execution flow is as follows:
run_stress_
check_
{
if ! ([[ -z "$KERNEL_ERR" ]]); then
fi
}
From this step onwards the script is not able to find the file (/usr/bin/
I would like to make two runs, plus one manual check, to isolate the problem:
1. Disable check_kernel_oops.
2. Enable check_kernel_oops and run it on a stable Ubuntu, where dmesg should not have any error logs.
3. Check whether "grep" is doing the right job, by manually running dmesg | grep "Unable to handle kernel "
Naresh Kamboju (naresh-kamboju) wrote : | #11 |
I would like to update this bug with the latest information.
I have found that the "file not found" issue is not limited to cache-coherency.
Today I ran the basic test on Ubuntu as below, saw the same kind of "file not found" error, and did not see the expected test (/sys/devices/
I need to dig further into this issue with respect to the Ubuntu environment and the strace output.
Please find attached the file from today's bl-basic-tests run.
root@linaro-
#! /usr/bin/sh
run_stress_
Naresh Kamboju (naresh-kamboju) wrote : | #12 |
I suspected a problem with the strace output on Ubuntu and carried out a few experiments to confirm the behavior.
The expected sysfs output appears in EXP-1 and EXP-2, but not in EXP-3, EXP-4 and EXP-5.
For the EXP-1 and EXP-2 tests:
with strace, switching is happening.
without strace, switching is happening.
For the EXP-3, EXP-4 and EXP-5 tests:
with strace, switching is not happening.
without strace, switching is happening.
The above behavior was confirmed by Amit Pundir, who has the GUI FM where the cluster blinks are visible.
EXP 1:
#cd /usr/bin
#strace -f /usr/bin/sh governor
open("/
write(1, "userspace\n", 10) = 10
[pid 1970] open("/
[pid 1970] read(3, "userspace\n", 32768) = 10
The "file not found" errors in the strace output can be ignored.
Even after the above open(), read() and write() operations, strace still reported the governor "file not found" error several times.
[pid 1968] write(2, "[[: not found", 13[[: not found) = 13
[pid 1968] write(2, "\n", 1
EXP 2:
#strace -f /usr/bin/sh /usr/bin/
rmmod arm-bl-cpufreq
[pid 1427] open("/
...
insmod arm-bl-cpufreq
[pid 1503] open("/
[pid 1503] write(1, "1000000\n", 8) = 8
EXP 3:
#strace -f /usr/bin/sh /usr/bin/
EXP 4:
#strace -f /usr/bin/sh /usr/bin/
EXP 5:
#strace -f /usr/bin/sh /usr/bin/
Naresh Kamboju (naresh-kamboju) wrote : | #13 |
The test cases below also fail to produce the expected output with strace.
EXP 6:
#strace -f /usr/bin/sh /usr/bin/
EXP 7:
#strace -f /usr/bin/sh /usr/bin/
Naresh Kamboju (naresh-kamboju) wrote : | #14 |
As I suspected earlier:
a. "strace" on Ubuntu is the reason the test scripts are not executing.
b. "strace" is not able to parse the shell script as expected.
c. In our current test script, strace is not able to parse line 4 of the example code snippet below.
d. On line 3, dmesg does not contain the "Unable to handle kernel" string, yet the check on line 4 still passes and the test case does exit 1.
e. strace is not able to parse the string length check as expected.
f. In our test script the string length check happens several times, and strace on Ubuntu is not giving the expected results.
g. Line 10 is not reached on Ubuntu with strace.
h. Line 10 is reached on Ubuntu without strace.
Code:
--------
1 check_kernel_oops()
2 {
3 KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
4 if ! ([[ -z "$KERNEL_ERR" ]]); then
5 echo "Kernel OOPS. Abort!!"
6 exit 1
7 fi
8 }
9 check_kernel_oops
10 echo " with strace on ubuntu you never reach me"
Output:
-----------
root@linaro-
kernel-oops.sh: 4: kernel-oops.sh: [[: not found
Kernel OOPS. Abort!!
root@linaro-
with strace on ubuntu you never reach me
root@linaro-
Linux linaro-developer 3.4-0-linaro-
Comments:
----------------
->No need to change the test case implementation.
->The Platform Team can continue investigating the Android failure as reported.
->I think we need to report this strace issue to the Ubuntu package builders.
On Android, as you know, strace works perfectly.
Ricardo Salveti (rsalveti) wrote : | #15 |
This is not a problem with strace itself, but with the default shell used by Ubuntu.
Running it by hand probably only worked because it was using bash, which is not the default /bin/sh on Ubuntu (dash is; check /bin/sh).
The results:
linaro@
#!/bin/sh
check_kernel_oops()
{
KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
if ! ([[ -z "$KERNEL_ERR" ]]); then
echo "Kernel OOPS. Abort!!"
exit 1
fi
}
check_kernel_oops
echo " with strace on ubuntu you never reach me"
linaro@
foo.sh: 6: foo.sh: [[: not found
Kernel OOPS. Abort!!
linaro@
with strace on ubuntu you never reach me
Now with a modified script, which is also compatible with minimal shells like dash:
linaro@
#!/bin/sh
check_kernel_oops()
{
KERNEL_ERR=`dmesg | grep "Unable to handle kernel "`
if [ -n "$KERNEL_ERR" ]; then
echo "Kernel OOPS. Abort!!"
exit 1
fi
}
check_kernel_oops
echo " with strace on ubuntu you never reach me"
linaro@
with strace on ubuntu you never reach me
linaro@
with strace on ubuntu you never reach me
linaro@
with strace on ubuntu you never reach me
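The reason the original check prints "Kernel OOPS. Abort!!" even with a clean dmesg can be shown in isolation. When the shell has no `[[` command, the subshell exits with a non-zero status (127 for "command not found"), and the leading `!` turns that failure into success, so the `then` branch runs. The sketch below reproduces this in any POSIX shell; no_such_test_cmd is a hypothetical stand-in for `[[` on a shell that lacks it, and both function names are made up for illustration.

```shell
# Mimics the original check on a shell without "[[":
# the missing command fails, "!" inverts that, and the branch is taken.
broken_check() {
    value=$1
    if ! (no_such_test_cmd -z "$value") 2>/dev/null; then
        echo "branch taken"
    else
        echo "branch skipped"
    fi
}

# Portable form using the POSIX "[" test, as in the modified script above.
portable_check() {
    value=$1
    if [ -n "$value" ]; then
        echo "branch taken"
    else
        echo "branch skipped"
    fi
}
```

With an empty argument, broken_check still takes the branch while portable_check skips it, which matches the difference between the two foo.sh runs.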
Naresh Kamboju (naresh-kamboju) wrote : | #16 |
Thanks for your comments, Ricardo Salveti (rsalveti).
I have gone through page http://
I ran checkbashisms on the existing BL switcher core test suite and found that 130+ changes need to be made.
To run the test suite with strace I would need to modify 130+ lines of script, although it works well without strace.
core$ ls */*.sh *.sh | xargs checkbashisms
<nkambo> rsalveti, is there any possibility of using bash instead of sh by strace ?
<nkambo> rsalveti, I ran checkbashisms on the existing BL switcher core test suite and found 130+ changes needs to be done.
<rsalveti> nkambo: ouch
<rsalveti> nkambo: yup, you can force bash, but removing bashisms would also be a good thing to do
<nkambo> rsalveti, " removing bashisms would also be a good thing to do" how it can be done...is it while building rootfs ?
<rsalveti> nkambo: changing the test suite code
<nkambo> rsalveti, ooh ok
<rsalveti> nkambo: if you try strace -o output-
<nkambo> rsalveti, doing the same.
<nkambo> rsalveti, let me verify can we force /bin/bash instead of /bin/sh
Execution has started on the target and has not finished yet.
I will publish the results once the test case has run to completion.
Nicolas Pitre (npitre) wrote : | #17 |
On Tue, 28 Aug 2012, Naresh Kamboju wrote:
> I have gone through page http://
> I ran checkbashisms on the existing BL switcher core test suite and found 130+ changes needs to be done.
> if i have to run test suite with strace then need to modify 130+ lines of script. although it is working well without strace.
Is it _always_ working well even without strace for sure?
The safest option might simply be to force bash as the shell to use when
running tests on Ubuntu.
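Forcing bash could look roughly like the sketch below. The helper name run_under_bash is hypothetical, and the optional second argument is a placeholder path for an strace output file; -f (follow child processes) and -o (write the trace to a file) are standard strace options.

```shell
# Run a test script with bash as the interpreter, regardless of its "#!" line.
# If a second argument is given, trace the run with strace into that file.
run_under_bash() {
    script=$1
    trace_log=${2:-}
    if [ -n "$trace_log" ]; then
        strace -f -o "$trace_log" bash "$script"
    else
        bash "$script"
    fi
}
```

Passing the script to bash as an argument bypasses its own `#!/bin/sh` line, but any further scripts it launches through their own `#!/bin/sh` line would still run under sh, so this may not cover a whole test suite by itself.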
Naresh Kamboju (naresh-kamboju) wrote : | #18 |
Test execution completed with strace after forcing the shell to bash instead of sh.
The test summary was obtained and the strace log has been saved.
Naresh Kamboju (naresh-kamboju) wrote : | #19 |
>Is it _always_ working well even without strace for sure?
The answer is yes.
But sometimes the test case will fail if the governors have not been set to userspace.
I am modifying the test suite to handle running each test independently (not -a).
This will not hold up the investigation of this bug on Android.
>The safest option might simply be to force bash as the shell to use when running tests on Ubuntu.
Yes, this would be a good idea.
Naresh Kamboju (naresh-kamboju) wrote : | #20 |
The same test case failure is reported with build=123
https:/
Changed in linaro-android: | |
importance: | Undecided → High |
assignee: | nobody → Amit Pundir (pundiramit) |
milestone: | none → 12.09 |
Naresh Kamboju (naresh-kamboju) wrote : | #21 |
With the latest Android build=31 this test case passes, which means all the BL core tests on Android now fully PASS.
https:/
kernel commit id:
4c681f56be2d07b
Results log:
---------------
Running cache-coherency
Switching to big mode if not already in.
Number of CPUs successfully brought up during boot = 4
Running /system/
cpu0 is big
Starting bigLITTLE random switcher in the background
spawning thread(s) on specified cpu(s)
Running stressapptest -M 16 --cc_test -s 10
***bl-agitator***
CPU count: 4
CPU0: big freq 1000000 LITTLE freq 100000
CPU1: big freq 1000000 LITTLE freq 100000
CPU2: big freq 1000000 LITTLE freq 100000
CPU3: big freq 1000000 LITTLE freq 100000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
Random switcher seed 100 limit 1000
cache-coherency test finished successfully
Kill bigLITTLE switcher
Time elapsed: 0:00:10.349
Terminated because of SIG 15
cache-coherency
Results-summary:
-------
Summary ..
Total Tests = 15
Tests Passed = 15
Tests Failed = 0
Linux kernel version:
-------
root@android:/ # uname -a
Linux localhost 3.6.0-rc4+ #1 SMP Wed Sep 5 16:34:38 UTC 2012 armv7l GNU/Linux
Naresh Kamboju (naresh-kamboju) wrote : | #22 |
Complete test execution log attached.
It shows all BL core test cases PASSED.
Nicolas Pitre (npitre) wrote : | #23 |
It turns out that this was a generic kernel bug that both the switcher
and probably some of the Android user space libraries were exposing.
The commit responsible for making the test pass is commit fae218a1ff
from v3.5.2, or commit a84b895a23 from v3.6-rc1.
Therefore marking as "fix released" as this is available in the above
mentioned kernel versions.
Changed in linaro-big-little-system: | |
status: | New → Fix Released |
Amit Pundir (pundiramit) wrote : | #24 |
No longer affecting bL Android builds. Verified by naresh-kamboju on https:/
Changed in linaro-android: | |
status: | New → Fix Released |
status: | Fix Released → Fix Committed |
Changed in linaro-android: | |
status: | Fix Committed → Fix Released |
The same test case failure is reported with build=95: android-build.linaro.org/builds/~linaro-android-restricted/vexpress-rtsm-isw-ics-gcc47-armlt-stable-open/#build=95
https:/
-Naresh