Ubuntu

Shared memory operations on very fast ARM hardware suffer from non-atomic operations and race conditions.

Reported by Matthias Klose on 2008-11-19
Affects                          Importance  Assigned to
libipc-sharelite-perl (Ubuntu)   Low         Loïc Minier
  Jaunty                         High        Michael Casadevall
linux (Ubuntu)                   Low         Michael Casadevall
  Jaunty                         Low         Michael Casadevall

Bug Description

On extremely fast ARM boards (BogoMIPS over 900), issues involving shared memory and locking crop up; they were initially discovered via libipc-sharelite-perl's test suite. There is a lag between posting a value to shared memory and being able to retrieve it, and this lag introduces race conditions, as seen in libipc-sharelite-perl.

Original Bug Report:
http://launchpadlibrarian.net/19661148/buildlog_ubuntu-jaunty-armel.libipc-sharelite-perl_0.13-1build2_FAILEDTOBUILD.txt.gz

/usr/bin/make test
make[1]: Entering directory `/build/buildd/libipc-sharelite-perl-0.13'
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load......# Testing IPC::ShareLite 0.13
ok
t/pod..........ok
t/sharelite....
# Failed test 'num_segments'
# at t/sharelite.t line 40.
# got: '2'
# expected: '3'

# Failed test 'lock'
# at t/sharelite.t line 71.
# got: '1008'
# expected: '2000'

# Failed test 'version inc'
# at t/sharelite.t line 72.
# got: '1012'
# expected: '2004'
# Looks like you failed 3 tests of 14.
dubious
 Test returned status 3 (wstat 768, 0x300)
DIED. FAILED tests 10, 13-14
 Failed 3/14 tests, 78.57% okay
Failed Test Stat Wstat Total Fail List of Failed
-------------------------------------------------------------------------------
t/sharelite.t 3 768 14 3 10 13-14
Failed 1/3 test scripts. 3/16 subtests failed.
Files=3, Tests=16, 1 wallclock secs ( 1.36 cusr + 0.08 csys = 1.44 CPU)
Failed 1/3 test programs. 3/16 subtests failed.
make[1]: *** [test_dynamic] Error 3
make[1]: Leaving directory `/build/buildd/libipc-sharelite-perl-0.13'

Matthias Klose (doko) on 2008-11-19
Changed in libipc-sharelite-perl:
importance: Undecided → High
status: New → Triaged
Loïc Minier (lool) wrote :

It seems build3 built on armel; is this still an issue? Do you have an updated build log?

Steve Langasek (vorlon) wrote :

If build3 succeeded, I think we can consider this issue resolved unless/until it shows up again.

Changed in libipc-sharelite-perl:
status: Triaged → Fix Released

Reopened. It only builds because the testsuite results are ignored on armel for the last upload.

Changed in libipc-sharelite-perl:
status: Fix Released → Triaged

build3 disabled testsuite failures on armel which is why it passed; sorry.

However I couldn't reproduce the issue on my armel babbage (testsuite passed); I've uploaded build4 to see how it goes on the buildd nowadays.

So it failed on the buildds again.

It could be that the SHM segment is already used, but that would fail more tests. I suspect it's a board specific issue in the support of SHM features.

I couldn't reproduce on my evm board either.

Loïc Minier (lool) on 2009-02-12
Changed in libipc-sharelite-perl:
assignee: nobody → ogra
Oliver Grawert (ogra) wrote :

Tested on the babbage board, where the build finishes with no issues at all.
I will run a test on qemu and on the porter machine in parallel; my suspicion is that the buildd/porter hardware or the kernel in use is at fault here.

Oliver Grawert (ogra) wrote :

same issues on the porter box:

/usr/bin/make test
make[1]: Entering directory `/home/ogra/perltest/libipc-sharelite-perl-0.13'
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load......ok 1/1# Testing IPC::ShareLite 0.13
t/00-load......ok
t/pod..........ok
t/sharelite....NOK 10/14
# Failed test 'num_segments'
# at t/sharelite.t line 40.
# got: '1'
# expected: '3'
t/sharelite....NOK 11/14
# Failed test 'frag fetch'
# at t/sharelite.t line 43.
# got: ''
# expected: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
t/sharelite....NOK 13/14
# Failed test 'lock'
# at t/sharelite.t line 71.
# got: '1000'
# expected: '2000'
t/sharelite....NOK 14/14
# Failed test 'version inc'
# at t/sharelite.t line 72.
# got: '1004'
# expected: '2004'
# Looks like you failed 4 tests of 14.
t/sharelite....dubious
 Test returned status 4 (wstat 1024, 0x400)
DIED. FAILED tests 10-11, 13-14
 Failed 4/14 tests, 71.43% okay
Failed Test Stat Wstat Total Fail List of Failed
-------------------------------------------------------------------------------
t/sharelite.t 4 1024 14 4 10-11 13-14
Failed 1/3 test scripts. 4/16 subtests failed.
Files=3, Tests=16, 1 wallclock secs ( 1.19 cusr + 0.10 csys = 1.29 CPU)
Failed 1/3 test programs. 4/16 subtests failed.
make[1]: *** [test_dynamic] Error 4
make[1]: Leaving directory `/home/ogra/perltest/libipc-sharelite-perl-0.13'
make: [build-stamp] Error 2 (ignored)
touch build-stamp
 fakeroot debian/rules binary
dh_testdir
dh_testroot
dh_clean -k

Oliver Grawert (ogra) wrote :

buildlog excerpt on the babbage:

/usr/bin/make test
make[1]: Entering directory `/home/ogra/perltest/libipc-sharelite-perl-0.13'
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load......ok 1/1# Testing IPC::ShareLite 0.13
t/00-load......ok
t/pod..........ok
t/sharelite....ok
All tests successful.
Files=3, Tests=16, 2 wallclock secs ( 1.32 cusr + 0.15 csys = 1.47 CPU)

Oliver Grawert (ogra) wrote :

buildlog excerpt on qemu

/usr/bin/make test
make[1]: Entering directory `/home/ogra/perltest/libipc-sharelite-perl-0.13'
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load......ok 1/1# Testing IPC::ShareLite 0.13
t/00-load......ok
t/pod..........ok
t/sharelite....ok
All tests successful.
Files=3, Tests=16, 2 wallclock secs ( 1.32 cusr + 0.15 csys = 1.47 CPU)

Testing on rimu with lamont's help, it seems to be an alignment issue. Setting /proc/cpu/alignment to 2 (fixup) causes the following failures:

# Failed test 'num_segments'
# at t/sharelite.t line 40.
# got: '1'
# expected: '3'
t/sharelite....NOK 11/14
# Failed test 'frag fetch'
# at t/sharelite.t line 43.
# got: ''
# expected: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
t/sharelite....NOK 12/14
# Failed test 'version inc'
# at t/sharelite.t line 47.
# got: '2'
# expected: '4'
t/sharelite....NOK 13/14
# Failed test 'lock'
# at t/sharelite.t line 71.
# got: 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYAUS'
# expected: '2000'
t/sharelite....NOK 14/14
# Failed test 'version inc'
# at t/sharelite.t line 72.
# got: '1019'
# expected: '2004'
# Looks like you failed 5 tests of 14.
t/sharelite....dubious
 Test returned status 5 (wstat 1280, 0x500)
DIED. FAILED tests 10-14
 Failed 5/14 tests, 64.29% okay
Failed Test Stat Wstat Total Fail List of Failed
-------------------------------------------------------------------------------
t/sharelite.t 5 1280 14 5 10-14
Failed 1/3 test scripts. 5/16 subtests failed.
Files=3, Tests=16, 2 wallclock secs ( 1.27 cusr + 0.09 csys = 1.36 CPU)
Failed 1/3 test programs. 5/16 subtests failed.
make: *** [test_dynamic] Error 5
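
For readers unfamiliar with the values being written to /proc/cpu/alignment here: as far as the kernel's ARM alignment notes describe it, the value is a bit mask (bit 0 warn, bit 1 fixup, bit 2 signal), so 2 means "fixup" and 3 means "fixup+warn". The following is only an illustrative decoder; `decode_alignment_mode` is a hypothetical helper name, not part of any tool mentioned in this report.

```python
# Bit meanings of the ARM /proc/cpu/alignment user-fault mode.
# Assumption: bit 0 = warn, bit 1 = fixup, bit 2 = signal, as in the
# kernel's ARM mem_alignment notes. Listed high bit first so that the
# joined name reads "fixup+warn", matching the wording in this thread.
ALIGNMENT_BITS = ((4, "signal"), (2, "fixup"), (1, "warn"))


def decode_alignment_mode(mode):
    """Return the names of the bits set in an alignment mode value."""
    names = [name for bit, name in ALIGNMENT_BITS if mode & bit]
    # A value of 0 means unaligned user accesses are left alone.
    return names or ["ignored"]
```

Calling `"+".join(decode_alignment_mode(3))` gives the same "fixup+warn" label used above for mode 3.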

Meanwhile, setting it to 3 (fixup+warn) causes one build failure to go away, and the following make test results:

(jaunty)mcasadevall@rimu:~/libipc-sharelite-perl-0.13$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/00-load......ok 1/1# Testing IPC::ShareLite 0.13
t/00-load......ok
t/pod..........ok
t/sharelite....NOK 13/14
# Failed test 'lock'
# at t/sharelite.t line 71.
# got: '1060'
# expected: '2000'
t/sharelite....NOK 14/14
# Failed test 'version inc'
# at t/sharelite.t line 72.
# got: '1064'
# expected: '2004'
# Looks like you failed 2 tests of 14.
t/sharelite....dubious
 Test returned status 2 (wstat 512, 0x200)
DIED. FAILED tests 13-14
 Failed 2/14 tests, 85.71% okay
Failed Test Stat Wstat Total Fail List of Failed
-------------------------------------------------------------------------------
t/sharelite...

Changed in libipc-sharelite-perl:
assignee: ogra → mcasadevall

On further testing, I've discovered that the tests that actually fail are inconsistent. If the test suite is run multiple times, the number of failing tests varies between 3 and 5. It looks like memory corruption or a buffer overflow, but that makes no sense since we should be seeing that consistently on all ARM boards ...

Here's a debdiff that works around the issue. This package has been successfully built in a devirtualized PPA on armel.

After much debugging, the issue turns out to be a race. On non-armel platforms, shared memory operations appear to be atomic, but on very fast ARM machines these operations have a slight lag, which causes the test suite failures. I'm trying to see if I can cobble together a test suite to isolate the issue specifically.
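
To make the failing 'lock' check concrete, here is a rough sketch of the kind of test involved. It assumes (inferred only from the expected value '2000', not taken from t/sharelite.t) that two processes each increment a shared counter 1000 times under an exclusive lock. It is not the IPC::ShareLite code path: an anonymous shared `mmap` and `fcntl.flock` stand in for the System V shared memory and semaphore calls, and `run_locked_counter` is a hypothetical name.

```python
import fcntl
import mmap
import os
import struct
import tempfile


def run_locked_counter(workers=2, increments=1000):
    """Fork `workers` processes that each add 1 to a shared counter
    `increments` times under an exclusive lock; return the final value."""
    shared = mmap.mmap(-1, 4)  # anonymous mapping, shared across fork()
    shared[:4] = struct.pack("i", 0)
    with tempfile.NamedTemporaryFile() as lock_file:
        pids = []
        for _ in range(workers):
            pid = os.fork()
            if pid == 0:
                status = 1
                try:
                    # Each child opens its own descriptor so that the
                    # flock() locks of different children really contend.
                    fd = os.open(lock_file.name, os.O_RDWR)
                    for _ in range(increments):
                        fcntl.flock(fd, fcntl.LOCK_EX)
                        (value,) = struct.unpack("i", shared[:4])
                        shared[:4] = struct.pack("i", value + 1)
                        fcntl.flock(fd, fcntl.LOCK_UN)
                    status = 0
                finally:
                    os._exit(status)  # never fall back into parent code
            pids.append(pid)
        for pid in pids:
            os.waitpid(pid, 0)
    (total,) = struct.unpack("i", shared[:4])
    shared.close()
    return total
```

When locking and shared memory behave as expected, `run_locked_counter()` returns `workers * increments`. A result such as the '1008' seen in the build log would mean that increments from one process were lost or not yet visible to the other, which is the symptom described in this comment.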

Changed in linux:
assignee: nobody → mcasadevall
importance: Undecided → High
status: New → Triaged
description: updated
Matt Zimmerman (mdz) on 2009-03-25
tags: added: arm
Steve Langasek (vorlon) wrote :

As the bug has been identified as a kernel issue, and the distinction between not running the test suite and running the test suite with kludges applied to ensure it passes is largely immaterial, I'm dropping the jaunty target for libipc-sharelite-perl itself.

Changed in libipc-sharelite-perl:
status: Triaged → Won't Fix

I'm downgrading this bug from High to Low. As far as I can tell, this seems to be an issue with either the buildd hardware or kernel, and not a general ARM issue as it can't be reproduced on any other fast ARM hardware it seems ...

Changed in linux (Ubuntu Jaunty):
importance: High → Low
status: Triaged → Invalid
Changed in libipc-sharelite-perl (Ubuntu):
importance: High → Low
status: Triaged → Confirmed
Steve Langasek (vorlon) wrote :

Is the buildd not running the Ubuntu kernel?

Loïc Minier (lool) wrote :

No; handbuilt vanilla 2.6.27 or .28 with a couple of patches.

Paul Larson (pwlars) on 2009-06-05
tags: added: armel
removed: arm
Loïc Minier (lool) wrote :

I don't think sleeping during the testsuite is a valuable fix; we either care to fix the root cause or to keep it visible. Hiding it with a sleep isn't too nice IMO.

Loïc Minier (lool) wrote :

Unassigning Michael as he has not been recently working on this; what needs to happen here:
- checking whether libipc-sharelite-perl uses the libipc API correctly or incorrectly
- if API is used correctly, building a minimal C test case of the issue on this kernel

Changed in libipc-sharelite-perl (Ubuntu):
assignee: Michael Casadevall (mcasadevall) → nobody
Loïc Minier (lool) on 2009-09-08
Changed in libipc-sharelite-perl (Ubuntu):
assignee: nobody → Loïc Minier (lool)
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libipc-sharelite-perl - 0.17-1+build1

---------------
libipc-sharelite-perl (0.17-1+build1) karmic; urgency=low

  * Drop bogus changes of 0.17-1ubuntu1; the testsuite should now pass on new
    buildds. It seems the issue was specific to a particular SoC/kernel
    combination. LP: #299847.
  * Update version number to reflect that we dont have any Ubuntu specific
    changes now.

 -- Loic Minier <email address hidden> Tue, 08 Sep 2009 10:26:10 +0200

Changed in libipc-sharelite-perl (Ubuntu):
status: Fix Committed → Fix Released