using valgrind with mpirun causes problems

Bug #1045326 reported by Evren Yurtesen on 2012-09-03
This bug affects 8 people
Affects          Status        Importance  Assigned to  Milestone
mpich (Debian)   Fix Released  Unknown
mpich2 (Ubuntu)  Confirmed     Undecided   Unassigned

Bug Description

I am compiling a test program on Ubuntu 12.04 with the latest updates:
http://people.sc.fsu.edu/~jburkardt/f_src/mpi_stubs/hello.f90

It works fine normally:
$ ./a.out

HELLO_WORLD - Master process:
  FORTRAN90 version
  An MPI test program.

  The number of processes is 1

  Process 0 says "Hello, world!"

HELLO_WORLD:
  Normal end of execution.
eyurtese@supremum:~$ mpirun -n 2 ./a.out

HELLO_WORLD - Master process:
  FORTRAN90 version
  An MPI test program.

  The number of processes is 2

  Process 0 says "Hello, world!"
  Process 1 says "Hello, world!"

HELLO_WORLD:
  Normal end of execution.

If I run it with valgrind:
$ mpirun -n 2 valgrind ./a.out
==26486== Memcheck, a memory error detector
==26486== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==26486== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==26486== Command: ./a.out
==26486==
==26487== Memcheck, a memory error detector
==26487== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==26487== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==26487== Command: ./a.out
==26487==
==26486== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==26486== the SIGRT32 signal is used internally by Valgrind
==26487== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==26487== the SIGRT32 signal is used internally by Valgrind
cr_libinit.c:183 cri_init: sigaction() failed: Invalid argument
cr_libinit.c:183 cri_init: sigaction() failed: Invalid argument
==26486==
==26486== HEAP SUMMARY:
==26486== in use at exit: 0 bytes in 0 blocks
==26486== total heap usage: 5 allocs, 5 frees, 8,165 bytes allocated
==26486==
==26486== All heap blocks were freed -- no leaks are possible
==26486==
==26486== For counts of detected and suppressed errors, rerun with: -v
==26486== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
==26487==
==26487== HEAP SUMMARY:
==26487== in use at exit: 0 bytes in 0 blocks
==26487== total heap usage: 5 allocs, 5 frees, 8,165 bytes allocated
==26487==
==26487== All heap blocks were freed -- no leaks are possible
==26487==
==26487== For counts of detected and suppressed errors, rerun with: -v
==26487== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)

I have manually compiled Valgrind 3.8 and I am getting the same error, which makes me think that the problem is mpich2.

I also found a thread in which somebody had the same problem and fixed it by upgrading mpich2 from 1.3 to 1.4. We already have 1.4.1 in Ubuntu, so perhaps an update to 1.4.1p1 would help? (I think it has nothing to do with the versions as such, but compiling against newer libraries may help.)
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-August/010608.html

If the problem is libcr, then somebody should add a note to recompile the mpich2 package every time libcr is updated.
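
One way to check the libcr suspicion is to look at the program's dynamic dependencies. This is only a sketch: it assumes the BLCR library shows up as `libcr.so.*` in `ldd` output, and the helper name `has_libcr` is made up for illustration.

```shell
# Succeeds if ldd-style output on stdin mentions BLCR's libcr.
# The pattern requires "libcr.so", so libc.so and libcrypt.so do not match.
has_libcr() {
    grep -q 'libcr\.so'
}

# Usage (a.out is the test program from this report):
#   ldd ./a.out | has_libcr && echo "a.out pulls in libcr"
```

If the check succeeds, the `cr_libinit.c` messages in the Valgrind run most likely come from that library's start-up code rather than from the test program itself.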

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: mpich2 1.4.1-1ubuntu1
ProcVersionSignature: Ubuntu 3.2.0-29.46-generic 3.2.24
Uname: Linux 3.2.0-29-generic x86_64
NonfreeKernelModules: fglrx
ApportVersion: 2.0.1-0ubuntu12
Architecture: amd64
Date: Mon Sep 3 16:14:10 2012
InstallationMedia: This
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: mpich2
UpgradeStatus: No upgrade log present (probably fresh install)

Evren Yurtesen (eyurtese-g) wrote :

Hello,

Could you please report this bug to mpich2 upstream?

https://trac.mcs.anl.gov/projects/mpich2

You'll need to register and login before filing the ticket.

  -- Pavan


Evren Yurtesen (eyurtese-g) wrote :

I believe the problem is related to something Ubuntu provides incorrectly and is not a bug in mpich2. I have mpich2 working fine on SL6. I will try to compile my own mpich2 installation on Ubuntu, re-test, and report back.

I have a bad flu right now, so it might take a week or so before I can come back with more detailed information. But as I pointed out before, this seems to have happened before to a person who updated his libraries but not his mpich2 installation (as far as I understand from the thread I mentioned in my first post).

Evren Yurtesen (eyurtese-g) wrote :

I tried to compile mpich2 from sources and the problem seems to be related to checkpointing (blcr package).

blcr does not seem to work properly with kernel 3.x and later, since the kernel module does not compile:
https://bugs.launchpad.net/ubuntu/+source/blcr/+bug/804943
-------------------------------------------------------------------------------------------------------------------
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  blcr-dkms
0 upgraded, 1 newly installed, 0 to remove and 32 not upgraded.
Need to get 0 B/892 kB of archives.
After this operation, 4,440 kB of additional disk space will be used.
Selecting previously unselected package blcr-dkms.
(Reading database ... 229696 files and directories currently installed.)
Unpacking blcr-dkms (from .../blcr-dkms_0.8.2-15ubuntu2.1_all.deb) ...
Setting up blcr-dkms (0.8.2-15ubuntu2.1) ...

Creating symlink /var/lib/dkms/blcr/0.8.2/source ->
                 /usr/src/blcr-0.8.2

DKMS: add completed.

Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area....
make KERNELRELEASE=3.2.0-29-generic -C /lib/modules/3.2.0-29-generic/build M=/var/lib/dkms/blcr/0.8.2/build.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 3.2.0-29-generic (x86_64)
Consult /var/lib/dkms/blcr/0.8.2/build/make.log for more information.
-------------------------------------------------------------------------------------------------------------------
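
The DKMS output above points at `make.log` for details. A small sketch for pulling the first error lines out of that log; the function name `first_build_errors` is hypothetical, and the default path is simply the one printed by DKMS above.

```shell
# Print up to ten lines (with line numbers) that mention "error",
# case-insensitively, from a DKMS build log.
first_build_errors() {
    log="${1:-/var/lib/dkms/blcr/0.8.2/build/make.log}"
    grep -n -i -m 10 'error' "$log"
}

# Usage:
#   first_build_errors                      # log path from the DKMS message
#   first_build_errors /path/to/other.log   # any other build log
```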

So BLCR is essentially useless without the kernel module, and nobody can actually be using this feature as packaged:
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#patch

Why does Ubuntu have to enable BLCR when compiling mpich2? Isn't it possible to not enable checkpointing?

If I compile mpich2 without the '--enable-checkpointing --with-hydra-ckpointlib=blcr' configure options, then everything works fine.
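
For anyone who wants to try the same workaround, here is a rough sketch of a private build that simply leaves those two options out. The function name, the source directory and the install prefix are all hypothetical, and `RUN=echo` is just a dry-run convenience added here, not something from mpich2.

```shell
# Configure, build and install mpich2 from an unpacked source tree
# WITHOUT --enable-checkpointing / --with-hydra-ckpointlib=blcr.
#   $1 = unpacked mpich2 source directory (hypothetical location)
#   $2 = install prefix (hypothetical location)
# Set RUN=echo beforehand to print the commands instead of running them.
build_mpich2_nockpt() {
    src_dir="$1"
    prefix="$2"
    (
        cd "$src_dir" || exit 1
        $RUN ./configure --prefix="$prefix" &&
        $RUN make &&
        $RUN make install
    )
}

# Usage:
#   build_mpich2_nockpt "$HOME/src/mpich2-1.4.1" "$HOME/opt/mpich2-nockpt"
```

After that, put the `bin` directory under the chosen prefix first in `PATH`, recompile the test program with the compiler wrapper from that installation, and rerun `mpirun -n 2 valgrind ./a.out`.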

Also, as a side note, can you upgrade mpich2 to 1.4.1p1? (Fedora seems to have had 1.4.1p1 since FC15!)
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads

Evren Yurtesen (eyurtese-g) wrote :

The problem continues on a freshly installed Ubuntu 12.10. It is due to the compile options Ubuntu is using and not a bug in mpich2, so I am not sure what the mpich2 developers can do in this situation. It is up to Ubuntu...

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mpich2 (Ubuntu):
status: New → Confirmed

Carl Ollivier-Gooch (cfog) wrote :

Any word on when this will be fixed? Or has it been fixed but not backported to 12.04?

Alex D (iueoqre) wrote :

This bug still exists in 15.10 and is very severe. It means that the Ubuntu mpich is nearly useless for anyone doing development and such people must build their own mpich.

Changed in mpich (Debian):
status: Unknown → Confirmed
Changed in mpich (Debian):
status: Confirmed → Fix Released