Segmentation fault while sending large arrays

Bug #231062 reported by Bram Metsch
This bug affects 3 people
Affects         Status   Importance  Assigned to  Milestone
mpich (Debian)  New      Unknown
mpich (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

Using mpich from Ubuntu's packages, the attached program crashes with the following output:

$ mpirun -np 2 ./test
MPI_Init returned 0
myid = 0 i = 1
myid = 0 numprocs = 2
myid = 0: Sending to 1 size = 16000
p0_12348: p4_error: interrupt SIGSEGV: 11
p0_12348: (0.031250) net_send: could not write to fd=12, errno = 32
rm_l_1_12357: (0.000000) net_send: could not write to fd=5, errno = 32
MPI_Init returned 0
myid = 1 numprocs = 2
myid = 1: Receiving from 0 size = 16000

I compiled the executable as follows:

mpicc -Wall -c -o test.o -Wall -g3 test.c
mpicc test.o -o test

I have the following packages installed:

ii libmpich-mpd1.0-dev 1.2.7-8 mpich static libraries and development files
ii libmpich-mpd1.0gf 1.2.7-8 mpich-mpd runtime shared library
ii libmpich1.0-dev 1.2.7-8 mpich static libraries and development files
ii libmpich1.0gf 1.2.7-8 mpich runtime shared library
ii mpich-bin 1.2.7-8 MPI parallel computing system implementation
ii mpich-mpd-bin 1.2.7-8 MPI parallel computing system implementation, MPD version

The alternatives are defined as follows:

lrwxrwxrwx 1 root root 32 May 13 12:23 /etc/alternatives/libmpi++.a -> /usr/lib/mpich/lib/libpmpich++.a
lrwxrwxrwx 1 root root 40 May 13 12:23 /etc/alternatives/libmpi++.so -> /usr/lib/mpich/lib/shared/libpmpich++.so
lrwxrwxrwx 1 root root 29 May 13 12:23 /etc/alternatives/libmpi.a -> /usr/lib/mpich/lib/libmpich.a
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/libmpi.so -> /usr/lib/mpich/lib/shared/libmpich.so
lrwxrwxrwx 1 root root 44 May 14 15:59 /etc/alternatives/libmpiuni.a -> /usr/lib/petscdir/2.3.3/lib/libO/libmpiuni.a
lrwxrwxrwx 1 root root 22 May 13 12:23 /etc/alternatives/mpi -> /usr/lib/mpich/include
lrwxrwxrwx 1 root root 20 May 13 12:23 /etc/alternatives/mpiCC -> /usr/bin/mpiCC.mpich
lrwxrwxrwx 1 root root 36 May 13 12:23 /etc/alternatives/mpiCC.1.gz -> /usr/share/man/man1/mpiCC.mpich.1.gz
lrwxrwxrwx 1 root root 20 May 13 12:23 /etc/alternatives/mpicc -> /usr/bin/mpicc.mpich
lrwxrwxrwx 1 root root 36 May 13 12:23 /etc/alternatives/mpicc.1.gz -> /usr/share/man/man1/mpicc.mpich.1.gz
lrwxrwxrwx 1 root root 27 May 13 12:23 /etc/alternatives/mpichversion -> /usr/bin/mpichversion.mpich
lrwxrwxrwx 1 root root 43 May 13 12:23 /etc/alternatives/mpichversion.1.gz -> /usr/share/man/man1/mpichversion.mpich.1.gz
lrwxrwxrwx 1 root root 21 May 13 12:23 /etc/alternatives/mpicxx -> /usr/bin/mpicxx.mpich
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/mpicxx.1.gz -> /usr/share/man/man1/mpicxx.mpich.1.gz
lrwxrwxrwx 1 root root 24 May 13 12:05 /etc/alternatives/mpiexec -> /usr/bin/mpiexec.openmpi
lrwxrwxrwx 1 root root 40 May 13 12:05 /etc/alternatives/mpiexec.1.gz -> /usr/share/man/man1/mpiexec.openmpi.1.gz
lrwxrwxrwx 1 root root 21 May 13 12:23 /etc/alternatives/mpif77 -> /usr/bin/mpif77.mpich
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/mpif77.1.gz -> /usr/share/man/man1/mpif77.mpich.1.gz
lrwxrwxrwx 1 root root 21 May 13 12:23 /etc/alternatives/mpif90 -> /usr/bin/mpif90.mpich
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/mpif90.1.gz -> /usr/share/man/man1/mpif90.mpich.1.gz
lrwxrwxrwx 1 root root 21 May 13 12:23 /etc/alternatives/mpiman -> /usr/bin/mpiman.mpich
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/mpiman.1.gz -> /usr/share/man/man1/mpiman.mpich.1.gz
lrwxrwxrwx 1 root root 26 May 13 12:23 /etc/alternatives/mpireconfig -> /usr/bin/mpireconfig.mpich
lrwxrwxrwx 1 root root 42 May 13 12:23 /etc/alternatives/mpireconfig.1.gz -> /usr/share/man/man1/mpireconfig.mpich.1.gz
lrwxrwxrwx 1 root root 21 May 13 12:23 /etc/alternatives/mpirun -> /usr/bin/mpirun.mpich
lrwxrwxrwx 1 root root 37 May 13 12:23 /etc/alternatives/mpirun.1.gz -> /usr/share/man/man1/mpirun.mpich.1.gz
lrwxrwxrwx 1 root root 38 May 14 15:59 /etc/alternatives/mpirun_lam -> /usr/lib/petscdir/2.3.3/bin/mpirun_lam
lrwxrwxrwx 1 root root 43 May 14 15:59 /etc/alternatives/mpirun_lam.1.gz -> /usr/lib/petscdir/2.3.3/bin/mpirun_lam.1.gz

If I just reduce the macro "size" by one, the program works:
$ mpirun -np 2 ./test
MPI_Init returned 0
myid = 0 i = 1
myid = 0 numprocs = 2
myid = 0: Sending to 1 size = 15999
MPI_Init returned 0
myid = 1 numprocs = 2
myid = 1: Receiving from 0 size = 15999

Using a self-built version of mpich-1.2.7p1, everything is OK even for much larger values of "size", e.g.

$ mpirun -np 2 ./test
MPI_Init returned 0
myid = 0 i = 1
myid = 0 numprocs = 2
myid = 0: Sending to 1 size = 1600000
MPI_Init returned 0
myid = 1 numprocs = 2
myid = 1: Receiving from 0 size = 1600000

Bram Metsch (metsch) wrote :

Sorry, I attached the wrong file last time. The right one is attached to this post.

Jan Hama (hamaeker) wrote :

Hi!

I also see this bug on a 64-bit system.
However, the test code works on a 32-bit system.

Michael Helmling (supermihi) wrote :

I think this library is the package that causes the bug. I can confirm it on several Hardy machines, and it also appears in Debian testing.

Michael Helmling (supermihi) wrote :

Isn't anybody able to fix this? That bug is *very* annoying!

Emmanuel FARHI (farhi) wrote :

One possible solution is to split arrays into small blocks, e.g. sending/receiving only 1000 or 10000 elements at a time, and to concatenate the messages to rebuild the original array.
I attach example implementations for the Reduce, Send and Recv calls (file: mpi_split-blocks.txt).
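
For readers without the attachment, here is a minimal sketch of the block-splitting idea. It is not the contents of mpi_split-blocks.txt; the names for_each_chunk, chunk_fn, copy_ctx and copy_chunk are hypothetical, and the element type double is an assumption. The per-block transfer is a callback so that the splitting logic can be compiled and checked without MPI; copy_chunk only stands in for the receiving side by appending each block to a destination buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: handles `count` doubles starting at `buf`.
 * Returns 0 on success, non-zero on failure. */
typedef int (*chunk_fn)(double *buf, size_t count, void *ctx);

/* Walk `total` elements in blocks of at most `chunk` elements and hand
 * each block to `fn`. Returns the number of blocks processed, or -1 if
 * `chunk` is 0 or the callback reports a failure. */
long for_each_chunk(double *data, size_t total, size_t chunk,
                    chunk_fn fn, void *ctx)
{
    long blocks = 0;
    size_t offset = 0;

    if (chunk == 0)
        return -1;

    while (offset < total) {
        size_t n = total - offset;
        if (n > chunk)
            n = chunk;              /* last block may be shorter */
        if (fn(data + offset, n, ctx) != 0)
            return -1;
        offset += n;
        blocks++;
    }
    return blocks;
}

/* Stand-in for the receiving side: appends each block to `dst`,
 * which rebuilds the original array block by block. */
struct copy_ctx {
    double *dst;
    size_t written;
};

int copy_chunk(double *buf, size_t count, void *ctx)
{
    struct copy_ctx *c = (struct copy_ctx *)ctx;
    memcpy(c->dst + c->written, buf, count * sizeof(double));
    c->written += count;
    return 0;
}

/* With MPI, the sending callback could wrap MPI_Send (assumption, not
 * compiled here):
 *
 *   int send_chunk(double *buf, size_t count, void *ctx) {
 *       int dest = *(int *)ctx;
 *       return MPI_Send(buf, (int)count, MPI_DOUBLE, dest, 0,
 *                       MPI_COMM_WORLD) == MPI_SUCCESS ? 0 : -1;
 *   }
 */
```

The receiver would run the same loop with a callback that wraps MPI_Recv, using the same block size on both sides so that the messages pair up in order.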

Ralf Wildenhues (wildenhues) wrote :

You can't seriously recommend changing user code just to cope with a bug in Ubuntu's build of MPICH.
Please rebuild this package from source. There is likely just a 32-bit vs 64-bit issue in some header that broke the build.

This is a bug in the Debian/Ubuntu packaging; building the upstream software produces a flawless MPICH.

Ralf Wildenhues (wildenhues) wrote :

Upstream Debian bug that is likely the same issue:
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498213>

Changed in mpich (Debian):
status: Unknown → New
Bram Metsch (metsch) wrote :

The bug does not affect lucid.

Thomas Hotz (thotz-deactivatedaccount) wrote :

Ubuntu versions starting with Lucid (10.04 LTS) have mpich2, so I'm closing this bug.

Changed in mpich (Ubuntu):
status: New → Invalid