libopenmpi segfaults when electric fence is enabled

Bug #260027 reported by Patrick Farrell
This bug affects 1 person
Affects            Status         Importance   Assigned to   Milestone
openmpi (Ubuntu)   Fix Released   Undecided    Unassigned

Bug Description

Binary package hint: libopenmpi1

Electric Fence is a library for finding memory access bugs. It wraps malloc in such a way that a segfault
is issued upon invalid memory access.

When I load my MPI program with electric-fence enabled, MPI::Init issues a segfault:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb562a6c0 (LWP 2014)]
0xb5d56fb1 in opal_free_list_grow () from /usr/lib/libopen-pal.so.0
(gdb) bt
#0 0xb5d56fb1 in opal_free_list_grow () from /usr/lib/libopen-pal.so.0
#1 0xb5d57099 in opal_free_list_init () from /usr/lib/libopen-pal.so.0
#2 0xb2bed9b5 in ompi_osc_pt2pt_component_init () from /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so
#3 0xb7953d4a in ompi_osc_base_find_available () from /usr/lib/libmpi.so.0
#4 0xb791b032 in ompi_mpi_init () from /usr/lib/libmpi.so.0
#5 0xb793dd17 in PMPI_Init () from /usr/lib/libmpi.so.0
#6 0x0811c8dc in MPI::Init ()
#7 0x081189fa in main ()

That MPI::Init is crashing is suggestive: it very likely indicates an invalid
memory access.

What I expected to happen: I expected MPI::Init to execute without
crashing. I expected the program to possibly crash later on an invalid
memory access in my own code.

What actually happened: MPI::Init crashed.

I have attached test.cpp, a program which just
calls MPI::Init.

[11:45][pfarrell@turing:/tmp]$ cat test.cpp
#include <stdlib.h>
#include <unistd.h>

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI::Init(argc, argv);
  MPI::Finalize();

  return 0;
}
[11:46][pfarrell@turing:/tmp]$ mpiCC -o test test.cpp
[11:46][pfarrell@turing:/tmp]$ gdb ./test
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...
(gdb) set logging overwrite on
(gdb) set environment EF_PROTECT_BELOW 0
(gdb) set environment EF_DISABLE_BANNER 1
(gdb) set environment LD_PRELOAD /usr/lib/libefence.so.0.0
(gdb) handle SIG33 pass nostop noprint
Signal Stop Print Pass to program Description
SIG33 No No Yes Real-time event 33
(gdb) set pagination 0
(gdb) run
Starting program: /tmp/test
[Thread debugging using libthread_db enabled]
[New Thread 0xb7b6d6c0 (LWP 3047)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7b6d6c0 (LWP 3047)]
0xb7e24fb1 in opal_free_list_grow () from /usr/lib/libopen-pal.so.0
(gdb) backtrace full
#0 0xb7e24fb1 in opal_free_list_grow () from /usr/lib/libopen-pal.so.0
No symbol table info available.
#1 0xb7e25099 in opal_free_list_init () from /usr/lib/libopen-pal.so.0
No symbol table info available.
#2 0xb62309b5 in ompi_osc_pt2pt_component_init () from /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so
No symbol table info available.
#3 0xb7f30d4a in ompi_osc_base_find_available () from /usr/lib/libmpi.so.0
No symbol table info available.
#4 0xb7ef8032 in ompi_mpi_init () from /usr/lib/libmpi.so.0
No symbol table info available.
#5 0xb7f1ad17 in PMPI_Init () from /usr/lib/libmpi.so.0
No symbol table info available.
#6 0x08052cec in MPI::Init ()
No locals.
#7 0x0804f7f0 in main ()
No locals.
(gdb) info registers
eax 0xb6237800 -1239189504
ecx 0xb6237528 -1239190232
edx 0xb6237800 -1239189504
ebx 0xb7e4e458 -1209736104
esp 0xbfbaec80 0xbfbaec80
ebp 0xbfbaec98 0xbfbaec98
esi 0xb7614fec -1218359316
edi 0xb62f8000 -1238401024
eip 0xb7e24fb1 0xb7e24fb1 <opal_free_list_grow+241>
eflags 0x10286 [ PF SF IF RF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
(gdb) thread apply all backtrace

Thread 1 (Thread 0xb7b6d6c0 (LWP 3047)):
#0 0xb7e24fb1 in opal_free_list_grow () from /usr/lib/libopen-pal.so.0
#1 0xb7e25099 in opal_free_list_init () from /usr/lib/libopen-pal.so.0
#2 0xb62309b5 in ompi_osc_pt2pt_component_init () from /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so
#3 0xb7f30d4a in ompi_osc_base_find_available () from /usr/lib/libmpi.so.0
#4 0xb7ef8032 in ompi_mpi_init () from /usr/lib/libmpi.so.0
#5 0xb7f1ad17 in PMPI_Init () from /usr/lib/libmpi.so.0
#6 0x08052cec in MPI::Init ()
#7 0x0804f7f0 in main ()
(gdb) quit
The program is running. Exit anyway? (y or n) y
[11:47][pfarrell@turing:/tmp]$

[pfarrell@turing:~]$ lsb_release -rd
Description: Ubuntu 8.04
Release: 8.04

[pfarrell@turing:~]$ apt-cache policy libopenmpi1
libopenmpi1:
  Installed: 1.2.5-1ubuntu1.1
  Candidate: 1.2.5-1ubuntu1.1
  Version table:
 *** 1.2.5-1ubuntu1.1 0
        100 /var/lib/dpkg/status
     1.2.5-1ubuntu1 0
        500 http://gb.archive.ubuntu.com hardy/universe Packages

Revision history for this message
Patrick Farrell (pefarrell) wrote :

After struggling a bit to build libopenmpi1 with debugging symbols (even with libopenmpi-dbg installed, libopen-pal
does not have debugging symbols), I managed to get a more useful backtrace:

0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50, num_elements=1) at class/opal_free_list.c:113
113 OBJ_CONSTRUCT_INTERNAL(item, flist->fl_elem_class);
(gdb) bt
#0 0xb5cdd334 in opal_free_list_grow (flist=0xb2b46a50, num_elements=1) at class/opal_free_list.c:113
#1 0xb5cdd479 in opal_free_list_init (flist=0xb2b46a50, elem_size=56, elem_class=0xb2b46e20, num_elements_to_alloc=73, max_elements_to_alloc=-1, num_elements_per_alloc=1) at class/opal_free_list.c:78
#2 0xb2b381aa in ompi_osc_pt2pt_component_init (enable_progress_threads=false, enable_mpi_threads=false) at osc_pt2pt_component.c:173
#3 0xb792b67c in ompi_osc_base_find_available (enable_progress_threads=false, enable_mpi_threads=false) at base/osc_base_open.c:84
#4 0xb78e6abe in ompi_mpi_init (argc=5, argv=0xbfd61f84, requested=0, provided=0xbfd61e78) at runtime/ompi_mpi_init.c:411
#5 0xb7911a87 in PMPI_Init (argc=0xbfd61f00, argv=0xbfd61f04) at pinit.c:71
#6 0x0811ca6c in MPI::Init ()
#7 0x08118b8a in main ()

Patrick Farrell (pefarrell) wrote :

Expanding the OBJ_CONSTRUCT_INTERNAL macro with its definition in opal/class/opal_object.h, one finds that the faulting
statement is

((opal_object_t *) (item))->obj_class = (flist->fl_elem_class);

I modified the openmpi source to print out the argument to malloc, the returned pointer,
and the address of the above variable. Here is a modified source snippet of opal_free_list_grow,
annotated with the output of the debugging printouts:

    fprintf(stderr, "mpidebug: allocating %zu\n", (num_elements * flist->fl_elem_size) + sizeof(opal_list_item_t) + CACHE_LINE_SIZE);
    alloc_ptr = (unsigned char *)malloc(1 * ((num_elements * flist->fl_elem_size) +
                                        sizeof(opal_list_item_t) +
                                        CACHE_LINE_SIZE));
    fprintf(stderr, "mpidebug: allocated at memory address %p\n", alloc_ptr);

mpidebug: allocating 216
mpidebug: allocated at memory address 0xb62bdf28

    for(i=0; i<num_elements; i++) {
        opal_free_list_item_t* item = (opal_free_list_item_t*)ptr;
        if (NULL != flist->fl_elem_class) {
            do {
                if (0 == (flist->fl_elem_class)->cls_initialized) {
                    opal_class_initialize((flist->fl_elem_class));
                }
                fprintf(stderr, "mpidebug: accessing address %p\n", &((opal_object_t *) (item))->obj_class);
                ((opal_object_t *) (item))->obj_class = (flist->fl_elem_class);
                fprintf(stderr, "mpidebug: accessing address %p\n", &((opal_object_t *) (item))->obj_reference_count);
                ((opal_object_t *) (item))->obj_reference_count = 1;
                opal_obj_run_constructors((opal_object_t *) (item));
            } while (0);
        }
        opal_list_append(&(flist->super), &(item->super));
        ptr += flist->fl_elem_size;
    }

mpidebug: accessing address 0xb62be000

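As can be seen, the statement
((opal_object_t *) (item))->obj_class = (flist->fl_elem_class)
writes to address 0xb62be000, which is alloc_ptr + 216
(0xb62bdf28 + 0xd8), but only 216 bytes were allocated at alloc_ptr.
The store therefore lands just past the end of the allocation.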

Gabriel de Perthuis (g2p) wrote :

I "fixed" this by building the intrepid package.

sudo apt-get build-dep openmpi
dget -x http://archive.ubuntu.com/ubuntu/pool/universe/o/openmpi/openmpi_1.2.7~rc2-1ubuntu2.dsc
cd openmpi-1.2.7~rc2/
debuild --no-tgz-check -us -uc -i -I && sudo debi

Gabriel de Perthuis (g2p) wrote :

Sorry, this isn't fixed, disregard my last comment.

Cesare Tirabassi (norsetto) wrote :

Discussed and fix committed upstream (see http://www.open-mpi.org/community/lists/devel/2008/08/4607.php and follow-ups).

Changed in openmpi:
status: New → Fix Committed
Przemek K. (azrael)
Changed in openmpi (Ubuntu):
status: Fix Committed → Fix Released