Binutils corrupts Open MPI

Reported by Dirk Eddelbuettel on 2008-05-25
Affects              Importance  Assigned to
binutils (Ubuntu)    Undecided   Unassigned
  Hardy              Undecided   Unassigned
  Intrepid           Undecided   Unassigned
openmpi (Ubuntu)     Undecided   Unassigned
  Hardy              Medium      Unassigned
  Intrepid           Undecided   Unassigned

Bug Description

Binary package hint: binutils

I think I found a bad bug in hardy. I do not know what it is, but I can pin
it down. It involves Open MPI when used with R via the Rmpi add-on package
for R. And I think it points to the toolchain, hence filed against binutils. This may
of course need re-assignment.

What you need installed comes via
  $ sudo apt-get install r-cran-rmpi

At a minimal level, you can try this (here running on my Debian testing box),
where we load the Rmpi add-on into R (thus dyn.loading libopenmpi1) and then
just print a simple hello world:

  edd@ron:~$ echo 'library(Rmpi); cat("Still alive\n")' | R --slave
  Still alive
  edd@ron:~$

On hardy with default packages:

  edd@joe:~$ echo 'library(Rmpi); cat("Still alive\n")' | R --slave
  [joe:29084] *** Process received signal ***
  [joe:29084] Signal: Segmentation fault (11)
  [joe:29084] Signal code: Address not mapped (1)
  [joe:29084] Failing at address: 0x8c92004
  [joe:29084] [ 0] [0xffffe440]
  [joe:29084] [ 1] /usr/lib/libopen-pal.so.0(free+0xc4) [0xb744cae4]
  [joe:29084] [ 2] /usr/lib/libopen-pal.so.0 [0xb7432d3e]
  [joe:29084] [ 3] /usr/lib/libopen-pal.so.0 [0xb74328ea]
  [joe:29084] [ 4] /usr/lib/libopen-pal.so.0(lt_dlforeachfile+0x3d) [0xb74329dd]
  [joe:29084] [ 5] /usr/lib/libopen-pal.so.0(mca_base_component_find+0x327) [0xb743b1d7]
  [joe:29084] [ 6] /usr/lib/libopen-pal.so.0(mca_base_components_open+0x18a) [0xb743bbca]
  [joe:29084] [ 7] /usr/lib/libopen-pal.so.0(opal_timer_base_open+0x7b) [0xb7451d6b]
  [joe:29084] [ 8] /usr/lib/libopen-pal.so.0(opal_init+0xcb) [0xb743066b]
  [joe:29084] [ 9] /usr/lib/libmpi.so.0(ompi_mpi_init+0x19) [0xb751b2b9]
  [joe:29084] [10] /usr/lib/libmpi.so.0(MPI_Init+0x18f) [0xb753e62f]
  [joe:29084] [11] /usr/lib/R/site-library/Rmpi/libs/Rmpi.so(mpi_initialize+0x54) [0xb7574154]
  [joe:29084] [12] /usr/lib/R/lib/libR.so [0xb7ce1c1d]
  [joe:29084] [13] /usr/lib/R/lib/libR.so(Rf_eval+0x714) [0xb7d08d64]
  [joe:29084] [14] /usr/lib/R/lib/libR.so [0xb7d098af]
  [joe:29084] [15] /usr/lib/R/lib/libR.so(Rf_eval+0x542) [0xb7d08b92]
  [joe:29084] [16] /usr/lib/R/lib/libR.so [0xb7d0aacf]
  [joe:29084] [17] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] [18] /usr/lib/R/lib/libR.so [0xb7d0a0b0]
  [joe:29084] [19] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] [20] /usr/lib/R/lib/libR.so(Rf_applyClosure+0x2ac) [0xb7d0c0fc]
  [joe:29084] [21] /usr/lib/R/lib/libR.so(Rf_eval+0x349) [0xb7d08999]
  [joe:29084] [22] /usr/lib/R/lib/libR.so [0xb7d0ab52]
  [joe:29084] [23] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] [24] /usr/lib/R/lib/libR.so [0xb7d0a0b0]
  [joe:29084] [25] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] [26] /usr/lib/R/lib/libR.so [0xb7d0ab52]
  [joe:29084] [27] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] [28] /usr/lib/R/lib/libR.so [0xb7d0a0b0]
  [joe:29084] [29] /usr/lib/R/lib/libR.so(Rf_eval+0x451) [0xb7d08aa1]
  [joe:29084] *** End of error message ***
  Segmentation fault
  edd@joe:~$ dpkg -l r-base-core libopenmpi1 r-cran-rmpi
  Desired=Unknown/Install/Remove/Purge/Hold
  | Status=Not/Installed/Config-f/Unpacked/Failed-cfg/Half-inst/t-aWait/T-pend
  |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
  ||/ Name         Version         Description
  +++-============-===============-==================================================================
  ii  libopenmpi1  1.2.5-1ubuntu1  high performance message passing library -- shared library
  ii  r-base-core  2.6.2-2         GNU R core of statistical computing language and environment
  ii  r-cran-rmpi  0.5-5-1         GNU R package interfacing MPI libraries for distributed computing
  edd@joe:~$
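
To see at a glance where such a backtrace spends its frames, the trace can be summarised per shared object. The following is a minimal sketch: `frames_per_library` is a hypothetical helper name, and it assumes the frame layout shown above (`[host:pid] [nn] /path/to/lib(...) [address]`).

```shell
#!/bin/sh
# Hypothetical helper: count backtrace frames per shared object.
# Reads a saved trace on stdin; frames without a file path are skipped.
frames_per_library() {
    # Keep only the library path of each frame, then count and rank them.
    sed -n 's|^.*\] \(/[^ (]*\).*$|\1|p' |
        awk '{ count[$0]++ } END { for (lib in count) print count[lib], lib }' |
        sort -rn
}
```

Saving the trace to a file and running `frames_per_library < trace.txt` should list libR.so with 18 frames, libopen-pal.so.0 with 8, libmpi.so.0 with 2 and Rmpi.so with 1. The crash itself sits in frame 1, inside the `free` of libopen-pal.so.0.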

Now, at work (using Ubuntu) I tend to just fetch from Debian sources via
apt-get source and rebuild locally. This failed when I recently rebuilt
libopenmpi1 -- but it worked with a package I rebuilt in early April. I had
chased the bug for a day or two, emailed Rmpi upstream -- no luck.

Now I just took the _exact same sources for openmpi_ and rebuilt on a Gutsy
7.10 machine that a colleague hadn't upgraded yet -- and it works.

The only difference was that I removed the Build-Depends on libibverbs-dev as
we don't have Infiniband yet. And just to be sure, I also rebuilt with
libibverbs-dev and it also works. All three packages are pristine -- I am
Debian maintainer for all three. The __only change vector is Ubuntu 7.10
versus 8.04__.

I suspect that the default Ubuntu builds now strip something they didn't use
to. Do you have any insight? For that matter, libopenmpi1 has three
different, interrelated libraries, which makes a call like this necessary:
  #ifdef OPENMPI
          /* dlopen() needs a binding mode (RTLD_LAZY or RTLD_NOW) next to RTLD_GLOBAL */
          dlopen("libmpi.so.0", RTLD_GLOBAL | RTLD_LAZY);
  #endif
However, that used to work like a charm and works on Debian stable, testing, ...

I'd be glad to help debug this, but I am of course a compiler newb so be
gentle.

Cheers, Dirk

Changed in binutils:
status: New → Invalid
status: New → Invalid

On Sun, Jun 15, 2008 at 07:30:25PM -0000, Cesare Tirabassi wrote:
> ** Also affects: openmpi (Ubuntu)
> Importance: Undecided
> Status: New
>
> ** Changed in: binutils (Ubuntu)
> Status: New => Invalid
>
> ** Changed in: binutils (Ubuntu Hardy)
> Status: New => Invalid

Care to explain?

As my bug report stated, the _only_ change was the Gutsy/Hardy upgrade
as I compiled the _same sources_.

This is no Open MPI bug. This is a toolchain bug that happens to break Open MPI.

Dirk

>
> --
> Binutils corrupts Open MPI
> https://bugs.launchpad.net/bugs/234837
> You received this bug notification because you are a direct subscriber
> of the bug.

--
Three out of two people have difficulties with fractions.

Steffen Neumann (sneumann) wrote :

I can add that the problem was introduced sometime
during the Hardy development cycle. We have a machine
that was installed around February/March with a then-current Hardy.
It includes a working Open MPI package. The update to the released Hardy
broke Open MPI.

Yours,
Steffen

(Whose bugs 210273 and 224706 have been marked as duplicates of this one.)
So if this bug is "invalid", I'd like to know whether someone is working on either of
the other two.

Yours,
Steffen

Dirk Eddelbuettel (edd) wrote :

On 15 June 2008 at 19:53, Steffen Neumann wrote:
| I can add that the problem was introduced sometime
| during the Hardy development cycle. We have a machine
| that was installed around February/March with a then-current Hardy.
| It includes a working Open MPI package. The update to the released Hardy
| broke Open MPI.

Yes, exactly what I observed too.

| Yours,
| Steffen
|
| (Whose bugs 210273 and 224706 have been marked as duplicates of this one.)
| So if this bug is "invalid", I'd like to know whether someone is working on either of
| the other two.

I am as stunned by this as Steffen and am eagerly awaiting an explanation.

Dirk
who happens to co-maintain Open MPI in Debian and suspects that Open MPI is
not at fault here but rather how the package was built for Hardy / whatever
changed for the compilers / linkers / ... in hardy as _the same source
package built on Gutsy_ works for me.


Cesare Tirabassi (norsetto) wrote :

I still have not received any answer to my emails (Friday 13 June @12:22:46 CEST and Sunday 15 June @21:25:20 CEST).
In summary, this is apparently caused by compiling openmpi with -Wl,-Bsymbolic-functions (default from latest dpkg-buildpackage).
Could this be because the three interrelated libraries in libopenmpi1 share symbols?
Can you and/or Manuel comment on this before we release a fix?
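
One way to check whether a build environment injects the flag named above is to inspect LDFLAGS before building. The following is a minimal sketch, assuming the flags reach the build through the LDFLAGS environment variable. `has_bsymbolic` is a hypothetical helper, not part of dpkg or Open MPI. It matches both spellings that appear in this report (-Bsymbolic-functions and the shorter -Bsymbolic).

```shell
#!/bin/sh
# Hypothetical helper: succeed if a linker-flag string contains
# -Bsymbolic or -Bsymbolic-functions.
has_bsymbolic() {
    case " $1 " in
        *-Bsymbolic*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Report on the LDFLAGS inherited from the current environment, if any.
if has_bsymbolic "${LDFLAGS:-}"; then
    echo "LDFLAGS contains a -Bsymbolic variant: ${LDFLAGS}"
else
    echo "LDFLAGS has no -Bsymbolic variant: '${LDFLAGS:-}'"
fi
```

Running this inside the shell that starts the package build gives a quick hint, but the build log remains the authoritative record of the flags that were actually used.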

Changed in openmpi:
status: New → Incomplete

Dirk Eddelbuettel (edd) wrote :

Cesare,

Thanks for your mail!

On 15 June 2008 at 20:53, Cesare Tirabassi wrote:
| I still have not received any answer to my emails (Friday 13 June @12:22:46 CEST and Sunday 15 June @21:25:20 CEST).

I replied to your email from Friday on Friday right when I got it, i.e. before
leaving for work:

  From: Dirk Eddelbuettel <email address hidden>
  To: Cesare Tirabassi <email address hidden>
  Subject: Re: SUSPECT: Re: [Rd] Rmpi segfault after install on Ubuntu Hardy Heron
  Date: Fri, 13 Jun 2008 07:27:37 -0500

and exim4 had no issues as far as I can see

  2008-06-13 07:27:37 1K78NV-00078e-Jd <= <email address hidden> U=edd P=local S=1869 <email address hidden>
  2008-06-13 07:27:39 1K78NV-00078e-Jd => <email address hidden> R=smarthost T=remote_smtp_smarthost H=smtp.g.comcast.net [76.96.30.117] X=TLS1.0:RSA_AES_256_CBC_SHA1:32 DN="C=US,ST=Pennsylvania,L=Philadelphia,O=Comcast Cable Communications Management LLC,OU=Business Center,CN=smtp.comcast.net"
  2008-06-13 07:27:39 1K78NV-00078e-Jd Completed

Did you get that or not?

Today I only received (automated) Launchpad messages from you but no direct
mail. I did send you one given that I had not heard from you.

Looks like there may be email troubles at your end? Just to be sure, I added
both your email addresses as CC.

| In summary, this is apparently caused by compiling openmpi with -Wl,-Bsymbolic-functions (default from latest dpkg-buildpackage).

If it is dpkg-buildpackage, it would 'hit' us too as we're building in
unstable. So I think it may be something else.

Could it be that Ubuntu has a different GCC default in stripping symbols or
something? The first person I talked to about this was doko, but he is always so
overwhelmed that he told me to please file a bug report, which I did.

Which promptly got ignored by everybody.

| Could this be because the three interrelated libraries in libopenmpi1 share symbols?

That was my immediate gut reaction as well. But that is a design choice Open
MPI made upstream, and "we" (as in Debian's Open MPI maintainers) do not fight
it.

*If* you guys changed build options, I'd start by reverting to what we do.

But if you guys changed nothing, well then I am at a loss.

| Can you and/or Manuel comment on this before we release a fix?

[ CCing Manuel who you forgot to CC. ]

Sure. Let's hash this out for a moment to get it right. Manuel may want to
comment as I may well have overlooked or forgotten something.

Dirk

| ** Changed in: openmpi (Ubuntu Intrepid)
| Status: New => Incomplete

Cesare Tirabassi (norsetto) wrote :

Well, looking at your buildd build log, your default LDFLAGS is "" while we do use -Wl,-Bsymbolic. I guess you never received my emails, as I explained that there in detail. Anyhow, compiling with LDFLAGS unexported in debian/rules works as expected:

root@norsetto:/root/debian# echo 'library(Rmpi); cat("Still alive\n")' | R --slave
[norsetto:13561] mca: base: components_open: component timer / linux open function failed
[norsetto:13561] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,0,0]: OpenIB on host norsetto was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Still alive

The above is with the patched and rebuilt 1.2.5-1.

You seem to confirm that upstream decided to share symbols between libraries, so it is a design choice and not a bug. I'm therefore happy to release this fix for Intrepid right now. For Hardy we need an SRU, which requires cooperation from you and possibly others, since at least three people must test and confirm that this works (and doesn't break anything else) before we can make it official.

I'll work on this first thing tomorrow morning.

Dirk Eddelbuettel (edd) wrote :

Just back in from a long run...

On 15 June 2008 at 22:49, Cesare Tirabassi wrote:
| Well, by looking at your buildd build log your default LDFLAGS is ""
| while we do use -Wl,-Bsymbolic.

That'll do to break the package, I suppose!

| I guess you never received my emails, as
| I explained that there in detail.

That email lossage is a side issue we should take up separately. Any idea
why I am not getting yours / you're not getting mine?

| Anyhow, compiling with LDFLAGS unexported
| in debian/rules works as expected:
|
| root@norsetto:/root/debian# echo 'library(Rmpi); cat("Still alive\n")' | R --slave
| [norsetto:13561] mca: base: components_open: component timer / linux open function failed
| [norsetto:13561] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)
| libibverbs: Fatal: couldn't read uverbs ABI version.
| --------------------------------------------------------------------------
| [0,0,0]: OpenIB on host norsetto was unable to find any HCAs.
| Another transport will be used instead, although this may result in
| lower performance.
| --------------------------------------------------------------------------
| Still alive

Perfect. And if you uncomment the line saying "btl = ^openib" in
/etc/openmpi/openmpi-mca-params.conf as in

  # Disable the use of InfiniBand
  # btl = ^openib
  btl = ^openib

you will suppress the noise telling you that you have no IB hardware... This
has now been improved upstream, by the way.
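
For a single user or session, the same effect can be had without touching the system-wide file. The following is a sketch, assuming an Open MPI build that reads MCA parameters from `OMPI_MCA_<name>` environment variables. The value mirrors the configuration line above.

```shell
#!/bin/sh
# Per-session alternative to editing /etc/openmpi/openmpi-mca-params.conf.
# Open MPI maps the environment variable OMPI_MCA_btl to the MCA parameter
# "btl"; the leading ^ excludes the listed component (here: openib).
OMPI_MCA_btl='^openib'
export OMPI_MCA_btl

# Echo the setting so it shows up in job logs.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"

# The same parameter can also be given per run on the mpirun command line:
#   mpirun --mca btl ^openib <program>
```

The environment variable only affects processes started from the shell that exported it, so hosts that do have InfiniBand hardware keep the system default.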

| The above is with the patched and rebuilt 1.2.5-1.

Yes, looks good! Thanks for that!

| You seem to confirm that upstream decided to share symbols between
| libraries, so it is a design choice and not a bug. I'm therefore
| happy to release this fix for Intrepid right now. For Hardy we need an
| SRU, which requires cooperation from you and possibly others, since at
| least three people must test and confirm that this works (and doesn't
| break anything else) before we can make it official.

Let me know how I can help. As I said, same code / same everything but built
on Gutsy (i.e. without -Wl,-Bsymbolic) works.

| I'll work on this first thing tomorrow morning.

Much appreciated. This seems to have bitten a few people, and is too obscure
for most to figure out by themselves.

Dirk


ilmarw (ilmar-wilbers) wrote :

I reported one of the bugs marked as a duplicate of this one (https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/224706), and have had problems using Python on top of openmpi. We have been building our own packages for openmpi based on Debian. A fix for Hardy would be highly appreciated, so that we no longer have to make people using our software add third-party repositories.

Bottom line, I would be happy to test and confirm this. In fact, there are several people here who would like to test such a fix.

ilmar

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openmpi - 1.2.6-2ubuntu1

---------------
openmpi (1.2.6-2ubuntu1) intrepid; urgency=low

  * debian/rules: unexport LDFLAGS
    fix SIGSEGV when initialising mpi (LP: #234837).
  * Modify Maintainer value to match the DebianMaintainerField
    specification.

 -- Cesare Tirabassi <email address hidden> Sun, 15 Jun 2008 21:39:00 +0200
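
The effect of the `unexport LDFLAGS` line can be reproduced outside the package. The following is a minimal sketch, assuming GNU make is installed. The two throwaway Makefiles and the injected flag value are illustrative and are not taken from the openmpi packaging.

```shell
#!/bin/sh
# Sketch only: file names and the injected flag value are illustrative.
# Shows that GNU make's `unexport` keeps an inherited LDFLAGS away from recipes.
if command -v make >/dev/null 2>&1; then
    tmp=$(mktemp -d)

    # Makefile that passes the inherited environment through unchanged.
    printf 'all:\n\t@echo "recipe sees LDFLAGS=[$$LDFLAGS]"\n' > "$tmp/plain.mk"

    # Same Makefile plus the one-line change named in the changelog.
    printf 'unexport LDFLAGS\nall:\n\t@echo "recipe sees LDFLAGS=[$$LDFLAGS]"\n' > "$tmp/fixed.mk"

    # Run both with LDFLAGS set the way a build wrapper might set it.
    plain=$(LDFLAGS="-Wl,-Bsymbolic-functions" make -s -f "$tmp/plain.mk")
    fixed=$(LDFLAGS="-Wl,-Bsymbolic-functions" make -s -f "$tmp/fixed.mk")

    echo "without unexport: $plain"
    echo "with unexport:    $fixed"
    rm -rf "$tmp"
else
    echo "make not found; skipping the demonstration"
fi
```

With GNU make, the first line should show the flag and the second an empty value. Since debian/rules is itself a Makefile, the directive keeps the inherited value out of the environment of the commands it runs, such as configure.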

Changed in openmpi:
status: Incomplete → Fix Released

Cesare Tirabassi (norsetto) wrote :

motu-sru is subscribed to give clearance for uploading the fix to hardy-proposed:

1. The openmpi package is currently broken in Hardy, since it requires compilation without the -Wl,-Bsymbolic linker option. Any attempt to initialise the library, either directly or through third-party applications, fails with a SIGSEGV.
2. The bug has been fixed in the development release (Intrepid) by disabling the dpkg-buildpackage LDFLAGS default.
3. A minimal patch applicable to the stable version of the package is attached.
4. TEST CASE:

A simple test case is defined here. It would be appreciated if users who use the library through other applications could also confirm the fix, or otherwise.

4.1 In Hardy, install the r-cran-rmpi package with your preferred package manager. This should drag in all required dependencies, including openmpi.

4.2 Try the following simple command in a terminal:

echo 'library(Rmpi); cat("Still alive\n")' | R --slave

This should fail with a SIGSEGV (Address/Memory not mapped) signal.

4.3 Enable the hardy-proposed repository. For instance, you can do this by adding the following line to /etc/apt/sources.list:

  deb http://archive.ubuntu.com/ubuntu/ hardy-proposed universe

Update your local cache and upgrade. This should upgrade at least the following packages:

openmpi-bin, libopenmpi1, openmpi-common

4.4 Re-run the command from step 4.2 in a terminal. Initialisation should be successful and you should see the message "Still alive" on the screen.

4.5 Disable the hardy-proposed repository.
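
Steps 4.2 and 4.4 can be wrapped in a small script so that testers report the same verdict. The following is a minimal sketch: the function names are hypothetical, and it relies on the usual shell convention that a process killed by signal N is reported as exit status 128+N, so SIGSEGV (11) shows up as 139.

```shell
#!/bin/sh
# Sketch of the test case above; the function names are hypothetical.

# Translate a shell exit status into a short verdict.
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "failed with exit status $status"
    fi
}

# Steps 4.2 and 4.4: load Rmpi in R and report how the process ended.
# Requires R and r-cran-rmpi; nothing is run until this function is called.
run_rmpi_check() {
    echo 'library(Rmpi); cat("Still alive\n")' | R --slave
    describe_status $?
}
```

Calling `run_rmpi_check` before and after the upgrade in step 4.3 should report a signal 11 verdict on the broken package and "ok" once the fix is installed.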

5. There is obviously a potential for regression. I believe this is mitigated by the fact that in Debian (and Ubuntu before the dpkg-buildpackage change) this has not shown any ill effect.

WARNING: Any user willing to help with testing is more than welcome. Please be aware that it may take some time before the updated packages are available on your mirror. The SRU needs to be approved and the package manually copied by an archive admin; this may take from one to several days.

Changed in openmpi:
importance: Undecided → Medium
status: New → Confirmed

ilmarw (ilmar-wilbers) wrote :

Where is LDFLAGS defined in the first place? We built the package from source without the patch, and this worked.

Also, should InfiniBand be disabled as Dirk suggested?

ilmar

Andreas Klöckner (inform) wrote :

Disabling IB was only about the "MEEP! You don't have IB hardware!" warning.

Andreas

ilmarw (ilmar-wilbers) wrote :

I know, but I find it annoying to have to edit /etc/openmpi/openmpi-mca-params.conf, and since it is disabled in newer versions of openmpi, it could just as well be removed.

ilmar

Luca Falavigna (dktrkranz) wrote :

ACK from motu-sru.

Dirk Eddelbuettel (edd) wrote :

On Mon, Jun 16, 2008 at 01:31:53PM -0000, ilmarw wrote:
> I know, but I find it annoying to have to edit
> /etc/openmpi/openmpi-mca-params.conf, and since it is disabled in newer
> versions of openmpi, it could just as well be removed.

"Our" (as in Debian Open MPI maintainers in discussion with upstream)
consensus was that it is still preferable to have the 'meep' rather
than to disable IB for those who actually have the hardware.

Luckily that is all somewhat moot now as the code will behave better
going forward and deal with this autoMAGICally.

Dirk


Jonathan Riddell (jr) wrote :

accepted into hardy-proposed, please test

Johannes Ring (johannr) wrote :

Nice! I had the same problems as reported in bug #224706 and now they are gone.

Johannes

ilmarw (ilmar-wilbers) wrote :

I can confirm that this indeed fixed the problems that I have been having (and that I reported in #224706).

ilmar

Olaf Lenz (olenz) wrote :

The fix works fine for me. Thanks!

Olaf

The fix worked for me too.

Cesare Tirabassi (norsetto) wrote :

Thanks to all that tested.
We have 4 positives and no negatives; I'm therefore asking the archive admins to publish this to hardy-updates.

Changed in openmpi:
status: Confirmed → Fix Committed

Martin Pitt (pitti) wrote :

Copied to hardy-updates.

Changed in openmpi:
status: Fix Committed → Fix Released