Invalid results with OpenMPI because of --enable-heterogeneous

Bug #1731938 reported by Xavier Besseron
This bug affects 11 people
Affects           Status         Importance   Assigned to   Milestone
openmpi (Debian)  Fix Released   Unknown
openmpi (Ubuntu)  Fix Released   Undecided    Unassigned
  Bionic          Confirmed      Undecided    Unassigned
  Eoan            Fix Released   Undecided    Unassigned
  Focal           Fix Released   Undecided    Unassigned

Bug Description

It seems that OpenMPI is built with the option --enable-heterogeneous and that it causes invalid results during execution.

According to https://github.com/open-mpi/ompi/issues/171, which dates from 2013, this option is broken and should not be used anymore (even though it has never been removed or marked as deprecated).

On the latest Ubuntu release (Artful), this option causes invalid results. Here is a simple example:

    int A = 666, B = 42;
    MPI_Irecv(&A, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);
    MPI_Send(&B, 1, MPI_INT, my_rank, tag, comm);
    MPI_Wait(&req, &status);

    // After this, when Open MPI is built with --enable-heterogeneous, A != B (A should be 42)

The full example is attached. This happens with just a single process, when running with "mpirun -n 1 ./bug_openmpi_artful". The example is extracted and simplified from the Zoltan library, where I initially noticed the issue.

If I re-build the openmpi packages without the --enable-heterogeneous configure option, then this example works fine.
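
To check whether an installed Open MPI build has this option enabled, one can look at the output of `ompi_info`, which, as far as I know, includes a "Heterogeneous support" line. A minimal sketch under that assumption; the helper name `hetero_status` is made up for illustration:

```shell
#!/bin/sh
# hetero_status (hypothetical helper): reads `ompi_info` output on stdin and
# prints the value of its "Heterogeneous support" line (expected "yes" or "no"),
# or "unknown" if no such line is present.
hetero_status() {
    awk -F': *' '
        /Heterogeneous support/ { sub(/[ \t]+$/, "", $2); print $2; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Only query the real tool when it is installed.
if command -v ompi_info >/dev/null 2>&1; then
    echo "Heterogeneous support: $(ompi_info | hetero_status)"
fi
```

A "yes" here would suggest the build is affected by this bug; "no" would suggest the option was not used.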

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: libopenmpi-dev 2.1.1-6
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
Uname: Linux 4.13.0-16-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: amd64
CurrentDesktop: KDE
Date: Mon Nov 13 15:28:55 2017
InstallationDate: Installed on 2017-02-07 (279 days ago)
InstallationMedia: Ubuntu 16.10 "Yakkety Yak" - Release amd64 (20161012.2)
SourcePackage: openmpi
UpgradeStatus: Upgraded to artful on 2017-10-30 (13 days ago)

Xavier Besseron (besserox) wrote :

I think this bug should have high priority because it makes OpenMPI unusable and unreliable on Ubuntu Artful.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openmpi (Ubuntu):
status: New → Confirmed

Xavier Besseron (besserox) wrote :

Any progress on that issue?

Would it make sense to remove the option `--enable-heterogeneous` until this is fixed in Open MPI?

tvrusso (russo-bogodyn) wrote :

We have just spent today hunting down a user bug report for Xyce (which uses Trilinos and its Zoltan library) that turned out to be exactly this issue. The user was seeing strange results from Xyce, and one of our developers tracked it down to the fact that an MPI_Send "to" the same processor is not in fact received *by* that processor. It is easily seen using a program similar to the snippet in the initial report, which I will attach. If the program is compiled with mpicc and run with any number of processors, it will show "BAD BAD BAD" on every line reporting what was received by processor N from processor N, but no problem for those received by processor M from processor N with M != N.

I have also confirmed that building OpenMPI without "--enable-heterogeneous" makes the issue go away.

Amir Ishaque (amirishaque) wrote :

It would be good to see this issue addressed. I encountered this bug while running the regression tests for a parallel build of Xyce 6.9, and the error occurred in the Zoltan module. I wonder whether this has an impact on other packages built with MPI options, for example ngspice. Should we be suspicious of the openmpi packaged with Ubuntu 18.04?

Xavier Besseron (besserox) wrote :

I confirm this bug still exists in Ubuntu Bionic, which is annoying because it is an LTS.

tvrusso (russo-bogodyn) wrote :

I would just like to add that the README for OpenMPI on GitHub now has this text:

--enable-heterogeneous
  Enable support for running on heterogeneous clusters (e.g., machines
  with different endian representations). Heterogeneous support is
  disabled by default because it imposes a minor performance penalty.

  *** THIS FUNCTIONALITY IS CURRENTLY BROKEN - DO NOT USE ***

Adrian Croucher (acroucher) wrote :

I am using Ubuntu 18.04 (Bionic) and am trying to work around this bug by rebuilding the source package for openmpi, according to these instructions for Ubuntu 17.10:

https://github.com/firedrakeproject/firedrake/issues/1153

The recipe for rebuilding there is:

# Download openmpi source and install build dependencies
sudo apt-get -y build-dep openmpi
apt-get source openmpi
cd openmpi-*

# Fix compile flag that is broken on 17.10
sed -i '/enable-heterogeneous/d' debian/rules

# Build newly fixed package
debuild -uc -us -b

# Install the newly built package
sudo debi

However the build step is failing with the following error:

dh_auto_configure: ./configure --build=x86_64-linux-gnu --prefix=/usr --includedir=\${prefix}/include --mandir=\${prefix}/share/man --infodir=\${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=\${prefix}/lib/x86_64-linux-gnu --libexecdir=\${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --with-libfabric --with-psm --with-jdk-dir=/usr/lib/jvm/default-java --enable-mpi-java --enable-opal-btl-usnic-unit-tests --disable-wrapper-rpath --enable-mpi-thread-multiple --disable-silent-rules --enable-mpi-cxx --with-hwloc=/usr/ --with-libltdl=/usr/ --with-devel-headers --with-slurm --with-sge --without-tm --disable-vt --sysconfdir=/etc/openmpi --libdir=\${prefix}/lib/x86_64-linux-gnu/openmpi/lib --includedir=\${prefix}/lib/x86_64-linux-gnu/openmpi/include returned exit code 1
debian/rules:76: recipe for target 'override_dh_auto_configure' failed
make[1]: *** [override_dh_auto_configure] Error 2
make[1]: Leaving directory '/home/acro018/software/openmpi-2.1.1'
debian/rules:65: recipe for target 'build' failed
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
debuild: fatal error at line 1152:
dpkg-buildpackage -rfakeroot -us -uc -ui -b failed

Should I be using a different process in Ubuntu 18.04 to rebuild this package?

And has there been any progress in getting this bug fixed in the repositories?

Xavier Besseron (besserox) wrote :

@acroucher I retried building the package manually on Ubuntu Bionic on my side, and it failed because of the Java bindings. It is hard to know whether your issue is the same without seeing your config.log, but this is how I managed to work around it and rebuild the package on Bionic.

* Option 1: disable the Java MPI interface

/!\ This will potentially break any package that depends on that.

Before running `debuild -uc -us -b`, you can comment out the line containing `--enable-mpi-java` in the file `debian/rules`.

* Option 2: use the older OpenJDK 8 package

The reason is that OpenMPI looks for the command `javah`, which has been removed from OpenJDK releases after 8.

# Install OpenJDK 8
sudo apt install openjdk-8-jdk-headless:amd64

# Force OpenMPI to use OpenJDK 8
sed 's|--with-jdk-dir=/usr/lib/jvm/default-java|--with-jdk-dir=/usr/lib/jvm/java-8-openjdk-amd64|'

# Then you can run
debuild -uc -us -b
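
Before choosing between the two options, it may help to check which installed JDK still ships `javah`. A minimal sketch: the helper name `has_javah` is made up for illustration, and the two JDK paths are the ones mentioned in this comment, which may not exist on every system.

```shell
#!/bin/sh
# has_javah (hypothetical helper): succeeds if the JDK rooted at $1 contains
# an executable bin/javah.
has_javah() {
    [ -x "$1/bin/javah" ]
}

# Report on the default JDK and on OpenJDK 8 (paths as used in debian/rules
# and in the sed command above; adjust if your layout differs).
for jdk in /usr/lib/jvm/default-java /usr/lib/jvm/java-8-openjdk-amd64; do
    if has_javah "$jdk"; then
        echo "$jdk: javah found"
    else
        echo "$jdk: javah missing"
    fi
done
```

If only the OpenJDK 8 directory reports `javah found`, pointing `--with-jdk-dir` at it (Option 2) is the route that keeps the Java bindings.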

I hope this helps.

Xavier Besseron (besserox) wrote :

There was a typo in my previous comment, so here is the full list of commands to rebuild the OpenMPI package on Ubuntu Bionic without --enable-heterogeneous:

# Install OpenJDK 8
sudo apt install openjdk-8-jdk-headless

# Download openmpi source and install build dependencies
sudo apt-get -y build-dep openmpi
apt-get source openmpi
cd openmpi-*

# Remove --enable-heterogeneous configure option
sed -i '/enable-heterogeneous/d' debian/rules

# Force OpenMPI to use OpenJDK 8
sed -i 's|--with-jdk-dir=/usr/lib/jvm/default-java|--with-jdk-dir=/usr/lib/jvm/java-8-openjdk-amd64|' debian/rules

# Build newly fixed package
debuild -uc -us -b

# Install the newly built package
sudo debi
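
The two `sed` edits above can be tried on a scratch file before touching the real `debian/rules`. A minimal sketch: the helper name `fix_rules` and the mock file contents are made up for illustration and do not reproduce the real rules file. It relies on GNU `sed -i`, as the commands above do.

```shell
#!/bin/sh
# fix_rules (hypothetical helper): applies both edits to the file given as $1:
#   1. delete every line mentioning enable-heterogeneous
#   2. point --with-jdk-dir at OpenJDK 8
fix_rules() {
    sed -i \
        -e '/enable-heterogeneous/d' \
        -e 's|--with-jdk-dir=/usr/lib/jvm/default-java|--with-jdk-dir=/usr/lib/jvm/java-8-openjdk-amd64|' \
        "$1"
}

# Try it on a mock fragment (NOT the real debian/rules).
demo=$(mktemp)
cat > "$demo" <<'EOF'
COMMON_CONFIG_PARAMS = \
	--enable-mpi-cxx \
	--enable-heterogeneous \
	--with-jdk-dir=/usr/lib/jvm/default-java \
	--with-slurm
EOF
fix_rules "$demo"
cat "$demo"
rm -f "$demo"
```

Deleting the whole line is only safe when the option sits on a line of its own and is not the last line of a backslash-continued list; it is worth reading the resulting file before running `debuild`.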

Adrian Croucher (acroucher) wrote :

@Xavier Thanks very much, that fixes the problem for me.

(I also added a 'dch -i' so that the package version number is bumped; otherwise the package manager keeps trying to 'update' back to the original one.)
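
The version bump matters because the package manager compares version strings. A minimal sketch of that comparison using `dpkg --compare-versions`; the helper name `is_newer` is made up, and the version strings are illustrative rather than taken from the archive:

```shell
#!/bin/sh
# is_newer (hypothetical helper): succeeds when Debian version $1 sorts
# strictly after version $2 under dpkg's comparison rules.
is_newer() {
    dpkg --compare-versions "$1" gt "$2"
}

# Illustrative versions: a locally bumped revision vs. the unbumped one.
if command -v dpkg >/dev/null 2>&1; then
    if is_newer "2.1.1-8ubuntu1" "2.1.1-8"; then
        echo "bumped version sorts after the original"
    fi
fi
```

A rebuild that keeps the original version string is not "newer" in this sense, which is consistent with the behaviour described above.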

Thomas Heller (thom-heller) wrote :

Bumping this issue, as we had failures caused by this with the openmpi version in Bionic this Friday. Having this fixed would be highly appreciated.

Waldemberg D Ginú (bergginu) wrote :

Having this fixed is highly appreciated. (2x)

Jed Brown (jed-w) wrote :

The --enable-heterogeneous option was removed by Debian upstream for libopenmpi3 as a result of this bug report, but the change was never applied to libopenmpi2.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=886336

It will continue to cause breakage for users until it is patched.

Jeff Squyres (jsquyres-cisco) wrote :

Wow, just found this bug.

I am one of the upstream Open MPI developers; we just had this exact issue reported to us, and I reported it in https://bugs.launchpad.net/ubuntu/+source/openmpi/+bug/1838684. I guess I'll go mark that one as a duplicate of this one.

The fix for this issue is very, very simple -- someone just needs to fix Ubuntu's recipe for building Open MPI to *not* include --enable-heterogeneous.

We'd also strongly suggest that Ubuntu at least upgrade to the latest Open MPI 2.1.x (which is 2.1.6, at the time of this writing). There are later series available (we're up to Open MPI v4.0.x these days), but Open MPI v2.1.6 is ABI compatible with the already-shipping-in-Ubuntu Open MPI v2.1.1.

Paride Legovini (paride) wrote :

According to the changelog, this has been fixed in Debian with version 3.0.1~rc1-2 of src:openmpi. A fixed version of the package is available in Eoan and Focal. I agree it would be nice to fix this in Bionic too; this requires a Stable Release Update (SRU):

  https://wiki.ubuntu.com/StableReleaseUpdates

and as the openmpi package is in universe, the process needs to be community driven. Given that the fix is in the packaging bits, and not in openmpi itself, I expect it would go smoothly.

Jeff: updating to new upstream versions is normally out of the scope of SRUs (see link above). The next Ubuntu LTS release is just a couple of months away :)

Changed in openmpi (Ubuntu Focal):
status: Confirmed → Fix Released
Changed in openmpi (Ubuntu Eoan):
status: New → Fix Released
Changed in openmpi (Ubuntu Bionic):
status: New → Confirmed
Changed in openmpi (Debian):
status: Unknown → Fix Released

Ben Zwick (benzwick) wrote :

When will this be fixed? Ubuntu 18.04 LTS is supported until 2023-04, the bug is severe, and the fix is easy (repackage the same version of OpenMPI compiled without the not-recommended --enable-heterogeneous option). It is quite disappointing that after so many years it has not been fixed, as it makes the default MPI shipped with Ubuntu LTS basically unusable.

Also, what is the problem with updating from version 2.1.1 to 2.1.6 in an SRU? As Jeff said they are ABI compatible and surely this update cannot make the situation any worse than it is now!

Adrian Croucher (acroucher) wrote :

I have just tested this on a new Ubuntu 20.04 install and can confirm the bug finally appears to have been fixed. Thanks very much to everyone who helped make that happen :-)
