glibc 2.38 causes hangs on some openMPI-using packages

Bug #2031912 reported by Simon Chopin
This bug affects 1 person
Affects           Status        Importance  Assigned to  Milestone
dolfin (Ubuntu)   Invalid       Undecided   Unassigned
glibc (Ubuntu)    Fix Released  Critical    Unassigned
h5py (Ubuntu)     Invalid       Critical    Unassigned
mpi4py (Ubuntu)   Invalid       Undecided   Unassigned
openmpi (Ubuntu)  Invalid       Undecided   Unassigned

Bug Description

This occurs on amd64 and armhf.

Relevant logs:

706s tests/test_file.py::TestPathlibSupport::test_pathlib_name_match PASSED [ 65%]
706s tests/test_file.py::TestPickle::test_dump_error PASSED [ 65%]
10706s tests/test_file.py::TestMPI::test_mpio autopkgtest [16:36:57]: ERROR: timed out on command "su -s /bin/bash ubuntu -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest.izAQWQ/build.Q5d/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/autopkgtest.izAQWQ/python3-mpi-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/tmp/autopkgtest.izAQWQ/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; export 'ADT_TEST_TRIGGERS=glibc/2.38-1ubuntu3'; chmod +x /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi; touch /tmp/autopkgtest.izAQWQ/python3-mpi-stdout /tmp/autopkgtest.izAQWQ/python3-mpi-stderr; /tmp/autopkgtest.izAQWQ/build.Q5d/src/debian/tests/python3-mpi 2> >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stderr >&2) > >(tee -a /tmp/autopkgtest.izAQWQ/python3-mpi-stdout);" (kind: test)

Full logs: https://autopkgtest.ubuntu.com/results/autopkgtest-mantic/mantic/amd64/h/h5py/20230817_163711_290b0@/log.gz

Marking Critical as this blocks the glibc transition.

Simon Chopin (schopin)
Changed in glibc (Ubuntu):
importance: Undecided → Critical
Simon Chopin (schopin) wrote :

Adding mpi4py to the list of affected packages as it seems very likely to be the exact same bug.

Simon Chopin (schopin)
tags: added: foundations-todo
Simon Chopin (schopin)
summary: - h5py autopkgtests freeze when run against glibc 2.38
+ glibc 2.38 causes hangs on some openMPI-using packages
Simon Chopin (schopin) wrote :

According to https://qa.debian.org/excuses.php?experimental=1&package=glibc similar issues appear on the Debian infra, so this rules out our deltas. Also, they see the failures on amd64, which is great as it could make it easier to bisect :)

Athos Ribeiro (athos-ribeiro) wrote :

Dolfin builds seem to be affected as well. Adding a tracker here.

Simon Chopin (schopin) wrote :

For the record, I ran the tests against a glibc with the most recent stable patches (as of 2023-08-21), which didn't solve the problem: https://launchpad.net/~schopin/+archive/ubuntu/glibc-2.38-snapshot

Athos Ribeiro (athos-ribeiro) wrote :

This is the shortest reproducer I could come up with so far:

$ cat > test.py <<EOF
import pytest
from h5py import File

@pytest.mark.mpi
class TestMPI:
    def test_mpio(self, mpi_file_name):
        """ MPIO driver and options """
        from mpi4py import MPI

        with File(mpi_file_name, 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
            assert f
            assert f.driver == 'mpio'
EOF

$ mpirun -n 2 pytest -vvv --with-mpi test.py

You will need the following h5py test dependencies, plus glibc 2.38 from proposed, to reproduce it:

Depends: python3-all,
         python3-h5py,
         python3-h5py-mpi,
         python3-pytest,
         python3-pytest-mpi,
         python3-unittest2,
         mpi-default-bin

Danilo Egea Gondolfo (danilogondolfo) wrote :

Here is a shorter one:

$ cat > test.py << EOF
from h5py import File
from mpi4py import MPI

with File('/tmp/aaaa', 'w', driver='mpio', comm=MPI.COMM_WORLD) as f:
  print(f)
EOF

$ mpirun -n 2 python3 test.py

It's just not as reliable; sometimes it works hehe
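
Because the hang is intermittent, it can help to run the reproducer several times under a timeout and count how many runs get stuck. The following is a minimal sketch of such a harness; `count_hangs`, the attempt count, and the timeout are illustrative choices, not something used in this bug report.

```python
import os
import signal
import subprocess


def count_hangs(cmd, attempts=10, timeout_s=60.0):
    """Run cmd `attempts` times and count the runs that exceed timeout_s."""
    hangs = 0
    for _ in range(attempts):
        # A new session gives the launcher its own process group, so the
        # launcher and any processes it spawned can be killed together.
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,
        )
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            hangs += 1
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
    return hangs
```

For the reproducer above, the call would be something like `count_hangs(["mpirun", "-n", "2", "python3", "test.py"])`. A run that is merely slow is counted as a hang too, so the timeout should be generous.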

All the processes get stuck in a poll loop waiting for I/O on a couple of sockets. The timeout is zero, so the loop never sleeps and the processes eat the CPUs alive...
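
To see why a zero timeout is so costly: poll returns immediately when nothing is ready, so a loop around it never sleeps. The sketch below uses an idle pipe as a stand-in for the sockets (an assumption made for illustration; the actual Open MPI progress loop is not shown in this report) and compares a zero timeout with a 50 ms one.

```python
import os
import select
import time


def count_polls(timeout_ms, duration_s=0.2):
    """Poll an idle pipe for duration_s seconds and count the loop iterations."""
    read_fd, write_fd = os.pipe()
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    deadline = time.monotonic() + duration_s
    iterations = 0
    try:
        while time.monotonic() < deadline:
            # Nothing is ever written to the pipe, so this returns an empty
            # list: at once for timeout 0, after timeout_ms otherwise.
            poller.poll(timeout_ms)
            iterations += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return iterations


busy = count_polls(0)     # spins: every call returns immediately
sleepy = count_polls(50)  # blocks in the kernel between wake-ups
print(f"timeout 0 ms: {busy} iterations, timeout 50 ms: {sleepy} iterations")
```

The zero-timeout loop completes far more iterations in the same wall-clock time, which is what shows up as processes pinned at full CPU while they wait.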

Sergio Durigan Junior (sergiodj) wrote :

Athos and I had a fun conversation about this bug, which prompted me to look more deeply into what's going on. I think I found why the bug is happening. It's an interesting race involving multithreading, semaphores and "open(2)" flags.

After some GDB/strace analysis, and having the gut feeling that this is one of those "race-condition between runs due to something happening in the filesystem", I initially found that the problem happens because, on the first run (and then on subsequent runs that are odd-numbered), the semaphore created by openmpi (called /dev/shm/sem.OMPIO_aaaa) doesn't exist. This causes sem_open[0] to fail to open the file (because of the O_CREAT flag used here[1]). Note that this open(2) is performed concurrently by the two threads created by mpirun. Also note that this first failure is expected, because sem_open is being invoked with O_CREAT (this time a sem_open flag!) by openmpi.

As can be seen, when sem_open fails to open the file at the location mentioned above it will clean things up and go to the label "try_again". This time, we're inside a section of the code which expects the semaphore file to exist. As such, O_CREAT (the "open" flag) needs to be removed from open_flags, but it isn't because [2] is above the label.

I'm still building glibc with the proposed change to test the fix, but I'm pretty sure that that line needs to be moved inside the label, so I submitted [3]. I'll report back on the results of the test tomorrow.
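
The effect of the stale flags can be modelled outside glibc. The sketch below is not the glibc code: `retry_open` and `reset_flags` are hypothetical names, a temporary file stands in for /dev/shm/sem.OMPIO_aaaa, and it assumes the creation path leaves both O_CREAT and O_EXCL set in the flags. Under that assumption, retrying the open on a file that another process has just created fails with EEXIST unless the flags are recomputed at the retry point.

```python
import errno
import os
import tempfile


def retry_open(path, reset_flags):
    """Model the retry that expects the semaphore file to exist already."""
    base_flags = os.O_RDWR | os.O_NOFOLLOW | os.O_CLOEXEC
    # State left behind by the earlier creation attempt (assumed).
    open_flags = base_flags | os.O_CREAT | os.O_EXCL
    if reset_flags:
        # Models the proposed fix: recompute the flags at the retry point.
        open_flags = base_flags
    try:
        fd = os.open(path, open_flags, 0o600)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    os.close(fd)
    return "ok"


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sem.demo")
    # The "other process" won the race and has already created the file.
    open(path, "w").close()
    stale = retry_open(path, reset_flags=False)
    fresh = retry_open(path, reset_flags=True)

print(stale, fresh)
```

With the stale flags the retry reports EEXIST even though the file is exactly where it is expected to be; with the flags recomputed, the same open succeeds.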

[0]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e
[1]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n138
[2]: https://sourceware.org/cgit/glibc/tree/sysdeps/pthread/sem_open.c?id=f6c8204fd7fabf0cf4162eaf10ccf23258e4d10e#n69
[3]: https://inbox<email address hidden>/T/#u

Simon Chopin (schopin) wrote :

I've prepared a version with a variation on Sergio's patch at https://launchpad.net/~schopin/+archive/ubuntu/glibc-2.38-snapshot/+packages (currently building)

Feel free to try it out :)

Sergio Durigan Junior (sergiodj) wrote : Re: [Bug 2031912] Re: glibc 2.38 causes hangs on some openMPI-using packages

On Wednesday, August 23 2023, Simon Chopin wrote:

> I've prepared a version with a variation on Sergio's patch at
> https://launchpad.net/~schopin/+archive/ubuntu/glibc-2.38-snapshot/+packages
> (currently building)

Thanks, Simon.

This fixes the problem with openmpi. IMHO it's ready to be uploaded.

Cheers,

--
Sergio
GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

Simon Chopin (schopin) wrote :

I won't upload before Friday, as I'd like to avoid multiple uploads as much as possible (lots of tests triggered). Hopefully we'll have resolutions on the other 2 bugs by then.

Simon Chopin (schopin) wrote :

We'll hint glibc to unblock most of the archive, and then I'll upload this. Given the size of the migration, it'll likely take several britney runs to move it all. While that happens, I can't upload a new version.

Given all that, and since it's already pretty late on a Friday evening, I'll push the upload to Monday morning (CEST). A good weekend to all involved :)

Simon Chopin (schopin)
Changed in glibc (Ubuntu):
status: New → Fix Committed
Changed in dolfin (Ubuntu):
status: New → Invalid
Changed in h5py (Ubuntu):
status: New → Invalid
Changed in mpi4py (Ubuntu):
status: New → Invalid
Changed in openmpi (Ubuntu):
status: New → Invalid
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.38-1ubuntu4

---------------
glibc (2.38-1ubuntu4) mantic; urgency=medium

  * Import the upstream maintenance branch
  * d/p/lp2031912.patch: Fix regression in sem_open that breaks OpenMPI
    (LP: #2031912)

 -- Simon Chopin <email address hidden> Mon, 28 Aug 2023 17:23:19 +0200

Changed in glibc (Ubuntu):
status: Fix Committed → Fix Released
Benjamin Drung (bdrung)
tags: removed: foundations-todo