Ubuntu
glibc package

s390x autopkgtest regression of libflame vs glibc in Jammy

Bug #2024207 reported by Simon Chopin on 2023-06-16

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	glibc (Ubuntu)	Invalid	Undecided	Simon Chopin
	Jammy	Invalid	Critical	Simon Chopin

Bug Description

The libflame autopkgtests on Jammy are now failing on s390x against glibc 2.35-0ubuntu3.2.

It's triggering a timeout in the numpy-with-libflame test suite. To reproduce, you need python3-numpy, libflame1 and libflame-dev installed.

The issue seems to be in numpy/f2py/tests/test_compile_function.py::test_f2py_init_compile. To be able to investigate this, I had to change /usr/lib/python3/dist-packages/numpy/_pytesttester.py, line 183:

- pytest_args += ["-m", label]
+ pytest_args += ["-k", label]

and then I used the following Python script to reproduce:

#!/usr/bin/python3

import numpy as np
np.test("test_f2py_init_compile", verbose=3)

I haven't managed to go further yet. The only other thing I know is that the bug doesn't seem to trigger when running under strace.

Simon Chopin (schopin) on 2023-06-16

Changed in glibc (Ubuntu Jammy):
importance:	Undecided → Critical
Changed in glibc (Ubuntu):
importance:	Critical → Undecided
Changed in glibc (Ubuntu Jammy):
status:	New → Triaged
Changed in glibc (Ubuntu):
status:	Triaged → Fix Released
tags:	added: regression-proposed update-excuse

Simon Chopin (schopin) on 2023-06-16

Changed in glibc (Ubuntu Jammy):
assignee:	nobody → Simon Chopin (schopin)

Simon Chopin (schopin) on 2023-06-21

description:	updated

Simon Chopin (schopin) wrote on 2023-06-21:

TL;DR: Now the tests pass, but I didn't do a thing.

Long follow-up on this: I was investigating this on a fairly beefy VM (8 cores, 16G RAM), and managed to reproduce the issue quickly with a ~60% hit rate.
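
For anyone wanting to put a number on the hit rate of an intermittent hang like this one, a minimal sketch follows. It is not what was used in this investigation; the function name `hang_rate` and its default values are made up for illustration.

```python
import subprocess
import sys


def hang_rate(cmd, runs=10, timeout=120.0):
    """Run cmd `runs` times and return the fraction of runs that timed out.

    Hypothetical helper: a run counts as a "hit" when it is still running
    after `timeout` seconds. subprocess.run() kills the direct child on
    timeout, but grandchildren (e.g. a compiler started by the child) may
    be left behind and need separate cleanup.
    """
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                timeout=timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                check=False,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs / runs
```

Assuming the reproducer from the bug description is saved as repro.py (a hypothetical file name), `hang_rate([sys.executable, "repro.py"], runs=20, timeout=300.0)` would give a rough estimate. The timeout has to be well above the normal runtime of the test on the machine in question, otherwise slow runs are counted as hangs.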

The test that times out is basically a thin wrapper around a subprocess invocation (via subprocess.run) of a Python interpreter, which itself uses the Python multiprocessing system to execute the Fortran compiler.
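
To make the process layout concrete, here is a rough sketch of the shape described above. It is not numpy's code: the child program, the use of `multiprocessing.pool.ThreadPool`, the file names and the stand-in "compiler" command are all assumptions chosen so that the example runs anywhere.

```python
import subprocess
import sys
import textwrap

# Child program run in a fresh interpreter. This is a stand-in for the real
# build step: the thread pool and the fake compiler command are assumptions.
CHILD = textwrap.dedent("""
    import subprocess
    import sys
    from multiprocessing.pool import ThreadPool

    def compile_one(source):
        # Stand-in for invoking the Fortran compiler on `source`.
        return subprocess.run([sys.executable, "-c", "pass"]).returncode

    if __name__ == "__main__":
        with ThreadPool(4) as pool:
            codes = pool.map(compile_one, ["a.f90", "b.f90"])
        sys.exit(max(codes))
""")


def run_child(timeout=60.0):
    """Outer layer: what the test does, i.e. subprocess.run on an interpreter."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        timeout=timeout,
        capture_output=True,
        text=True,
    )
```

With this layout there are three levels (test process, child interpreter with its pool, compiler processes), so a hang in the child's shutdown path shows up in the test only as a timeout of the outer `subprocess.run` call.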

When the issue occurs, the entire pool of the multiprocessing subprocess is waiting for new tasks, except for a single thread that waits on a kernel semaphore. Since the Python stack for that thread is entirely in the CPython codebase and is in a finalizer, I would guess there's a race condition on releasing a lock on a shared resource, which I'd wager is stdout or similar.
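
One low-overhead way to see where the Python threads of a stuck interpreter are sitting is the standard `faulthandler` module. The sketch below is only a suggestion and not how the stacks above were obtained; the helper names are hypothetical. Note that faulthandler prints Python frames only, so the wait on the kernel semaphore itself would not be visible, and it would have to be enabled inside the interpreter that actually hangs.

```python
import faulthandler
import tempfile


def arm_hang_dump(log_file, seconds=60.0):
    """Dump every thread's Python stack to log_file if we are still running
    after `seconds`. log_file must be a real file (it needs a file
    descriptor) and must stay open until the dump or the cancel."""
    faulthandler.dump_traceback_later(seconds, repeat=False, file=log_file, exit=False)


def disarm_hang_dump():
    """Cancel a pending dump, e.g. once the work finished normally."""
    faulthandler.cancel_dump_traceback_later()


def snapshot_stacks():
    """Dump the Python stacks of all threads right now and return them as text."""
    with tempfile.TemporaryFile(mode="w+") as f:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()
```

Since the hang did not show up under strace, any extra instrumentation may also change the timing enough to hide it, so an empty result here would not prove much.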

Removing the pthread-related patch (bug 2007796) from the glibc SRU didn't improve the situation, even though it was the most likely culprit, so I figured I'd try to reproduce on a VM sized like the ones on the autopkgtest infra (4 cores, 8G RAM, as libflame is marked as big) before trying anything else.

Lo and behold, on that new VM I was unable to reproduce the issue. Puzzled, I asked the nice folks in the QA team if by any chance the doc for the VM sizing was out of date. It's not, and they even kindly gave me access to a VM directly on the infra. I was still unable to reproduce.

Finally, I just re-ran the tests, and now they pass. Comparing the logs, the only difference I could spot is the upgrade of linux-libc-dev from 5.15.0-73.80 to 5.15.0-75.82.
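
Spotting a single version bump between two runs is easier with a small diff of the installed package sets. The sketch below assumes the versions have already been extracted into plain "name version" lines, one package per line; that input format and the function names are assumptions, not the actual autopkgtest log format.

```python
def parse_versions(text):
    """Parse lines of the form "<package> <version>" into a dict.

    The input format is an assumption; lines that do not have exactly
    two fields are skipped.
    """
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def changed_versions(before, after):
    """Return {package: (old, new)} for packages whose version differs.

    Packages present on only one side are reported with None on the other.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

Fed with the package lists of the failing and the passing run, a result containing only the linux-libc-dev entry would match what was seen by eye here.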

Also of note, it turns out those tests have been disabled in subsequent versions in Debian as they're flaky and don't provide much value since numpy isn't compiled with libflame support, so, if the issue comes back, I'll probably ask for them to be hinted.

Changed in glibc (Ubuntu):
status:	Fix Released → Invalid
Changed in glibc (Ubuntu Jammy):
status:	Triaged → Invalid
