Clustalo 1.2.4-6 segfaults on s390x

Bug #1903817 reported by Christian Ehrhardt  on 2020-11-11
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Medium
bugproxy
clustalo (Debian)
Unknown
Unknown
clustalo (Ubuntu)
Medium
Unassigned
gcc-10 (Ubuntu)
Undecided
Unassigned

Bug Description

Hi,
with gcc-10.2 clustalo segfaults on s390x.

First of all I beg your pardon, but I didn't find an upstream bug tracker for custalo but
think you should be aware. But furthermore I think this might eventually be a gcc bug (or at least needs the s390x gcc experts to look at).
I decided to open this bug to track things and have a joint conversation, but then ping the custalo mail about it and let it be mirrored to IBM.

Issue:
I see this with the test used in Debian:
  # Run additional test from python-biopython package to verify that
  # this will work as well
  src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

We run into this segfault:
Thread 9 "clustalo" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fff9ef8870 (LWP 55818)]
0x000002aa000176e2 in PairDistances._omp_fn.0 () at pair_dist.c:353
353 KTuplePairDist((*distmat), mseq, iChunkStarts[iChunk],
(gdb) bt
#0 0x000002aa000176e2 in PairDistances._omp_fn.0 () at pair_dist.c:353
#1 0x000003fffdaa2066 in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:123
#2 0x000003fffd709556 in start_thread (arg=0x3fff9ef8870) at pthread_create.c:463
#3 0x000003fffd921d46 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Debugging showed that this is depending on the optimization, when I build
with -O0 (for debugging) the problem goes away.

A usual build uses -O3 (from the build system) followed by -g -O2 (from the
default Debian build flags). For the time being we can avoid the issue by
setting -O0 there. But I wanted to ask if this is something you could look into?

In valgrind I see this reported as "Invalid read of size 4"

In the backtrace it is:
gdb) p $_siginfo
$3 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {0, -16384, 0 <repeats 26 times>}, _kill = {si_pid = 0, si_uid = 4294950912}, _timer = {si_tid = 0, si_overrun = -16384,
      si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 4294950912, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 4294950912,
      si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0xffffc000}, _sigpoll = {si_band = 4294950912, si_fd = 0}}}

The instructions are
│ 0x2aa000176d6 <PairDistances._omp_fn.0+246> lg %r2,40(%r9) │
│ 0x2aa000176dc <PairDistances._omp_fn.0+252> sllg %r1,%r10,2 │
│ >0x2aa000176e2 <PairDistances._omp_fn.0+258> lgf %r5,0(%r1,%r3) │
│ 0x2aa000176e8 <PairDistances._omp_fn.0+264> lgf %r4,0(%r1,%r8)

So it tries to load from
r3 0xffffcf80 4294954880
+ r1 0x24 36
into r5

And that matches the segfault address of si_addr = 0xffffc000

@IBM
to reproduce:
1. get an Ubuntu 20.10 system on s390x (or anything with gcc-10.2 while OTOH it seems gcc-10 was fine).
2. edit /etc/apt/sources.list
  2a) add deb-src lines to be able to get the source
  2b) enable proposed to be able to get custalo 1.2.4-6
3. run the build
  $ ./debian/rules build
This will end in the crash that is to debug.

@Custalo people:
If you need s390x system access please check out the IBM Community cloud [1][2]
which should give you a free VM.

[1]: https://developer.ibm.com/components/ibm-linuxone/gettingstarted/?_ga=2.85909726.636290536.1605082467-259352313.1597225455
[2]: https://zcloud.marist.edu/#/login

tags: added: update-excuse

FYI: Temporary mitigation forwarded to Debian clustalo packaging at https://salsa.debian.org/med-team/clustalo/-/merge_requests/1

@Frank - could you get this properly tagged and mirrored so that the IBM gcc people could have a look?

Frank Heimes (fheimes) on 2020-11-11
Changed in ubuntu-z-systems:
assignee: nobody → Frank Heimes (fheimes)
assignee: Frank Heimes (fheimes) → bugproxy (bugproxy)
tags: added: reverse-proxy-bugzilla s390x
Frank Heimes (fheimes) on 2020-11-11
Changed in ubuntu-z-systems:
importance: Undecided → High
bugproxy (bugproxy) on 2020-11-11
tags: added: architecture-s39064 bugnameltc-189191 severity-high targetmilestone-inin2004

------- Comment From <email address hidden> 2020-11-12 03:58 EDT-------
Are there any suspicious warnings in the compile log?
What about "-O2 -fno-strict-aliasing"? Does it work with these options?

I haven't seen any suspicious warnings myself, but I can retry with/without -fno-strict-aliasing and attach the log for your own consideration.

And BTW o/ Andreas :-)

TL;DR: "-O2 -fno-strict-aliasing is as broken as -O2, while -O0 still works"
I'll attach logs of the three cases ...

As I said it is 100% reproducible and should be for you as well if you follow my steps I mentioned in the report.

Let me know if you see anything in the logs that you need me to work on, otherwise all that is left for me is wishing you best luck with debugging.

As discussed via IRC here a tarball of my build result (the whole tree)

Executing this in it segfaults:
src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

In a discussion with Andreas we failed to break it in his environment but it reliably failed in mine. Then I found the crash in dmesg, there it looks like this:

[2431454.082712] User process fault: interruption code 003b ilc:2 in clustalo[2aa2d000000+d4000]
[2431454.082725] Failing address: 000000002d0d5000 TEID: 000000002d0d5800
[2431454.082727] Fault in primary space mode while using user ASCE.
[2431454.082729] AS:0000000e6fcb41c7 R3:0000000000000024
[2431454.082734] CPU: 3 PID: 1359979 Comm: clustalo Tainted: P O 5.4.0-51-generic #56-Ubuntu
[2431454.082735] Hardware name: IBM 2964 N63 400 (LPAR)
[2431454.082737] User PSW : 0705200180000000 000002aa2d011028
[2431454.082739] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[2431454.082741] User GPRS: 000003ff00000018 0000000000000002 000000002d0d5560 000002aa2d0ba8e4
[2431454.082742] 000002aa2d0ba8e4 0000000000000007 0000000000000019 000000002d0d5560
[2431454.082743] 0000000000000000 000003ff00000001 000002aa2edaef70 000003ff8a6767a0
[2431454.082744] 000003ff8a62bf98 000002aa2d0baa80 000002aa2d01a130 000003ffcabfc350
[2431454.082753] User Code: 000002aa2d01101a: b9040013 lgr %r1,%r3
                            000002aa2d01101e: d207f0c0b028 mvc 192(8,%r15),40(%r11)
                           #000002aa2d011024: b9040034 lgr %r3,%r4
                           >000002aa2d011028: 584020c0 l %r4,192(%r2)
                            000002aa2d01102c: 1941 cr %r4,%r1
                            000002aa2d01102e: a7240022 brc 2,000002aa2d011072
                            000002aa2d011032: eb110003000d sllg %r1,%r1,3
                            000002aa2d011038: b9080021 agr %r2,%r1
[2431454.082765] Last Breaking-Event-Address:
[2431454.082770] [<000002aa2d01a12a>] 0x2aa2d01a12a

The crash happens in 5.4.0-51-generic and a Hirsute LXD container therein.
But also in a Hirsute VM under 5.8.0-25-generic

On the latter the crash looks like (similar):

[520179.980824] report_user_fault: 5 callbacks suppressed
[520179.980827] User process fault: interruption code 003b ilc:3 in clustalo[2aa05900000+aa000]
[520179.980842] Failing address: 00000000ddf7c000 TEID: 00000000ddf7c800
[520179.980843] Fault in primary space mode while using user ASCE.
[520179.980845] AS:0000000028f281c7 R3:0000000000000024
[520179.980852] CPU: 1 PID: 83822 Comm: clustalo Not tainted 5.8.0-25-generic #26-Ubuntu
[520179.980853] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[520179.980855] User PSW : 0705200180000000 000002aa059176e2
[520179.980857] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[520179.980858] User GPRS: 0000000000000004 0000000000000000 000003ffddf7c838 00000000ddf7c7d0
[520179.980859] 000002aa05e2b1d8 0000000000000001 0000000000000000 000002aa05e29f00
[520179.980860] 000003ffddf7c7e0 000003ffddf7c840 0000000000000000 0000000000000001
[520179.980860] 000003ffbe7abf98 000003ffddf7c998 000002aa05917682 000003ffddf7c568
[520179.980872] User Code: 000002aa059176d0: e330f0c80004 lg %r3,200(%r15)
                           000002aa059176d6: e32090280004 lg %r2,40(%r9)
                          #000002aa059176dc: eb1a0002000d sllg %r1,%r10,2
                          >000002aa059176e2: e35130000014 lgf %r5,0(%r1,%r3)
                           000002aa059176e8: e34180000014 lgf %r4,0(%r1,%r8)
                           000002aa059176ee: 6080f0c0 std %f8,192(%r15)
                           000002aa059176f2: b9040037 lgr %r3,%r7
                           000002aa059176f6: a7ab0001 aghi %r10,1
[520179.980883] Last Breaking-Event-Address:
[520179.980893] [<000003ffbe28d144>] 0x3ffbe28d144

Segfault avoidable (thanks AndiKr) if we set OMP_NUM_THREADS, but that is just hiding the underlying issue which probably is code that isn't s390x compatible.

OMP_NUM_THREADS=1 src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

Never the less, it now feels less an compiler and more like a code issue.
I have updated my MP with the new insight but think we can leave it as-is until resolved in the Debian Packaging.

[1]: https://salsa.debian.org/med-team/clustalo/-/merge_requests/1#note_203049

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-11-20 10:37 EDT-------
The problem boils down to a buffer overflow. Consider the following code snippet from file pair_dist.c

int iChunkStarts[iNumberOfThreads];
int iChunkEnds[iNumberOfThreads];

// ...

for(iChunk = 0; iChunk <= iNumberOfThreads; iChunk++)
{
iChunkEnd = iChunkStart;
if (iChunk == iNumberOfThreads - 1){
iChunkStart = 0;
}
else if (iend == jend){
iChunkStart = iend - ((double)(iend - istart) * sqrt(((double)iChunk + 1.0)/(double)iNumberOfThreads));
}
else {
iChunkStart = iend - (iend - istart) * (iChunk + 1) / (double)(iNumberOfThreads);
}
iChunkStarts[iChunk] = iChunkStart;
iChunkEnds[iChunk] = iChunkEnd;
/*printf("%s:%d: C=%d, ie=%d, is=%d, je=%d, js=%d, Cstart=%d, Cend=%d, diff=%d\n",
__FILE__, __LINE__, iChunk, iend, istart, jend, jstart, iChunkStart, iChunkEnd, iChunkEnd-iChunkStart);*/
}

if (PAIRDIST_KTUPLE == pairdist_type) {

Log(&rLog, LOG_INFO, "Calculating pairwise ktuple-distances...");

NewProgress(&prProgress, LogGetFP(&rLog, LOG_INFO),
"Ktuple-distance calculation progress", bPrintCR);
#ifdef HAVE_OPENMP
#pragma omp parallel for private(iChunk) schedule(dynamic)
#endif
for(iChunk = 0; iChunk < iNumberOfThreads; iChunk++)
{
KTuplePairDist((*distmat), mseq, iChunkStarts[iChunk],
iChunkEnds[iChunk], jstart, jend, NULL, prProgress,
&ulStepNo, ulTotalStepNo);
}

The first loop writes behind VLAs iChunkStarts and iChunkEnds.
The closure for the upcoming OMP loop

.omp_data_o.21.iChunkEnds.3 = iChunkEnds.3_72;
.omp_data_o.21.iChunkStarts.1 = iChunkStarts.1_70;
.omp_data_o.21.ulTotalStepNo = ulTotalStepNo_81;
.omp_data_o.21.jend = jend_77(D);
.omp_data_o.21.jstart = jstart_91(D);
.omp_data_o.21.mseq = mseq_93(D);
.omp_data_o.21.distmat = distmat_78(D);
.omp_data_o.21.ulStepNo = &ulStepNo;
.omp_data_o.21.prProgress = &prProgress;
__builtin_GOMP_parallel (PairDistances._omp_fn.0, &.omp_data_o.21, 0, 0);

is placed adjacent to those VLAs iChunkStarts and iChunkEnds on S/390. The overflow results in overwriting the high part of

.omp_data_o.21.iChunkEnds.3

which finally results in a SEGVAULT once the corresponding OMP function is executed and the pointer is dereferenced.

Long story short:

diff --git a/src/clustal/pair_dist.c.orig b/src/clustal/pair_dist.c
index e6dbdc3..bb79e61 100644
--- a/src/clustal/pair_dist.c.orig
+++ b/src/clustal/pair_dist.c
@@ -321,7 +321,7 @@ PairDistances(symmatrix_t **distmat, mseq_t *mseq, int pairdist_type, bool bPerc

/* FIXME: can get rid of iChunkStart, iChunkEnd now that we're using the arrays */
iChunkStart = iend;
- for(iChunk = 0; iChunk <= iNumberOfThreads; iChunk++)
+ for(iChunk = 0; iChunk < iNumberOfThreads; iChunk++)
{
iChunkEnd = iChunkStart;
if (iChunk == iNumberOfThreads - 1){

should fix this.

Thank you Andreas and Stefan,
I've pinged the Debian MR [1] and upstream (via Mail which you are on CC) about it.
Let us see how things evolve from here.

[1]: https://salsa.debian.org/med-team/clustalo/-/merge_requests/1

Changed in gcc-10 (Ubuntu):
status: New → Invalid
Changed in clustalo (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Frank Heimes (fheimes) on 2020-11-23
Changed in ubuntu-z-systems:
importance: High → Medium
status: New → Triaged

FYI: Also filed as Debian bug (linked here) and got a response on the Debian MR.

FYI fixed in 1.2.4-7 which is coming our way via auto-sync

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.