Ubuntu
clustalo package

Clustalo 1.2.4-6 segfaults on s390x

Bug #1903817 reported by Christian Ehrhardt  on 2020-11-11

This bug affects 1 person

	Status	Importance	Assigned to
Ubuntu on IBM z Systems	Fix Released	Medium	bugproxy
clustalo (Debian)	Fix Released	Undecided	Unassigned
clustalo (Ubuntu)	Fix Released	Medium	Unassigned
gcc-10 (Ubuntu)	Invalid	Undecided	Unassigned

Bug Description

Hi,
with gcc-10.2 clustalo segfaults on s390x.

First of all I beg your pardon, but I didn't find an upstream bug tracker for custalo but
think you should be aware. But furthermore I think this might eventually be a gcc bug (or at least needs the s390x gcc experts to look at).
I decided to open this bug to track things and have a joint conversation, but then ping the custalo mail about it and let it be mirrored to IBM.

Issue:
I see this with the test used in Debian:
  # Run additional test from python-biopython package to verify that
  # this will work as well
  src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

We run into this segfault:
Thread 9 "clustalo" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fff9ef8870 (LWP 55818)]
0x000002aa000176e2 in PairDistances._omp_fn.0 () at pair_dist.c:353
353 KTuplePairDist((*distmat), mseq, iChunkStarts[iChunk],
(gdb) bt
#0 0x000002aa000176e2 in PairDistances._omp_fn.0 () at pair_dist.c:353
#1 0x000003fffdaa2066 in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:123
#2 0x000003fffd709556 in start_thread (arg=0x3fff9ef8870) at pthread_create.c:463
#3 0x000003fffd921d46 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Debugging showed that this is depending on the optimization, when I build
with -O0 (for debugging) the problem goes away.

A usual build uses -O3 (from the build system) followed by -g -O2 (from the
default Debian build flags). For the time being we can avoid the issue by
setting -O0 there. But I wanted to ask if this is something you could look into?

In valgrind I see this reported as "Invalid read of size 4"

In the backtrace it is:
gdb) p $_siginfo
$3 = {si_signo = 11, si_errno = 0, si_code = 1, _sifields = {_pad = {0, -16384, 0 <repeats 26 times>}, _kill = {si_pid = 0, si_uid = 4294950912}, _timer = {si_tid = 0, si_overrun = -16384,
si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 0, si_uid = 4294950912, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 4294950912,
si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0xffffc000}, _sigpoll = {si_band = 4294950912, si_fd = 0}}}

The instructions are
│ 0x2aa000176d6 <PairDistances._omp_fn.0+246> lg %r2,40(%r9) │
│ 0x2aa000176dc <PairDistances._omp_fn.0+252> sllg %r1,%r10,2 │
│ >0x2aa000176e2 <PairDistances._omp_fn.0+258> lgf %r5,0(%r1,%r3) │
│ 0x2aa000176e8 <PairDistances._omp_fn.0+264> lgf %r4,0(%r1,%r8)

So it tries to load from
r3 0xffffcf80 4294954880
+ r1 0x24 36
into r5

And that matches the segfault address of si_addr = 0xffffc000

@IBM
to reproduce:
1. get an Ubuntu 20.10 system on s390x (or anything with gcc-10.2 while OTOH it seems gcc-10 was fine).
2. edit /etc/apt/sources.list
  2a) add deb-src lines to be able to get the source
  2b) enable proposed to be able to get custalo 1.2.4-6
3. run the build
  $ ./debian/rules build
This will end in the crash that is to debug.

@Custalo people:
If you need s390x system access please check out the IBM Community cloud [1][2]
which should give you a free VM.

[1]: https://developer.ibm.com/components/ibm-linuxone/gettingstarted/?_ga=2.85909726.636290536.1605082467-259352313.1597225455
[2]: https://zcloud.marist.edu/#/login

Tags:

Christian Ehrhardt  (paelzer) on 2020-11-11

tags:

added: update-excuse

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-11:

FYI: Temporary mitigation forwarded to Debian clustalo packaging at https://salsa.debian.org/med-team/clustalo/-/merge_requests/1

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-11:

@Frank - could you get this properly tagged and mirrored so that the IBM gcc people could have a look?

Frank Heimes (fheimes) on 2020-11-11

Changed in ubuntu-z-systems:
assignee:	nobody → Frank Heimes (fheimes)
assignee:	Frank Heimes (fheimes) → bugproxy (bugproxy)
tags:	added: reverse-proxy-bugzilla s390x

Frank Heimes (fheimes) on 2020-11-11

Changed in ubuntu-z-systems:
importance:	Undecided → High

bugproxy (bugproxy) on 2020-11-11

tags:

added: architecture-s39064 bugnameltc-189191 severity-high targetmilestone-inin2004

Revision history for this message

bugproxy (bugproxy) wrote on 2020-11-12: Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-11-12 03:58 EDT-------
Are there any suspicious warnings in the compile log?
What about "-O2 -fno-strict-aliasing"? Does it work with these options?

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-16:

I haven't seen any suspicious warnings myself, but I can retry with/without -fno-strict-aliasing and attach the log for your own consideration.

And BTW o/ Andreas :-)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

TL;DR: "-O2 -fno-strict-aliasing is as broken as -O2, while -O0 still works"
I'll attach logs of the three cases ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

buildlog O0 - working Edit (63.6 KiB, text/plain)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

buildlog O2 - failing Edit (63.4 KiB, text/plain)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

buildlog O2 fno-strict-aliasing - failing as well Edit (67.1 KiB, text/plain)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

As I said it is 100% reproducible and should be for you as well if you follow my steps I mentioned in the report.

Let me know if you see anything in the logs that you need me to work on, otherwise all that is left for me is wishing you best luck with debugging.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

#10

tar of the built tree that contains the failing binary Edit (9.7 MiB, application/x-tar)

As discussed via IRC here a tarball of my build result (the whole tree)

Executing this in it segfaults:
src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

#11

In a discussion with Andreas we failed to break it in his environment but it reliably failed in mine. Then I found the crash in dmesg, there it looks like this:

[2431454.082712] User process fault: interruption code 003b ilc:2 in clustalo[2aa2d000000+d4000]
[2431454.082725] Failing address: 000000002d0d5000 TEID: 000000002d0d5800
[2431454.082727] Fault in primary space mode while using user ASCE.
[2431454.082729] AS:0000000e6fcb41c7 R3:0000000000000024
[2431454.082734] CPU: 3 PID: 1359979 Comm: clustalo Tainted: P O 5.4.0-51-generic #56-Ubuntu
[2431454.082735] Hardware name: IBM 2964 N63 400 (LPAR)
[2431454.082737] User PSW : 0705200180000000 000002aa2d011028
[2431454.082739] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[2431454.082741] User GPRS: 000003ff00000018 0000000000000002 000000002d0d5560 000002aa2d0ba8e4
[2431454.082742] 000002aa2d0ba8e4 0000000000000007 0000000000000019 000000002d0d5560
[2431454.082743] 0000000000000000 000003ff00000001 000002aa2edaef70 000003ff8a6767a0
[2431454.082744] 000003ff8a62bf98 000002aa2d0baa80 000002aa2d01a130 000003ffcabfc350
[2431454.082753] User Code: 000002aa2d01101a: b9040013 lgr %r1,%r3
                            000002aa2d01101e: d207f0c0b028 mvc 192(8,%r15),40(%r11)
                           #000002aa2d011024: b9040034 lgr %r3,%r4
                           >000002aa2d011028: 584020c0 l %r4,192(%r2)
                            000002aa2d01102c: 1941 cr %r4,%r1
                            000002aa2d01102e: a7240022 brc 2,000002aa2d011072
                            000002aa2d011032: eb110003000d sllg %r1,%r1,3
                            000002aa2d011038: b9080021 agr %r2,%r1
[2431454.082765] Last Breaking-Event-Address:
[2431454.082770] [<000002aa2d01a12a>] 0x2aa2d01a12a

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

#12

The crash happens in 5.4.0-51-generic and a Hirsute LXD container therein.
But also in a Hirsute VM under 5.8.0-25-generic

On the latter the crash looks like (similar):

[520179.980824] report_user_fault: 5 callbacks suppressed
[520179.980827] User process fault: interruption code 003b ilc:3 in clustalo[2aa05900000+aa000]
[520179.980842] Failing address: 00000000ddf7c000 TEID: 00000000ddf7c800
[520179.980843] Fault in primary space mode while using user ASCE.
[520179.980845] AS:0000000028f281c7 R3:0000000000000024
[520179.980852] CPU: 1 PID: 83822 Comm: clustalo Not tainted 5.8.0-25-generic #26-Ubuntu
[520179.980853] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[520179.980855] User PSW : 0705200180000000 000002aa059176e2
[520179.980857] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
[520179.980858] User GPRS: 0000000000000004 0000000000000000 000003ffddf7c838 00000000ddf7c7d0
[520179.980859] 000002aa05e2b1d8 0000000000000001 0000000000000000 000002aa05e29f00
[520179.980860] 000003ffddf7c7e0 000003ffddf7c840 0000000000000000 0000000000000001
[520179.980860] 000003ffbe7abf98 000003ffddf7c998 000002aa05917682 000003ffddf7c568
[520179.980872] User Code: 000002aa059176d0: e330f0c80004 lg %r3,200(%r15)
                           000002aa059176d6: e32090280004 lg %r2,40(%r9)
                          #000002aa059176dc: eb1a0002000d sllg %r1,%r10,2
                          >000002aa059176e2: e35130000014 lgf %r5,0(%r1,%r3)
                           000002aa059176e8: e34180000014 lgf %r4,0(%r1,%r8)
                           000002aa059176ee: 6080f0c0 std %f8,192(%r15)
                           000002aa059176f2: b9040037 lgr %r3,%r7
                           000002aa059176f6: a7ab0001 aghi %r10,1
[520179.980883] Last Breaking-Event-Address:
[520179.980893] [<000003ffbe28d144>] 0x3ffbe28d144

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

#13

Core dump gatheres in a Hirsute s390x VM with 5.8.0-25-generic - matches the source/build tarball formerly attached Edit (95.0 KiB, application/x-tar)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-17:

#14

Segfault avoidable (thanks AndiKr) if we set OMP_NUM_THREADS, but that is just hiding the underlying issue which probably is code that isn't s390x compatible.

OMP_NUM_THREADS=1 src/clustalo -i debian/tests/biopython_testdata/f002 --guidetree-out temp_test.dnd -o temp_test.aln --outfmt clustal --force

Never the less, it now feels less an compiler and more like a code issue.
I have updated my MP with the new insight but think we can leave it as-is until resolved in the Debian Packaging.

[1]: https://salsa.debian.org/med-team/clustalo/-/merge_requests/1#note_203049

Revision history for this message

bugproxy (bugproxy) wrote on 2020-11-20:

#15

------- Comment From <email address hidden> 2020-11-20 10:37 EDT-------
The problem boils down to a buffer overflow. Consider the following code snippet from file pair_dist.c

int iChunkStarts[iNumberOfThreads];
int iChunkEnds[iNumberOfThreads];

// ...

for(iChunk = 0; iChunk <= iNumberOfThreads; iChunk++)
{
iChunkEnd = iChunkStart;
if (iChunk == iNumberOfThreads - 1){
iChunkStart = 0;
}
else if (iend == jend){
iChunkStart = iend - ((double)(iend - istart) * sqrt(((double)iChunk + 1.0)/(double)iNumberOfThreads));
}
else {
iChunkStart = iend - (iend - istart) * (iChunk + 1) / (double)(iNumberOfThreads);
}
iChunkStarts[iChunk] = iChunkStart;
iChunkEnds[iChunk] = iChunkEnd;
/*printf("%s:%d: C=%d, ie=%d, is=%d, je=%d, js=%d, Cstart=%d, Cend=%d, diff=%d\n",
__FILE__, __LINE__, iChunk, iend, istart, jend, jstart, iChunkStart, iChunkEnd, iChunkEnd-iChunkStart);*/
}

if (PAIRDIST_KTUPLE == pairdist_type) {

Log(&rLog, LOG_INFO, "Calculating pairwise ktuple-distances...");

NewProgress(&prProgress, LogGetFP(&rLog, LOG_INFO),
"Ktuple-distance calculation progress", bPrintCR);
#ifdef HAVE_OPENMP
#pragma omp parallel for private(iChunk) schedule(dynamic)
#endif
for(iChunk = 0; iChunk < iNumberOfThreads; iChunk++)
{
KTuplePairDist((*distmat), mseq, iChunkStarts[iChunk],
iChunkEnds[iChunk], jstart, jend, NULL, prProgress,
&ulStepNo, ulTotalStepNo);
}

The first loop writes behind VLAs iChunkStarts and iChunkEnds.
The closure for the upcoming OMP loop

.omp_data_o.21.iChunkEnds.3 = iChunkEnds.3_72;
.omp_data_o.21.iChunkStarts.1 = iChunkStarts.1_70;
.omp_data_o.21.ulTotalStepNo = ulTotalStepNo_81;
.omp_data_o.21.jend = jend_77(D);
.omp_data_o.21.jstart = jstart_91(D);
.omp_data_o.21.mseq = mseq_93(D);
.omp_data_o.21.distmat = distmat_78(D);
.omp_data_o.21.ulStepNo = &ulStepNo;
.omp_data_o.21.prProgress = &prProgress;
__builtin_GOMP_parallel (PairDistances._omp_fn.0, &.omp_data_o.21, 0, 0);

is placed adjacent to those VLAs iChunkStarts and iChunkEnds on S/390. The overflow results in overwriting the high part of

.omp_data_o.21.iChunkEnds.3

which finally results in a SEGVAULT once the corresponding OMP function is executed and the pointer is dereferenced.

Long story short:

diff --git a/src/clustal/pair_dist.c.orig b/src/clustal/pair_dist.c
index e6dbdc3..bb79e61 100644
--- a/src/clustal/pair_dist.c.orig
+++ b/src/clustal/pair_dist.c
@@ -321,7 +321,7 @@ PairDistances(symmatrix_t **distmat, mseq_t *mseq, int pairdist_type, bool bPerc

/* FIXME: can get rid of iChunkStart, iChunkEnd now that we're using the arrays */
iChunkStart = iend;
- for(iChunk = 0; iChunk <= iNumberOfThreads; iChunk++)
+ for(iChunk = 0; iChunk < iNumberOfThreads; iChunk++)
{
iChunkEnd = iChunkStart;
if (iChunk == iNumberOfThreads - 1){

should fix this.

------- Comment From Stefan.Schulze.Frielinghaus@ibm.com 2020-11-20 10:37 EDT-------
The problem boils down to a buffer overflow.  Consider the following code snippet from file pair_dist.c

int iChunkStarts[iNumberOfThreads];
int iChunkEnds[iNumberOfThreads];

// ...

if (PAIRDIST_KTUPLE == pairdist_type) {

Log(&rLog, LOG_INFO, "Calculating pairwise ktuple-distances...");

The first loop writes behind VLAs iChunkStarts and iChunkEnds.
The closure for the upcoming OMP loop

is placed adjacent to those VLAs iChunkStarts and iChunkEnds on S/390.  The overflow results in overwriting the high part of

.omp_data_o.21.iChunkEnds.3

which finally results in a SEGVAULT once the corresponding OMP function is executed and the pointer is dereferenced.

Long story short:

/* FIXME: can get rid of iChunkStart, iChunkEnd now that we're using the arrays */
iChunkStart = iend;
-        for(iChunk = 0; iChunk <= iNumberOfThreads; iChunk++)
+        for(iChunk = 0; iChunk < iNumberOfThreads; iChunk++)
{
iChunkEnd = iChunkStart;
if (iChunk == iNumberOfThreads - 1){

should fix this.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-23:

#16

Thank you Andreas and Stefan,
I've pinged the Debian MR [1] and upstream (via Mail which you are on CC) about it.
Let us see how things evolve from here.

[1]: https://salsa.debian.org/med-team/clustalo/-/merge_requests/1

Changed in gcc-10 (Ubuntu):
status:	New → Invalid
Changed in clustalo (Ubuntu):
status:	New → Triaged
importance:	Undecided → Medium

Frank Heimes (fheimes) on 2020-11-23

Changed in ubuntu-z-systems:
importance:	High → Medium
status:	New → Triaged

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-23:

#17

FYI: Also filed as Debian bug (linked here) and got a response on the Debian MR.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-11-23:

#18

FYI fixed in 1.2.4-7 which is coming our way via auto-sync

Revision history for this message

Michael R. Crusoe (misterc) wrote on 2020-12-09:

#19

Fixed in Debain unstable clustalo 1.2.4-7

Changed in clustalo (Debian):
importance:	Unknown → Undecided
status:	Unknown → New
status:	New → Fix Released

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2020-12-09:

#20

Thanks - build and migrated in Ubuntu as well
https://launchpad.net/ubuntu/+source/clustalo/1.2.4-7

Changed in clustalo (Ubuntu):
status:	Triaged → Fix Released

Frank Heimes (fheimes) on 2020-12-10

Changed in ubuntu-z-systems:
status:	Triaged → Fix Released

Revision history for this message

bugproxy (bugproxy) wrote on 2020-12-10:

#21

------- Comment From <email address hidden> 2020-12-10 04:58 EDT-------
IBM Bugzilla status->closed, Fix Released with hirsuite

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

debbugs #975511
[done normal] Edit

Bug watches keep track of this bug in other bug trackers.

Ubuntuclustalo package

Clustalo 1.2.4-6 segfaults on s390x

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
clustalo package