slurmdbd segfaults on armhf

Bug #2059131 reported by Danilo Egea Gondolfo
This bug affects 1 person
Affects: slurm-wlm (Ubuntu)
Status: New
Importance: High
Assigned to: Unassigned

Bug Description

The slurmdbd daemon segfaults on armhf.

This is from the autopkgtest logs:

613s × slurmdbd.service - Slurm DBD accounting daemon
613s Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; preset: enabled)
613s Active: failed (Result: core-dump) since Tue 2024-03-26 09:13:24 UTC; 19s ago
613s Duration: 627ms
613s Docs: man:slurmdbd(8)
613s Process: 3801 ExecStart=/usr/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS (code=dumped, signal=SEGV)
613s Main PID: 3801 (code=dumped, signal=SEGV)
613s CPU: 28ms
613s
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor systemd[1]: Started slurmdbd.service - Slurm DBD accounting daemon.
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor (slurmdbd)[3801]: slurmdbd.service: Referenced but unset environment variable evaluates to an empty string: SLURMDBD_OPTIONS
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor slurmdbd[3801]: slurmdbd: error: Unable to open pidfile `/run/slurmdbd.pid': Permission denied
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor slurmdbd[3801]: slurmdbd: Not running as root. Can't drop supplementary groups
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor slurmdbd[3801]: slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.5-10.11.7-MariaDB-2ubuntu1
613s Mar 26 09:13:23 autopkgtest-lxd-kbbzor slurmdbd[3801]: slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
613s Mar 26 09:13:24 autopkgtest-lxd-kbbzor systemd[1]: slurmdbd.service: Main process exited, code=dumped, status=11/SEGV
613s Mar 26 09:13:24 autopkgtest-lxd-kbbzor systemd[1]: slurmdbd.service: Failed with result 'core-dump'.
614s autopkgtest [09:13:44]: test sacct: -----------------------]
617s sacct FAIL non-zero exit status 3

Trying to run the binary in an armhf LXD container also fails:

root@autopkgtest-lxd-gzeypl:/tmp/autopkgtest.VdD4R7/build.sTW/src# slurmdbd -D
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.5-10.11.7-MariaDB-2ubuntu1
slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
Segmentation fault

The database settings error seems to be unrelated, as it also appears on other architectures.

Running it under gdb, I get the following stack trace:

(gdb) bt
#0 __GI_strlen () at ../sysdeps/arm/armv6t2/strlen.S:126
#1 0xf7d8927a in __printf_buffer (buf=buf@entry=0xfffee780, format=<optimized out>, ap=..., mode_flags=mode_flags@entry=2) at vfprintf-process-arg.c:435
#2 0xf7d9cd26 in __vsnprintf_internal (string=<optimized out>, maxlen=maxlen@entry=100, format=<optimized out>, args=..., args@entry=..., mode_flags=2) at vsnprintf.c:96
#3 0xf7e02bba in ___vsnprintf_chk (s=<optimized out>, maxlen=maxlen@entry=100, flag=flag@entry=2, slen=slen@entry=4294967295, format=<optimized out>, ap=...) at vsnprintf_chk.c:34
#4 0xf7f62f2a in vsnprintf (__ap=..., __fmt=0xf7c6dc4c "insert into %s (creation_time, mod_time, table_name, definition) values (%ld, %ld, '%s', '%s') on duplicate key update definition='%s', mod_time=%ld;",
    __n=100, __s=<optimized out>) at /usr/include/arm-linux-gnueabihf/bits/stdio2.h:68
#5 _xstrdup_vprintf (str=str@entry=0xfffee87c,
    fmt=fmt@entry=0xf7c6dc4c "insert into %s (creation_time, mod_time, table_name, definition) values (%ld, %ld, '%s', '%s') on duplicate key update definition='%s', mod_time=%ld;", ap=ap@entry=...)
    at ../../../src/common/xstring.c:799
#6 0xf7f6384c in xstrdup_printf (fmt=0xf7c6dc4c "insert into %s (creation_time, mod_time, table_name, definition) values (%ld, %ld, '%s', '%s') on duplicate key update definition='%s', mod_time=%ld;")
    at ../../../src/common/xstring.c:511
#7 0xf7c58ed4 in _mysql_make_table_current (ending=<optimized out>, fields=0x0, table_name=0xf7c60020 "convert_version_table", mysql_conn=0x415468) at ../../../src/database/mysql_common.c:667
#8 mysql_db_create_table (mysql_conn=mysql_conn@entry=0x415468, table_name=0xf7c60020 "convert_version_table", fields=<optimized out>, fields@entry=0xf7c5f220, ending=<optimized out>)
    at ../../../src/database/mysql_common.c:1190
#9 0xf7c238f6 in _as_mysql_acct_check_tables (mysql_conn=0x0) at ../../../../../src/plugins/accounting_storage/mysql/accounting_storage_mysql.c:887
#10 init () at ../../../../../src/plugins/accounting_storage/mysql/accounting_storage_mysql.c:2892
#11 0xf7eebcae in plugin_load_from_file (p=0xfffeef80, p@entry=0xfffeefb0, fq_path=<optimized out>) at ../../../src/common/plugin.c:173
#12 0xf7eebec6 in plugin_load_and_link (type_name=<optimized out>, n_syms=0, n_syms@entry=80, names=0x0, names@entry=0xf7fca01c <syms>, ptrs=0x1, ptrs@entry=0xf7fccf44 <ops>) at ../../../src/common/plugin.c:233
#13 0xf7eec038 in plugin_context_create (plugin_type=plugin_type@entry=0xf7fb392c "accounting_storage", uler_type=<optimized out>, ptrs=ptrs@entry=0xf7fccf44 <ops>, names=0xf7fca01c <syms>,
    names_size=names_size@entry=320) at ../../../src/common/plugin.c:364
#14 0xf7f64114 in acct_storage_g_init () at ../../../src/interfaces/accounting_storage.c:345
#15 0x00404460 in main (argc=<optimized out>, argv=0xfffef5a4) at ../../../src/slurmdbd/slurmdbd.c:167

It crashes inside strlen.S.

These are the initial parameters passed to vsnprintf:

(gdb) frame 7
#7 0xf7c58ed4 in _mysql_make_table_current (ending=<optimized out>, fields=0x0, table_name=0xf7c60020 "convert_version_table", mysql_conn=0x415468) at ../../../src/database/mysql_common.c:667
667 query2 = xstrdup_printf("insert into %s (creation_time, "
(gdb) l
662 if (mysql_db_query(mysql_conn, query)) {
663 xfree(query);
664 return SLURM_ERROR;
665 }
666 quoted = slurm_add_slash_to_quotes(correct_query);
667 query2 = xstrdup_printf("insert into %s (creation_time, "
668 "mod_time, table_name, definition) "
669 "values (%ld, %ld, '%s', '%s') "
670 "on duplicate key update "
671 "definition='%s', mod_time=%ld;",
(gdb) l
672 table_defs_table, now, now,
673 table_name, quoted,
674 quoted, now);
675 xfree(quoted);
676 debug3("query\n%s", query2);
677 if (mysql_db_query(mysql_conn, query2)) {
678 xfree(query2);
679 return SLURM_ERROR;
680 }
681 xfree(query2);

The final string is only partially generated:

(gdb) frame 5
#5 _xstrdup_vprintf (str=str@entry=0xfffee87c,
    fmt=fmt@entry=0xf7c6dc4c "insert into %s (creation_time, mod_time, table_name, definition) values (%ld, %ld, '%s', '%s') on duplicate key update definition='%s', mod_time=%ld;", ap=ap@entry=...)
    at ../../../src/common/xstring.c:799
799 n = vsnprintf(p, size, fmt, our_ap);
(gdb) p p
$1 = 0x447a10 "insert into table_defs_table (creation_time, mod_time, table_name, definition) values (1711462045, "

It crashes at this offset in the strlen.S code:

(gdb) frame 0
Download failed: Invalid argument. Continuing without source file ./string/../sysdeps/arm/armv6t2/strlen.S.
#0 __GI_strlen () at ../sysdeps/arm/armv6t2/strlen.S:126
warning: 126 ../sysdeps/arm/armv6t2/strlen.S: No such file or directory
(gdb) i r pc
pc 0xf7db752e 0xf7db752e <__GI_strlen+173>

(gdb) disas
Dump of assembler code for function __GI_strlen:
...
   0xf7db7525 <+163>: ldrd r4, r5, [sp], #8
   0xf7db7529 <+167>: add.w r0, r0, r2, lsr #3
   0xf7db752d <+171>: bx lr
   0xf7db752f <+173>: ldrd r2, r3, [r1]
   0xf7db7533 <+177>: and.w r5, r4, #3
   0xf7db7537 <+181>: rsb r0, r4, #0
...

It tries to load from the address held in r1, but r1 appears to contain a timestamp that is supposed to be part of the final string (the actual timestamp seems to be 1711462045):

(gdb) i r r1
r1 0x6602d698 1711462040
(gdb) x 0x6602d698
0x6602d698: Cannot access memory at address 0x6602d698
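
One possible reading of the gap between r1 (1711462040) and the timestamp in the partial string (1711462045): the ldrd at the faulting pc is a doubleword load, and if this strlen first rounds its argument down to an 8-byte boundary (an assumption about the implementation, not checked against the strlen.S source here), then a pointer argument equal to the timestamp would end up in r1 as exactly the observed value. The sketch below only checks that arithmetic; the helper name align_down_8 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: round a 32-bit address down to the previous
 * 8-byte boundary, as an aligned doubleword load would require. */
uint32_t align_down_8(uint32_t addr)
{
    return addr & ~(uint32_t)7;
}
```

With this helper, align_down_8(1711462045) gives 0x6602d698, which matches the r1 value shown above.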

Danilo Egea Gondolfo (danilogondolfo) wrote :

The variable "now" used in xstrdup_printf is defined as "time_t now = time(NULL);" hmmmm

Danilo Egea Gondolfo (danilogondolfo) wrote :

The problem is that the format string uses %ld, but with the new time_t on armhf the value is now a long long int and needs %lld:

query2 = xstrdup_printf("insert into %s (creation_time, "
 "mod_time, table_name, definition) "
 "values (%ld, %ld, '%s', '%s') "
 "on duplicate key update "
 "definition='%s', mod_time=%ld;",
 table_defs_table, now, now,
 table_name, quoted,
 quoted, now);
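
To illustrate why a mismatched %ld can turn into a bad pointer for a later %s, here is a simplified model of the variadic argument area, not the real calling convention. It assumes a 32-bit little-endian ABI where pointers and long take one 32-bit slot and a 64-bit time_t takes two, and it ignores any alignment padding. The function names and the table_defs_table pointer value used in the usage note are hypothetical; 0xf7c60020 is the table_name pointer from the backtrace.

```c
#include <stddef.h>
#include <stdint.h>

/* Append a 32-bit argument (a pointer or a long in this model). */
size_t push_u32(uint32_t *slots, size_t n, uint32_t value)
{
    slots[n] = value;
    return n + 1;
}

/* Append a 64-bit argument as two little-endian 32-bit slots. */
size_t push_u64(uint32_t *slots, size_t n, uint64_t value)
{
    slots[n] = (uint32_t)value;             /* low word  */
    slots[n + 1] = (uint32_t)(value >> 32); /* high word */
    return n + 2;
}

/* Model the first four variadic arguments of the xstrdup_printf call:
 * table_defs_table, now, now, table_name. Returns the slot count.
 * The caller must provide room for at least 6 slots. */
size_t model_vararg_slots(uint32_t *slots, uint32_t table_defs_ptr,
                          uint64_t now, uint32_t table_name_ptr)
{
    size_t n = 0;
    n = push_u32(slots, n, table_defs_ptr);
    n = push_u64(slots, n, now);
    n = push_u64(slots, n, now);
    n = push_u32(slots, n, table_name_ptr);
    return n;
}
```

Calling model_vararg_slots(slots, 0x1000, 1711462045, 0xf7c60020) fills six slots. A reader that consumes one slot per conversion for "%s (%ld, %ld, '%s'" takes slots[1] (1711462045) for the first %ld, slots[2] (the zero high word) for the second, and slots[3] (0x6602d69d, the low word of the second timestamp) as the pointer for '%s'. Under these assumptions that is consistent with strlen faulting on an address derived from the timestamp.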

Steve Langasek (vorlon)
tags: added: time-t update-excuse
Danilo Egea Gondolfo (danilogondolfo) wrote :

There are lots of instances of this issue. After fixing the one above, it breaks in a different place:

root@autopkgtest-lxd-bqwqyd:~# slurmdbd -D
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.5-10.11.7-MariaDB-2ubuntu1
slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout
slurmdbd: error: mysql_query failed: 167 Out of range value for column 'id' at row 3
insert into tres_table (creation_time, id, deleted, type) values (1711475131, 0, 0, 'cpu'), (1, -140242792, 0, 'mem'), (1711475131, 0, 0, 'energy'), (2, 0, 0, 'node'), (1711475131, 0, 0, 'billing'), (3, 0, 0, 'vmem'), (1711475131, 0, 0, 'pages'), (4, 0, 1, 'dynamic_offset') on duplicate key update deleted=VALUES(deleted), type=VALUES(type), id=VALUES(id);
slurmdbd: fatal: problem adding static tres

Steve Langasek (vorlon) wrote :

Tests are being ignored, so this will be promoted to the release pocket; marking as High for fixing or binary removal.

Changed in slurm-wlm (Ubuntu):
importance: Undecided → High
Danilo Egea Gondolfo (danilogondolfo) wrote :

I've got the (non-flaky) tests passing:

autopkgtest [10:47:08]: @@@@@@@@@@@@@@@@@@@@ summary
srun PASS
sbatch FLAKY non-zero exit status 1
sacct PASS
mpi PASS

but the patch is getting ridiculous... I had to change %ld to %lld in 42 files. As I'm not changing the variable types, the patched calls will produce warnings on non-armhf architectures, where time_t is not a long long int. In the end the types have the same size there, but still...

I'm inclined to request a binary package removal on armhf.
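
For reference, a common way to avoid both the crash and the format warnings mentioned above is to cast the time_t value at the call site, so the argument always matches %lld regardless of how wide time_t is on a given architecture. This is a minimal sketch of that idiom using plain snprintf, not the patch described in this report; the helper name format_mod_time is hypothetical and not part of Slurm.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format a time_t portably by casting it to
 * long long, so the argument type matches %lld on every architecture.
 * Returns the snprintf result (characters that would be written). */
int format_mod_time(char *buf, size_t size, time_t when)
{
    return snprintf(buf, size, "mod_time=%lld;", (long long)when);
}
```

The cast still has to be added at every call site that prints a time_t, so it does not reduce the number of places to touch; it only keeps the format string and the argument type in agreement everywhere.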
