ucx library fails with Genoa CPUs and InfiniBand
Bug #2055222 reported by Quesar
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ucx (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
Running MPI jobs or some ucx_perftest tests on Genoa CPUs with InfiniBand fails when ucx is built with a gcc version newer than 10.3, because compiler optimizations convert some code into "memmove" calls. I worked with Nvidia to identify and resolve the issues. Here are the links to the 2 patches that resolve them:
https:/
https:/
Please include these patches into the ucx package to resolve the issues.
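For readers unfamiliar with the optimization mentioned above, here is a minimal sketch of the general mechanism. It is not taken from the linked patches or from the ucx source. With optimization enabled, GCC's loop pattern recognition (-ftree-loop-distribute-patterns) may replace a simple copy loop with a call to a libc routine such as memcpy or memmove, and a per-function GCC attribute can ask it not to. The function and macro names below are hypothetical and only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* GCC-only: ask the compiler not to replace loops in the annotated function
 * with memcpy/memmove/memset calls. Expands to nothing on other compilers.
 * NO_LOOP_TO_LIBCALL is a hypothetical name chosen for this sketch. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL
#endif

/* Plain copy loop: with optimization enabled, GCC may emit a call to a
 * libc copy routine here instead of the loop. */
void copy_bytes_plain(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Same loop, but GCC is asked to keep it as a loop. */
NO_LOOP_TO_LIBCALL
void copy_bytes_keep_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Returns 1 if both variants produce the same bytes for this input
 * (inputs longer than the 64-byte scratch buffers are rejected with 0). */
int copies_agree(const unsigned char *src, size_t n)
{
    unsigned char a[64] = {0};
    unsigned char b[64] = {0};

    if (n > sizeof(a)) {
        return 0;
    }
    copy_bytes_plain(a, src, n);
    copy_bytes_keep_loop(b, src, n);
    return memcmp(a, b, n) == 0;
}
```

Both variants copy the same bytes. The difference is only visible in the generated assembly, for example by comparing the output of gcc -O2 -S for the two functions. Whether this is how the linked patches address the problem is not stated in this report.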
Can these patches be added to the ucx package please? This issue is affecting all Genoa clusters with InfiniBand.
Here's the type of error it causes:
root@rschhpc210:~# ucx_perftest
[1698428074.879303] [rschhpc210:13557:0] perftest.c:899 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.3.8.219:54350
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am latency |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 1048576 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 RC QP 0x3177 wqe[60241]: RDMA_READ s-- [rva 0x7fc08799c000 rkey 0x2f1b1] [va 0x7fc4e3f63000 len 1048576 lkey 0x1bdd26] [rqpn 0x102 dlid=33 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 13557) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc4e5535fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc4e5536176]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x25c9a) [0x7fc4e553ac9a]
3 /lib/x86_64-linux-gnu/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc4e55344a4]
4 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5ed) [0x7fc4e509d6fd]
5 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(+0x3eb16) [0x7fc4e50b9b16]
6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc4e55ed28a]
7 ucx_perftest(+0x416de) [0x56329edf56de]
8 ucx_perftest(+0x1ff92) [0x56329edd3f92]
9 ucx_perftest(+0x82ea) [0x56329edbc2ea]
10 ucx_perftest(+0x5a94) [0x56329edb9a94]
11 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc4e5229d90]
12 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc4e5229e40]
13 ucx_perftest(+0x6375) [0x56329edba375]
=================================
Aborted (core dumped)