ucx library fails with Genoa CPUs and InfiniBand

Bug #2055222 reported by Quesar
This bug affects 3 people
Affects: ucx (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Running MPI jobs or some ucx_perftest tests on Genoa CPUs with InfiniBand fails when ucx is built with a gcc version newer than 10.3, because of optimizations that convert code into "memmove" calls. I worked with Nvidia to identify and resolve the issue. Here are the links to the two patches that resolve it:

https://github.com/openucx/ucx/pull/9692
https://github.com/openucx/ucx/pull/9714

Please include these patches in the ucx package to resolve the issue.
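
For readers unfamiliar with the optimization mentioned above, here is a minimal sketch in C of how gcc can turn a copy loop into a library call. It is not taken from the linked patches or from the ucx source, and the function names (copy_words_plain, copy_words_volatile, copy_self_check) are hypothetical. It only illustrates the general mechanism: at higher optimization levels gcc may recognize a plain copy loop and replace it with a call to memcpy or memmove, while a loop that stores through a volatile pointer has to be emitted store by store. The report does not say which code in ucx was affected or how the patches change it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain word-copy loop. With optimization enabled, gcc may recognize this
 * pattern and emit a call to memcpy/memmove instead of the loop, so the
 * actual store instructions are chosen by libc at run time. */
void copy_words_plain(uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Same loop, but the destination is volatile. Every store is a volatile
 * access that the compiler must emit individually, so the loop cannot be
 * folded into a library call. */
void copy_words_volatile(volatile uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Hypothetical self-check: in ordinary memory both variants must produce
 * the same result. Returns 0 on success, 1 on mismatch. */
int copy_self_check(void)
{
    uint64_t src[8], a[8], b[8];

    for (size_t i = 0; i < 8; i++) {
        src[i] = 0x0101010101010101ULL * (i + 1);
    }
    memset(a, 0, sizeof(a));
    memset(b, 0, sizeof(b));

    copy_words_plain(a, src, 8);
    copy_words_volatile(b, src, 8);

    return (memcmp(a, src, sizeof(src)) == 0 &&
            memcmp(b, src, sizeof(src)) == 0) ? 0 : 1;
}
```

One way to see which form the compiler chose is to build the file with "gcc -O3 -S" and search the generated assembly for memcpy or memmove. Whether the call appears depends on the gcc version and flags; gcc's -fno-tree-loop-distribute-patterns option disables this particular loop-to-library-call transformation.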

Revision history for this message
Quesar (rick-microway) wrote :

Can these patches be added to the ucx package, please? This issue is affecting all Genoa clusters with InfiniBand.

Here's the type of error it causes:

root@rschhpc210:~# ucx_perftest
[1698428074.879303] [rschhpc210:13557:0] perftest.c:899 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.3.8.219:54350
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am latency |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 1048576 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[rschhpc210:13557:0:13557] ib_mlx5_log.c:162 RC QP 0x3177 wqe[60241]: RDMA_READ s-- [rva 0x7fc08799c000 rkey 0x2f1b1] [va 0x7fc4e3f63000 len 1048576 lkey 0x1bdd26] [rqpn 0x102 dlid=33 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 13557) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc4e5535fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(ucs_fatal_error_message+0xb6) [0x7fc4e5536176]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x25c9a) [0x7fc4e553ac9a]
3 /lib/x86_64-linux-gnu/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc4e55344a4]
4 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5ed) [0x7fc4e509d6fd]
5 /lib/x86_64-linux-gnu/ucx/libuct_ib.so.0(+0x3eb16) [0x7fc4e50b9b16]
6 /lib/x86_64-linux-gnu/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc4e55ed28a]
7 ucx_perftest(+0x416de) [0x56329edf56de]
8 ucx_perftest(+0x1ff92) [0x56329edd3f92]
9 ucx_perftest(+0x82ea) [0x56329edbc2ea]
10 ucx_perftest(+0x5a94) [0x56329edb9a94]
11 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc4e5229d90]
12 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc4e5229e40]
13 ucx_perftest(+0x6375) [0x56329edba375]
=================================
Aborted (core dumped)

Quesar (rick-microway) wrote :

I reproduced this on a Sapphire Rapids cluster now too, and the same patches fixed it.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ucx (Ubuntu):
status: New → Confirmed
Quesar (rick-microway) wrote :

This bug report includes the solution. Can someone please acknowledge and respond to it? This is an easy fix at this point, but it has been ignored for over a month already.
