Comment 0 for bug 1947206

dann frazier (dannf) wrote :

[Impact]
Nvidia notified me via private email that they had discovered some issues with the ib_peer_memory patch we are carrying in hirsute/impish, and sent me a patch intended to resolve them. My knowledge of these changes is limited to what the commit message mentions:

- Allow clients to opt out of unmap during invalidation
- Fix some bugs in the sequencing of mlx5 MRs
- Enable ATS for peer memory

[Test Case]
ib_write_bw from the perftest package, rebuilt with CUDA support, can be used as a smoke test of this feature. I'll attach a sample test script here. I've verified this test passes with the kernels in the archive, and continues to pass with the provided patch applied.
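
For illustration only, here is a minimal sketch of how the two sides of such a run might be invoked. It is not the attached script. It assumes a perftest build with CUDA support, which provides the --use_cuda=<GPU index> option, and it uses a hypothetical device name (mlx5_0), GPU index (0), and server address (192.0.2.10). The helper ib_write_bw_cmd is likewise made up for this example and only builds and prints the command lines; it does not run them.

```python
import shlex


def ib_write_bw_cmd(ib_dev, cuda_dev, server=None):
    """Build an ib_write_bw argument list.

    ib_dev   -- RDMA device name passed to -d (hypothetical: "mlx5_0")
    cuda_dev -- GPU index passed to --use_cuda (needs a CUDA-enabled perftest)
    server   -- address of the listening side; None builds the listener command
    """
    argv = ["ib_write_bw", "-d", ib_dev, f"--use_cuda={cuda_dev}"]
    if server is not None:
        # The client side names the listener as a trailing positional argument.
        argv.append(server)
    return argv


# Hypothetical values; substitute the real device, GPU index, and address.
listener = ib_write_bw_cmd("mlx5_0", 0)
client = ib_write_bw_cmd("mlx5_0", 0, server="192.0.2.10")

print(shlex.join(listener))  # run this on the first host
print(shlex.join(client))    # then run this on the second host
```

The listener command is started first and waits for the client to connect. A run that completes and reports bandwidth figures on both sides is the pass condition for a smoke test of this kind.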

[Fix]
Nvidia has emailed me fixes for both trees. They are not currently available in a public tree elsewhere, though I'm told at some point they should end up in a branch here:
  https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/

[What could go wrong]
The only known users of ib_peer_memory are Nvidia GPU users making use of the GPU PeerDirect feature, where GPUs can share memory with one another over an InfiniBand network. Bugs here could cause problems (hangs, crashes, corruption) with such workloads.