kselftest net:reuseport_bpf_numa fails on J-nvidia-6.5 on ARM64 node hinyari (failed to pin to node)

Bug #2058441 reported by Po-Hsu Lin
Affects: ubuntu-kernel-tests
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

This issue can be found exclusively on ARM64 node "hinyari" with:
  * J-nvidia-6.5.0-1010.10
  * J-nvidia-6.5.0-1011.11
  * J-nvidia-6.5.0-1013.13

The failure message is identical to the one in bug 2029273, but I think it is better to track this in a separate report here.

Test log:
 Running 'make run_tests -C net TEST_PROGS=reuseport_bpf_numa TEST_GEN_PROGS='' TEST_CUSTOM_PROGS='''
 make: Entering directory '/home/ubuntu/autotest/client/tmp/ubuntu_kselftests_net/src/linux/tools/testing/selftests/net'
 TAP version 13
 1..1
 # timeout set to 0
 # selftests: net: reuseport_bpf_numa
 # ---- IPv4 UDP ----
 # send node 0, receive socket 0
 # ./reuseport_bpf_numa: failed to pin to node: Invalid argument
 not ok 1 selftests: net: reuseport_bpf_numa # exit=1
 make: Leaving directory '/home/ubuntu/autotest/client/tmp/ubuntu_kselftests_net/src/linux/tools/testing/selftests/net'
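
The "failed to pin to node: Invalid argument" line comes from the test pinning its sending thread to the target NUMA node with libnuma before transmitting. Roughly (paraphrased, not an exact copy of the upstream source), the failing check in tools/testing/selftests/net/reuseport_bpf_numa.c looks like this:

/* Paraphrased sketch, not the exact upstream code: the test pins itself
 * to the target node before sending; error() appends strerror(errno),
 * which is where "Invalid argument" (EINVAL) comes from. */
static void send_from_node(int node_id, int family, int proto)
{
	if (numa_run_on_node(node_id) < 0)
		error(1, errno, "failed to pin to node");

	/* ... send a packet and verify which reuseport socket received it ... */
}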

Revision history for this message
Jacob Martin (jacobmartin) wrote :

The same failure is also seen only on hinyari, with the following kernels:
* N-nvidia-6.8.0-1007.7
* N-nvidia-6.8.0-1008.8
* J-nvidia-6.8-6.8.0-1008.8~22.04.1

Revision history for this message
Jacob Martin (jacobmartin) wrote :

The node hinyari has online NUMA nodes 2-8 that have no associated CPUs. The test succeeds for node 0 and skips offline node 1. Once the test reaches node 2, libnuma's call to sched_setaffinity() passes an empty CPU mask (node 2 has no CPUs), which the kernel rejects with EINVAL, as expected for an empty affinity mask; this is the "Invalid argument" seen in the log.

I'm not yet sure what nodes 2-8 represent, nor why these seemingly empty nodes are deemed online.

ubuntu@hinyari:~$ numactl -H
available: 8 nodes (0,2-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 490285 MB
node 0 free: 486132 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node 0 2 3 4 5 6 7 8
  0: 10 80 80 80 80 80 80 80
  2: 80 10 255 255 255 255 255 255
  3: 80 255 10 255 255 255 255 255
  4: 80 255 255 10 255 255 255 255
  5: 80 255 255 255 10 255 255 255
  6: 80 255 255 255 255 10 255 255
  7: 80 255 255 255 255 255 10 255
  8: 80 255 255 255 255 255 255 10
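
Based on that topology, the failure can be reproduced outside the selftest. The following is an illustrative sketch only (it assumes the libnuma development headers and linking with -lnuma, and is not part of the selftest): it walks every node the kernel exposes and attempts the same pinning the test does, which should fail with "Invalid argument" only on the CPU-less nodes 2-8.

/* repro_pin.c - illustrative sketch, not part of the selftest.
 * Build: gcc -o repro_pin repro_pin.c -lnuma
 * For each NUMA node, print its CPU count and the result of
 * numa_run_on_node(), which ends up calling sched_setaffinity().
 * On a node with no CPUs the affinity mask is empty and the kernel
 * returns EINVAL. */
#include <errno.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct bitmask *cpus;
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not available\n");
		return 1;
	}

	cpus = numa_allocate_cpumask();
	for (node = 0; node <= numa_max_node(); node++) {
		/* Nodes the kernel does not expose (e.g. offline node 1 on
		 * hinyari) have no sysfs entry; skip them like the test does. */
		if (numa_node_to_cpus(node, cpus) < 0)
			continue;

		if (numa_run_on_node(node) < 0)
			printf("node %d: %u cpus, pin failed: %s\n",
			       node, numa_bitmask_weight(cpus), strerror(errno));
		else
			printf("node %d: %u cpus, pinned OK\n",
			       node, numa_bitmask_weight(cpus));
	}
	numa_free_cpumask(cpus);
	return 0;
}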

Revision history for this message
Jacob Martin (jacobmartin) wrote :

According to the NVIDIA Grace Performance Tuning Guide [1], NUMA node 0 is the Grace CPU, node 1 is the Hopper GPU, and nodes 2-8 are used for MIG (Multi-Instance GPU) [2].

[1] https://docs.nvidia.com/grace-performance-tuning-guide.pdf
[2] https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

tags: added: 6.8
tags: added: noble
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

This issue can be found on node hinyari with N-nvidia-lowlatency 6.8.0-1009.9.1 as well.
