kselftest net:reuseport_bpf_numa fails on J-nvidia-6.5 on ARM64 node hinyari (failed to pin to node)

Bug #2058441 reported by Po-Hsu Lin
Affects: ubuntu-kernel-tests
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

This issue can be found exclusively on ARM64 node "hinyari" with:
  * J-nvidia-6.5.0-1010.10
  * J-nvidia-6.5.0-1011.11
  * J-nvidia-6.5.0-1013.13

The failure message is identical to the one in bug 2029273, but I think it is better to track this in a separate report here.

Test log:
 Running 'make run_tests -C net TEST_PROGS=reuseport_bpf_numa TEST_GEN_PROGS='' TEST_CUSTOM_PROGS='''
 make: Entering directory '/home/ubuntu/autotest/client/tmp/ubuntu_kselftests_net/src/linux/tools/testing/selftests/net'
 TAP version 13
 1..1
 # timeout set to 0
 # selftests: net: reuseport_bpf_numa
 # ---- IPv4 UDP ----
 # send node 0, receive socket 0
 # ./reuseport_bpf_numa: failed to pin to node: Invalid argument
 not ok 1 selftests: net: reuseport_bpf_numa # exit=1
 make: Leaving directory '/home/ubuntu/autotest/client/tmp/ubuntu_kselftests_net/src/linux/tools/testing/selftests/net'
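
The "failed to pin to node: Invalid argument" line comes from the test pinning its sending thread to the target NUMA node with libnuma before transmitting. Roughly (paraphrased, not an exact copy of the upstream source), the failing check in tools/testing/selftests/net/reuseport_bpf_numa.c looks like this:

/* Paraphrased sketch, not the exact upstream code: the test pins itself
 * to the target node before sending; error() appends strerror(errno),
 * which is where "Invalid argument" (EINVAL) comes from. */
static void send_from_node(int node_id, int family, int proto)
{
	if (numa_run_on_node(node_id) < 0)
		error(1, errno, "failed to pin to node");

	/* ... send a packet and verify which reuseport socket received it ... */
}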

Revision history for this message
Jacob Martin (jacobmartin) wrote :

The same failure is also seen only on hinyari, with the following kernels:
* N-nvidia-6.8.0-1007.7
* N-nvidia-6.8.0-1008.8
* J-nvidia-6.8-6.8.0-1008.8~22.04.1

Revision history for this message
Jacob Martin (jacobmartin) wrote :

The node hinyari has online NUMA nodes 2-8 that have no associated CPUs. The test succeeds for node 0 and skips offline node 1. Once the test reaches node 2, libnuma's call to sched_setaffinity() passes an empty CPU mask (node 2 has no CPUs), which the kernel rejects with EINVAL, as expected for an empty affinity mask; this is the "Invalid argument" seen in the log.

I'm not yet sure what nodes 2-8 represent, nor why these seemingly empty nodes are deemed online.

ubuntu@hinyari:~$ numactl -H
available: 8 nodes (0,2-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 490285 MB
node 0 free: 486132 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node 0 2 3 4 5 6 7 8
  0: 10 80 80 80 80 80 80 80
  2: 80 10 255 255 255 255 255 255
  3: 80 255 10 255 255 255 255 255
  4: 80 255 255 10 255 255 255 255
  5: 80 255 255 255 10 255 255 255
  6: 80 255 255 255 255 10 255 255
  7: 80 255 255 255 255 255 10 255
  8: 80 255 255 255 255 255 255 10
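
Based on that topology, the failure can be reproduced outside the selftest. The following is an illustrative sketch only (it assumes the libnuma development headers and linking with -lnuma, and is not part of the selftest): it walks every node the kernel exposes and attempts the same pinning the test does, which should fail with "Invalid argument" only on the CPU-less nodes 2-8.

/* repro_pin.c - illustrative sketch, not part of the selftest.
 * Build: gcc -o repro_pin repro_pin.c -lnuma
 * For each NUMA node, print its CPU count and the result of
 * numa_run_on_node(), which ends up calling sched_setaffinity().
 * On a node with no CPUs the affinity mask is empty and the kernel
 * returns EINVAL. */
#include <errno.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct bitmask *cpus;
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not available\n");
		return 1;
	}

	cpus = numa_allocate_cpumask();
	for (node = 0; node <= numa_max_node(); node++) {
		/* Nodes the kernel does not expose (e.g. offline node 1 on
		 * hinyari) have no sysfs entry; skip them like the test does. */
		if (numa_node_to_cpus(node, cpus) < 0)
			continue;

		if (numa_run_on_node(node) < 0)
			printf("node %d: %u cpus, pin failed: %s\n",
			       node, numa_bitmask_weight(cpus), strerror(errno));
		else
			printf("node %d: %u cpus, pinned OK\n",
			       node, numa_bitmask_weight(cpus));
	}
	numa_free_cpumask(cpus);
	return 0;
}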

Revision history for this message
Jacob Martin (jacobmartin) wrote :

According to the NVIDIA Grace Performance Tuning Guide [1], NUMA node 0 is the Grace CPU, node 1 is the Hopper GPU, and nodes 2-8 are used for MIG (Multi-Instance GPU) [2].

[1] https://docs.nvidia.com/grace-performance-tuning-guide.pdf
[2] https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

tags: added: 6.8
tags: added: noble
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

This issue can be found on node hinyari with N-nvidia-lowlatency 6.8.0-1009.9.1 as well.
