DPDK with OVS cannot allocate CPU on different socket/processor

Bug #1719392 reported by Richard
Affects: dpdk (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

The procedure is almost the same as in https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/1719387

OS: Ubuntu 17.10, updated to the latest version
Hardware: ThunderX 2S system (Gigabyte R150)

Reproduction procedure
====
DPDKDEV1=0006:01:00.1
DPDKDEV2=0006:01:00.2

dpdk-devbind -b vfio-pci $DPDKDEV1
dpdk-devbind -b vfio-pci $DPDKDEV2

sysctl -w vm.nr_hugepages=24
umount /dev/hugepages
mount -t hugetlbfs none /dev/hugepages
grep HugePages_ /proc/meminfo

pkill ovs
sleep 5

rm -rf /etc/openvswitch/*
rm -rf /var/run/openvswitch/*
rm -rf /var/log/openvswitch/*

ovsdb-tool create /etc/openvswitch/conf.db /usr/share/openvswitch/vswitch.ovsschema
ovsdb-server --remote=punix:/var/run/openvswitch/db.sock --remote=db:Open_vSwitch,Open_vSwitch,manager_options --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --pidfile --detach --log-file

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"

ovs-vsctl --no-wait init
TXQ=2 ovs-vswitchd --pidfile --detach --log-file

ovs-vsctl del-br br0
ovs-vsctl --log-file=/var/log/openvswitch/ovs-ctl.log add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=${DPDKDEV1}
ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk options:dpdk-devargs=${DPDKDEV2}

ovs-ofctl add-flow br0 in_port=1,action=output:2
ovs-ofctl add-flow br0 in_port=2,action=output:1

ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6
ovs-vsctl set Interface dpdk0 options:n_rxq=2
ovs-vsctl set Interface dpdk1 options:n_rxq=2

====
If pmd-cpu-mask=0x6 is replaced with pmd-cpu-mask=0x30000000000030, still only two CPUs do any work.
Example htop result: in addition to the two CPUs at 100% load on 5/6, there should be two more busy CPUs on the second socket, but only two CPUs are busy in total.
https://www.flickr.com/photos/richliu_tw/37058355540/in/dateposted-public/
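
(For reference, pmd-cpu-mask is a plain bit mask of CPU numbers. A minimal bash sketch, not taken from this report, that decodes which CPUs a given mask enables:)

decode_mask() {
    local mask=$1 cpu
    # Walk the low 64 bits and print every CPU number whose bit is set.
    for cpu in $(seq 0 63); do
        if (( (mask >> cpu) & 1 )); then printf '%d ' "$cpu"; fi
    done
    printf '\n'
}
decode_mask 0x6                 # prints: 1 2
decode_mask 0x30000000000030    # prints: 4 5 52 53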

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
that is most interesting.
I have never tried spreading PMD assignments across NUMA nodes.
Are you able to check the CPU masks of the processes/threads it spawned?

I need to reproduce this to check it fully, but that will take a while, as I'm soon going on vacation and likely won't get to it before then.
If you have that cgroup data, that would be great.
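
(One way to get those per-thread masks - a sketch, assuming a single ovs-vswitchd instance and standard procfs/util-linux tools:)

PID=$(pidof ovs-vswitchd)
# Print thread id, thread name and allowed-CPU list for every ovs-vswitchd thread.
for tid in /proc/$PID/task/*; do
    printf '%s (%s): ' "$(basename $tid)" "$(cat $tid/comm)"
    grep Cpus_allowed_list $tid/status
done
# Alternatively, util-linux taskset can list all threads at once:
taskset -a -c -p $PID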

Revision history for this message
Richard (richliu) wrote :

I can do that. If you have a reference command or URL I can follow, that would be even better.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
like in your htop output, you can see the PIDs of the PMD threads.
There should be four, but obviously only the two that do the 100% busy poll are easy to spot.

You could start by getting the cgroup of all these processes with something like:
cat /proc/<pid>/cgroup
Report the full output of that here, along with which processes are currently the PMDs.

Next we want to look at the actual cgroups.
So if the former output contained something like:
10:cpuset:/foo
[...]
3:cpu,cpuacct:/bar.bar

You'd find those in
/sys/fs/cgroup/cpuset/foo/*
/sys/fs/cgroup/cpu/bar.bar/*
[...]

It is usually better to be complete and then filter, so you might attach the output of the following:
top -b -d 2 -n 1
ps axlf
for i in $(pgrep ovs); do printf "\nPID: %d\n" "$i"; cat /proc/$i/cgroup; done
for i in $(find /sys/fs/cgroup -name '*cpu*'); do if [ -f $i ]; then printf "\nGROUP: %s\n" "$i"; cat $i; fi; done

So step 1 gives us the mapping of processes to cgroups, and step 2 the configuration of the groups.
Redirect that to a file and attach it here.

In general I'm not even sure whether the PMDs might intentionally not all spin, but I need more data to come up with, e.g., a question for upstream.
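
(For reference, OVS can report this directly - a sketch, assuming the ovs-appctl commands available in the OVS version shipped with 17.10:)

# List every PMD thread, the core/NUMA node it runs on, and the rx queues it polls:
ovs-appctl dpif-netdev/pmd-rxq-show
# Per-PMD packet and cycle statistics, which show whether a PMD is actually doing work:
ovs-appctl dpif-netdev/pmd-stats-show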

Revision history for this message
Richard (richliu) wrote :

This is the openvswitch cgroup.

Revision history for this message
Richard (richliu) wrote :

This is the script output.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
I looked into this once more - the earlier data wasn't useful, as it seems OVS uses an internal task_set for affinity rather than cgroups.
But I have come to think this is intentional: IIRC OVS wants to keep the PMD local to the interface's NUMA node.
So it will use the CPUs on the same node as the interface, but not those on other nodes.

Could you report:
- $ numactl -H
- $ for i in /sys/bus/pci/devices/*; do printf "%s => %2s\n" $(basename $i) $(cat $i/numa_node); done

With that we could check whether that theory holds.
It would be perfect if you also had another card on another NUMA node, which should then only be served by that node's CPU cores.
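
(A sketch for cross-checking that theory: map every CPU enabled in pmd-cpu-mask to its NUMA node and compare with the NICs' nodes. The mask and PCI addresses below are just the example values from this report:)

MASK=0x30000000000030   # example pmd-cpu-mask from this report
for cpu in $(seq 0 63); do
    if (( (MASK >> cpu) & 1 )); then
        # Each CPU directory contains a nodeN symlink naming its NUMA node.
        echo "CPU $cpu -> $(basename /sys/devices/system/cpu/cpu$cpu/node*)"
    fi
done
for dev in 0006:01:00.1 0006:01:00.2; do
    echo "$dev -> numa_node $(cat /sys/bus/pci/devices/$dev/numa_node)"
done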

Changed in dpdk (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for dpdk (Ubuntu) because there has been no activity for 60 days.]

Changed in dpdk (Ubuntu):
status: Incomplete → Expired