DPDK 23.11.1 / OVS 3.3.0 cause test failures

Bug #2067889 reported by Christian Ehrhardt 
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
dpdk (Ubuntu)
Invalid
Undecided
Unassigned
openvswitch (Ubuntu)
Fix Released
Undecided
Unassigned
Noble
Confirmed
Undecided
Frode Nordahl

Bug Description

https://autopkgtest.ubuntu.com/packages/o/openvswitch/oracular/amd64

Bad
Oracular OVS test with DPDK 23.11.1-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240531_100906_6f957@/log.gz

Good
Oracular OVS test with DPDK 23.11-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240529_111623_6f8db@/log.gz

The logs are messy to read if not used to it, this is a try at debugging what is going wrong.
So far it seems other people have retried 14 times, and 14714 fail at the same tests - I think there is a real issue.

As shown below, one can find always these fail:
4: OVS-DPDK - ping vhost-user ports FAILED (system-dpdk.at:148)
5: OVS-DPDK - ping vhost-user-client ports FAILED (system-dpdk.at:224)
16: OVS-DPDK - MTU increase vport port FAILED (system-dpdk.at:609)
17: OVS-DPDK - MTU decrease vport port FAILED (system-dpdk.at:651)
20: OVS-DPDK - MTU upper bound vport port FAILED (system-dpdk.at:767)
21: OVS-DPDK - MTU lower bound vport port FAILED (system-dpdk.at:809)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Oracular OVS bad test with DPDK 23.11.1-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240531_100906_6f957@/log.gz

5047s 151 tests were successful.
5047s 36 tests were skipped.
5047s + update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch/ovs-vswitchd
5047s update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode
5047s /tmp/autopkgtest.62SGkT/build.KmH/src
5047s + dirs +1
5047s + popd
5047s + umount /UZ4
5047s + rmdir /UZ4
5047s + exit 1

Oracular OVS good test with DPDK 23.11-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240529_111623_6f8db@/log.gz

4957s 148 tests were successful.
4957s 49 tests were skipped.
4957s + cleanup
4957s + rc=0
4957s + set +e
4957s + '[' 0 -ne 0 ']'
4957s + '[' -L /usr/bin/nc ']'
4957s + rm -f /usr/bin/nc
4957s + '[' dpdk = dpdk ']'
4957s + mv /etc/dpdk/dpdk.conf.bak /etc/dpdk/dpdk.conf
4957s + systemctl restart dpdk
4957s + update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch/ovs-vswitchd
4957s update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode
4957s /tmp/autopkgtest.vQL7uJ/build.4Ev/src
4957s + dirs +1
4957s + popd
4957s + umount /S31
4957s + rmdir /S31
4957s + exit 0

The arches that worked are no counter to that:
 dpdk SKIP exit status 77 and marked as skippable

Theory, legit issue with the new DPDK breaking the OVS testcase

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

One has to realize, even the good path hits the cleanup function, but entered
with rc=0. It then cleans up and exits with the RC it got.

The bad case OTOH seems to do parts of the cleanup function, but not all of it.
It ends the tests, with more successful tests (151) but then enters cleanup
with a bad RC.

Note:
There is also an autopkgtest/nova issue in the bad log:
5063s /home/ubuntu/autopkgtest/ssh-setup/nova: 234: [: x86_64: unexpected operator
5063s /home/ubuntu/autopkgtest/ssh-setup/nova: 237: [: x86_64: unexpected operator
But myself and bryce think this is an orthogonal issue.

Diving into the log a bit deeper made me find:

4: OVS-DPDK - ping vhost-user ports FAILED (system-dpdk.at:148)
5: OVS-DPDK - ping vhost-user-client ports FAILED (system-dpdk.at:224)
16: OVS-DPDK - MTU increase vport port FAILED (system-dpdk.at:609)
17: OVS-DPDK - MTU decrease vport port FAILED (system-dpdk.at:651)
20: OVS-DPDK - MTU upper bound vport port FAILED (system-dpdk.at:767)
21: OVS-DPDK - MTU lower bound vport port FAILED (system-dpdk.at:809)

The partial confusion is from the test running multiple suites:
=> openvswitch-afxdp-testsuite Good 136 skip 39
=> system-dpdk-testsuite Good 142 skip 49 Fail 6
=> system-userspace-testsuite Good 151 skip 36
=> system-offloads-testsuite Good 152 skip 33

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (3.5 KiB)

After that one knows which sub-log to check, for ease of consumption
I've attached the subsection here that talks about the actually failing tests.

Fail #1
/tmpdir/_debian/tests/system-dpdk-testsuite.dir/004/cleanup: line 1: kill: (275003) - No such process
4. system-dpdk.at:102: 4. OVS-DPDK - ping vhost-user ports (system-dpdk.at:102): FAILED (system-dpdk.at:148)

Fail #2
> 2024-05-31T10:08:11.896Z|00087|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/005/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:11.896Z|00088|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/005/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1754: 275274 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
5. system-dpdk.at:158: 5. OVS-DPDK - ping vhost-user-client ports (system-dpdk.at:158): FAILED (system-dpdk.at:224)

Fail #3
> 2024-05-31T10:08:17.179Z|00120|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/016/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:17.179Z|00121|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/016/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1754: 275495 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
16. system-dpdk.at:577: 16. OVS-DPDK - MTU increase vport port (system-dpdk.at:577): FAILED (system-dpdk.at:609)

Fail #4
> 2024-05-31T10:08:22.436Z|00079|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/017/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:22.436Z|00080|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/017/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1767: 275658 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
17. system-dpdk.at:618: 17. OVS-DPDK - MTU decrease vport port (system-dpdk.at:618): FAILED (system-dpdk.at:651)

Fail #5
> 2024-05-31T10:08:28.375Z|00121|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:28.375Z|00122|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1754: 275858 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
20. system-dpdk.at:736: 20. OVS-DPDK - MTU upper bound vport port (system-dpdk.at:736): FAILED (system-dpdk.at:767)

Fail #6
> 2024-05-31T10:08:32.113Z|00079|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/021/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:32.113Z|00080|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/021/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1767: 276061 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
21. system-dpdk.at:778: 21. OVS-DPDK - MTU lower bound vport port (system-dpdk.at:778): FAILED (system-dpdk.at:809)

They seem to all stumble over some cleanup.
The question now is if that is a symptom (something b...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Again, is that the root cause or a symptom of a former failure?
Not all tests are very clear in that regard, but one I liked.
As it fails BEFORE this cleanup phase.

The "MTU upper bound vport port" test is like this:

> 2024-05-31T10:08:28.364Z|00117|netdev_dpdk|WARN|dpdkvhostuserclient0: unsupported MTU 9711
> 2024-05-31T10:08:28.364Z|00118|netdev|WARN|failed to set MTU for network device dpdkvhostuserclient0: Invalid argument
> 2024-05-31T10:08:28.374Z|00119|bridge|INFO|bridge br10: deleted interface dpdkvhostuserclient0 on port 1
> 2024-05-31T10:08:28.374Z|00120|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 3 destroyed.
> 2024-05-31T10:08:28.374Z|00002|dpdk(pmd-c03/id:10)|INFO|PMD thread released DPDK lcore 1.
> 2024-05-31T10:08:28.375Z|00121|dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0) free connfd 52
> 2024-05-31T10:08:28.375Z|00122|netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0' not found
tests/system-dpdk-testsuite: line 1754: 275858 Killed dpdk-testpmd $eal_options -- $testpmd_options > testpmd.log 2>&1
20. system-dpdk.at:736: 20. OVS-DPDK - MTU upper bound vport port (system-dpdk.at:736): FAILED (system-dpdk.at:767)

But this is a red-herring, as the setting of this is meant to fail.
The test says:
  "Set MTU value above upper bound and check for error"

So it might indeed be the cleanup that fails?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (3.1 KiB)

Trying to isolate the commands used in the failing test (works in a not too small Oracular VM):

0. Prep install and config
 apt install openvswitch-switch-dpdk dpdk-dev
 echo "NR_2M_PAGES=784" >> /etc/dpdk/dpdk.conf
 systemctl restart dpdk
 update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
 ovs-vsctl --no-wait init
 ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
 ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="--log-level=pmd.*:error --no-pci"
1. start OVS with DPDK support (this is the point where you'd restart new test iterations)
 systemctl stop openvswitch-switch ovs-vswitchd
 rm /var/log/openvswitch/ovs-vswitchd.log
 systemctl start openvswitch-switch ovs-vswitchd
 ovs-vsctl show
2. create an internal DPDK bridge with a vhostuserclient port with defined MTU
 ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev
 OVS_RUNDIR="/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020"
 rm -rf $OVS_RUNDIR
 mkdir -p $OVS_RUNDIR
 ovs-vsctl add-port br10 dpdkvhostuserclient0 -- set Interface dpdkvhostuserclient0 type=dpdkvhostuserclient options:vhost-server-path=$OVS_RUNDIR/dpdkvhostclient0
 ovs-vsctl show
3. connect a vhostuserclient server
 # Do this in a shell of its own, it will stay up
 dpdk-testpmd --in-memory --socket-mem=512 --single-file-segments --no-pci --vdev="net_virtio_user,path=$OVS_RUNDIR/dpdkvhostclient0,server=1 --vdev=net_tap0,iface=tap0 -- -a --stats-period 2 --nb-cores 2 --rxq 2 --txq 2
4. check if it is truly ready
 ovs-vsctl show
 grep "virtio is now ready for processing" /var/log/openvswitch/ovs-vswitchd.log
 # should have a happy dpdkvhostuserclient0
 ovs-vsctl get Interface dpdkvhostuserclient0 link_state
 # should say "up"
5. set happy MTU
 ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9702
 ovs-vsctl get Interface dpdkvhostuserclient0 mtu
 # should deliver 9702
6. set the bigger MTU
 ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9711
 # the ovs-vswitchd.log will say (expected)
   dpdkvhostuserclient0: unsupported MTU 9711
   failed to set MTU for network device dpdkvhostuserclient0: Invalid argument
 ovs-vsctl get Interface dpdkvhostuserclient0 mtu
   # will still deliver 9702
7. cleanup the port
 ovs-vsctl del-port br10 dpdkvhostuserclient0
 # Should work without bad RC, log in ovs-vswitchd.log will say:
    bridge|INFO|bridge br10: deleted interface dpdkvhostuserclient0 on port 1
    dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 1 destroyed.
    dpdk(pmd-c01/id:8)|INFO|PMD thread released DPDK lcore 1.
    dpdk|INFO|VHOST_CONFIG: (/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0) free connfd 50
    netdev_dpdk|INFO|vHost Device '/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0' not found
 ovs-vsctl del-br br10
 ovs-vsctl show
 # should show an empty switch

^^ Here I expected the issue with the new code (OVS + DPDK being new).
But sadly I have proven that the "not found" on the path was a red herring as well.
It comes up in the good path as well.
The good case just does not push the log output to the test-output.
Hence it seemed to be an issue "only in the bad case" bu...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From here I was capturing full logs of this with DPDK 23.11-1 and 23.11.1-1

23.11-1 console output:

root@o-dpdk-ovs:~# systemctl stop openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# rm /var/log/openvswitch/ovs-vswitchd.log
root@o-dpdk-ovs:~# systemctl start openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-a911-45d1-a5ad-1beeb7c142a1
    ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev
root@o-dpdk-ovs:~# OVS_RUNDIR="/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020"
root@o-dpdk-ovs:~# rm -rf $OVS_RUNDIR
root@o-dpdk-ovs:~# mkdir -p $OVS_RUNDIR
root@o-dpdk-ovs:~# ovs-vsctl add-port br10 dpdkvhostuserclient0 -- set Interface dpdkvhostuserclient0 type=dpdkvhostuserclient options:vhost-server-path=$OVS_RUNDIR/dpdkvhostclient0
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-a911-45d1-a5ad-1beeb7c142a1
    Bridge br10
        datapath_type: netdev
        Port dpdkvhostuserclient0
            Interface dpdkvhostuserclient0
                type: dpdkvhostuserclient
                options: {vhost-server-path="/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0"}
        Port br10
            Interface br10
                type: internal
    ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 link_state
up
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9702
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 mtu
9702
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9711
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 mtu
9702
root@o-dpdk-ovs:~# ovs-vsctl del-port br10 dpdkvhostuserclient0
root@o-dpdk-ovs:~# ovs-vsctl del-br br10
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-a911-45d1-a5ad-1beeb7c142a1
    ovs_version: "3.3.0"

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

23.11.1-1 console output

root@o-dpdk-ovs:~# systemctl stop openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# rm /var/log/openvswitch/ovs-vswitchd.log
root@o-dpdk-ovs:~# systemctl start openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-a911-45d1-a5ad-1beeb7c142a1
    ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev
root@o-dpdk-ovs:~# OVS_RUNDIR="/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020"
root@o-dpdk-ovs:~# rm -rf $OVS_RUNDIR
root@o-dpdk-ovs:~# mkdir -p $OVS_RUNDIR
root@o-dpdk-ovs:~# ovs-vsctl add-port br10 dpdkvhostuserclient0 -- set Interface dpdkvhostuserclient0 type=dpdkvhostuserclient options:vhost-server-path=$OVS_RUNDIR/dpdkvhostclient0
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-a911-45d1-a5ad-1beeb7c142a1
    Bridge br10
        datapath_type: netdev
        Port dpdkvhostuserclient0
            Interface dpdkvhostuserclient0
                type: dpdkvhostuserclient
                options: {vhost-server-path="/tmpdir/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0"}
        Port br10
            Interface br10
                type: internal
    ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 link_state
up
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9702
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 mtu
9702
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9711
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostuserclient0 mtu
9702
root@o-dpdk-ovs:~# ovs-vsctl del-port br10 dpdkvhostuserclient0
root@o-dpdk-ovs:~# ovs-vsctl del-br br10

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The only very unique entry I got here in those logs of the dpdk enabled
openvswitch in the bad case was:

00001|dpdk|ERR|TELEMETRY: Socket write base info to client failed

Not sure where to go with that, is that from the switch or the client even?
I might need to reach out if there are any known issues where, now that I have
some pointers, this rings a bell.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Frode mentioned that OVS 3.3.1 might get some bonus love in regard to DPDK 23.11.1 [1].
In the past that was only ever needed for new features, which the stable update does not have, but there is a chance that this is just a test bug I do not yet understand fully - so yeah, giving this a look with OVS 3.3.1 might be worth a try.

Worst case there is a chance that DPDK 23.11.1 only works with OVS >=3.3.1 which we need to express in dependencies than (which is odd as DPDK does not depend on OVS, but vice versa)

Also adding a tag to show up in excuses, as this is what makes those stuck in proposed migration.

tags: added: update-excuse
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The log message I have seen might be a red herring as well [1] added that but unless something insists on "non-ERR" that should not be the reason as it is not changing behavior.

But then, what else am I looking for, Frode confirmed that he will do some triage and debug from the OVS POV which might help to find where things actually break.

[1]: https://github.com/DPDK/dpdk-stable/commit/cbd1c165bb1ea86efc31c8ac309ee5e7c5f555cb

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

Directly running just the failing test (of the 6 I picked the MTU upper bound that I analyzed above) works like this:

$ apt install ubuntu-dev-tools conntrack curl devscripts dpdk-dev ethtool equivs net-tools ncat python3-pyftpdlib tcpdump wget openvswitch-common openvswitch-doc openvswitch-ipsec openvswitch-pki openvswitch-source openvswitch-switch openvswitch-switch-dpdk openvswitch-test openvswitch-testcontroller openvswitch-vtep python3-openvswitch
# special needs of the OVS tests
$ apt-get remove --yes --purge netcat-openbsd
$ ln -sf /usr/bin/ncat /usr/bin/nc
# Get the source and build it for the tests
$ pull-lp-source openvswitch
$ cd openvswitch-3.3.0/
$ mk-build-deps --install --tool "apt-get -o Debug::pkgProblemResolver=yes --no-install-recommends --yes" --remove debian/control
$ ./debian/rules build
# Hugepages for the tests
$ echo "NR_2M_PAGES=784" >> /etc/dpdk/dpdk.conf
$ systemctl restart dpdk
# Set dpdk enabled version in alternatives
$ update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk
# Disable conflicting system services
$ systemctl stop openvswitch-ipsec ovs-vswitchd ovsdb-server
# shorten paths to not break tests
$ BIND_MOUNT_DIR=$(mktemp -d /XXX)
$ mount --bind . "${BIND_MOUNT_DIR}"
$ cd "${BIND_MOUNT_DIR}"
# Run the full dpdk test suite (to ensure the case you want works with the usual context)
$ ./tests/system-dpdk-testsuite -C _debian/tests -j1 AUTOTEST_PATH="$(pwd)/tests:$(pwd)/_debian/tests"
# run selected test - in our case 20
$ ./tests/system-dpdk-testsuite -C _debian/tests -j1 AUTOTEST_PATH="$(pwd)/tests:$(pwd)/_debian/tests" 20

This way the good test works and can be re-run, setting up a bad case for comparison ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After the setup outlined above I was running the set the autopkgtest identified (4-5 16-17 20-21
). With that I was able to trigger good/bad case in a reproducible way.

In my repro-environment, test 5 also fails in the good case - which might just be my test environment's fault. Hence I'm comparing 4 16-17 20-21 now and will attach the detailed test logs for them for the good and bad case (which only differs in the DPDK version used 23.11 vs 23.11.1).

But the good case leaves no details of the execution in the log, hence I'll also collect and attach the full output of _debian/tests/system-dpdk-testsuite.dir which holds all these details.

Hopefully we find a pattern comparing the testcases one by one in these ...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Here the differences I spot on a high level:

4. OVS-DPDK - ping vhost-user ports
 - testpmd.log - not much difference
 - ovsdb-server.log
   ovsdb-server is killed (sig 15) in the bad case 7 seconds after starting
   that kill is after the test log ends, so it could be cleanup
 - ovs-vswitchd.log
   dpdk|ERR|TELEMETRY: Socket write base info to client failed (likely red herring)
   in addition to all-same messages like the good case, the bad case shows
   dpdk|ERR|remove_xdp_program(): bpf_xdp_query_id failed
   dpdk|ERR|eth_dev_close(): Error while removing XDP program.
 - system-dpdk-testsuite.log
   differs at the end, good case seems to enter cleanup while bad case does lots of things
   bad case has more content as it reports detailed steps
   can't cleanup testpmd (pid already gone)

16. OVS-DPDK - MTU increase vport port
17. OVS-DPDK - MTU decrease vport port (same)
20. OVS-DPDK - MTU upper bound vport port (same)
21. OVS-DPDK - MTU lower bound vport port (same)
 - testpmd.log
   good case sees ports going down and stopping
 - ovsdb-server.log
      ovsdb-server is killed (sig 15) in the bad case a few seconds after starting
      that kill is after the test log ends, so it could be cleanup
 - ovs-vswitchd.log
      dpdk|ERR|TELEMETRY: Socket write base info to client failed (likely red herring)
 - system-dpdk-testsuite.log
   differs at the end, good case seems to enter cleanup while bad case does lots of things
   bad case has more content as it reports detailed steps
   can't cleanup testpmd (pid already gone)

So on case 4 it could be the xdp error which is new.
On the other cases only the telemetry is odd, but that seemed (see above) like a false positive.
Yet there are chances a component I didn't see yet tests on no ERR messages.
Worth a rebuild with this reverted?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (6.1 KiB)

With what I know from comment #14 and comment #5 I tried to get deeper to finally find what actually goes wrong which still eludes me :-/

Env vars that we need:
OVS_RUNDIR=/var/run/openvswitch/
abs_top_srcdir=../_debian/openvswitch/

1. db creation + starting ovsdb with --detach in background + checking if it has started
=> logs in /var/log/openvswitch/ovsdb-server.log

  ../../tests/system-dpdk.at:741: touch .conf.db.~lock~
  ../../tests/system-dpdk.at:741: ovsdb-tool create conf.db $abs_top_srcdir/vswitchd/vswitch.ovsschema
  ../../tests/system-dpdk.at:741: ovsdb-server --detach --no-chdir --pidfile --log-file --remote=punix:$OVS_RUNDIR/db.sock

2. initialize and start vswitch in background
=> logs in /var/log/openvswitch/ovs-vswitchd.log

../../tests/system-dpdk.at:741: ovs-vsctl --no-wait init
../../tests/system-dpdk.at:741: ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
../../tests/system-dpdk.at:741: ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="--log-level=pmd.*:error --no-pci"
../../tests/system-dpdk.at:741: ovs-vswitchd --detach --no-chdir --pidfile --log-file -vvconn -vofproto_dpif -vunixctl

3. add bridge and port
../../tests/system-dpdk.at:744: ovs-vsctl add-br br10 -- set bridge br10 datapath_type=netdev
../../tests/system-dpdk.at:745: ovs-vsctl add-port br10 dpdkvhostuserclient0 -- \
              set Interface dpdkvhostuserclient0 type=dpdkvhostuserclient options:vhost-server-path=$OVS_RUNDIR/dpdkvhostclient0

4. check for log to be as expected
system-dpdk.at:745: waiting until grep "VHOST_CONFIG: ($OVS_RUNDIR/dpdkvhostclient0) vhost-user client: socket created" ovs-vswitchd.log...
2024-06-05T08:16:32.191Z|00064|dpdk|INFO|VHOST_CONFIG: (/cPJ/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0) vhost-user client: socket created, fd: 52
system-dpdk.at:745: wait succeeded immediately
system-dpdk.at:745: waiting until grep "vHost User device 'dpdkvhostuserclient0' created in 'client' mode, using client socket" ovs-vswitchd.log...
2024-06-05T08:16:32.191Z|00065|netdev_dpdk|INFO|vHost User device 'dpdkvhostuserclient0' created in 'client' mode, using client socket '/cPJ/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0'
system-dpdk.at:745: wait succeeded immediately
system-dpdk.at:745: waiting until grep "VHOST_CONFIG: ($OVS_RUNDIR/dpdkvhostclient0) reconnecting..." ovs-vswitchd.log...
2024-06-05T08:16:32.191Z|00067|dpdk|INFO|VHOST_CONFIG: (/cPJ/_debian/tests/system-dpdk-testsuite.dir/020/dpdkvhostclient0) reconnecting...
system-dpdk.at:745: wait succeeded immediately

=> so far, all good in both cases

6. start testpmd in other console
dpdk-testpmd --in-memory --socket-mem=512 --single-file-segments --no-pci --vdev="net_virtio_user,path=$OVS_RUNDIR/dpdkvhostclient0,server=1"

Only diff so far, the known:
  dpdk|ERR|TELEMETRY: Socket write base info to client failed

5. set valid MTU

../../tests/system-dpdk.at:746: ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9702
../../tests/system-dpdk.at:747: ovs-vsctl show

6. set invalid MTU

../../tests/system-dpdk.at:756: ovs-vsctl get Interface dpdkvhostuserclient0 mtu
../../tests/system-dpdk.at:761: ovs-vsctl set I...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

Thanks to m4 expansion I had to change it in tests/system-dpdk-testsuite for retesting.
The real fix would be in ./tests/system-dpdk-macros.at though.

Adding the following to the rules made it work:
  /TELEMETRY: Socket write base info to client failed/d

Now the questions is "is this a warning that is fine to have in the tests or is this masking a legit error?"

I'll propose that to openswitch setting a set of people on CC that deal with those topics.
In that discussion we can hopefully conclude the above question.

P.S.That was a long strange trip - I wish the test would say something more clear like "I found this .... in the log which I didn't expect", but since it does not (clear enough to me at least) that was a debugging rabbit hole.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

We found a different solution that fixes even more.

Included as:
commit cc99622485062b902362ffa3a08317dc0f926164
Author: David Marchand <email address hidden>
Date: Thu Jun 6 15:11:11 2024 +0200

    system-dpdk: Fix socket conflict when starting testpmd.

Which is in upstream stable update v3.3.1

Frode will as usual prepare v3.3.1 which will then in turn fix the tests.
There is no real issue which would imply needing a versioned dependency.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

New OVS is in Oracular
  https://launchpad.net/ubuntu/+source/openvswitch/3.3.1-1

Some tests already were happy in old runs, ppc64 and arm64 have not run the way they need to.
I've triggered them with the new DPDK and the new OVS together.

Unless some other, formerly masked, test issue affects that it should be good now.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Fully complete in Oracular now and DPDK/OVS migrated together
https://launchpad.net/ubuntu/+source/dpdk/23.11.1-1

@Frode - will you prepare an OVS 3.3.1 for Noble so we can do the DPDK stable update afterwards?

no longer affects: dpdk (Ubuntu Noble)
Changed in dpdk (Ubuntu):
status: New → Invalid
Changed in openvswitch (Ubuntu):
status: New → Fix Released
Changed in openvswitch (Ubuntu Noble):
status: New → Confirmed
assignee: nobody → Frode Nordahl (fnordahl)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.