DPDK 23.11.1 / OVS 3.3.0 cause test failures
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
dpdk (Ubuntu) | Invalid | Undecided | Unassigned |
openvswitch (Ubuntu) | Fix Released | Undecided | Unassigned |
Noble | Confirmed | Undecided | Frode Nordahl |
Bug Description
https:/
Bad
Oracular OVS test with DPDK 23.11.1-1
https:/
Good
Oracular OVS test with DPDK 23.11-1
https:/
The logs are messy to read if you are not used to them; this is an attempt at debugging what is going wrong.
So far it seems other people have retried 14 times, and 14/14 fail at the same tests, so I think there is a real issue.
As shown below, these tests always fail:
4: OVS-DPDK - ping vhost-user ports FAILED (system-
5: OVS-DPDK - ping vhost-user-client ports FAILED (system-
16: OVS-DPDK - MTU increase vport port FAILED (system-
17: OVS-DPDK - MTU decrease vport port FAILED (system-
20: OVS-DPDK - MTU upper bound vport port FAILED (system-
21: OVS-DPDK - MTU lower bound vport port FAILED (system-
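To pull just the failing test numbers out of a long log, a small filter helps. This is a minimal sketch, assuming the summary format shown above ("N: description FAILED (...)") and an optional autopkgtest timestamp prefix such as "5047s "; the function name failed_tests is made up for illustration.

```shell
# Hypothetical helper: read a test log on stdin and print the numbers of
# all tests reported as FAILED, separated by spaces.
failed_tests() {
    # Optional "<seconds>s " prefix, optional indentation, then "N: ... FAILED (".
    sed -n 's/^\([0-9]*s \)\{0,1\} *\([0-9][0-9]*\): .* FAILED (.*$/\2/p' |
        tr '\n' ' ' | sed 's/ $//'
}
```

Feeding it the decompressed autopkgtest log, for example with zcat log.gz | failed_tests, should print one number per failed test.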
Christian Ehrhardt (paelzer) wrote : | #1 |
Christian Ehrhardt (paelzer) wrote : | #2 |
One has to realize, even the good path hits the cleanup function, but entered
with rc=0. It then cleans up and exits with the RC it got.
The bad case OTOH seems to do parts of the cleanup function, but not all of it.
It ends the tests, with more successful tests (151) but then enters cleanup
with a bad RC.
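The pattern described here, a cleanup handler that records the exit status on entry and exits with that same status, can be sketched as follows. This is an illustration only, not the actual autopkgtest script; the function names and the message are assumptions.

```shell
# Cleanup handler that preserves the exit status it was entered with.
cleanup() {
    rc=$?          # status of whatever ended the (sub)shell
    set +e         # keep cleaning up even if a cleanup step fails
    if [ "$rc" -ne 0 ]; then
        echo "tests failed with rc=$rc" >&2
    fi
    # ... restore configuration, unmount, remove temporary directories ...
    exit "$rc"     # hand the original status back to the caller
}

# Hypothetical helper: run a command in a subshell with the handler installed.
run_with_cleanup() {
    (
        trap cleanup EXIT
        "$@"
    )
}
```

With this shape, a run that enters cleanup with rc=0 exits 0, and a run that enters it with a bad RC exits with that bad RC, regardless of how much of the cleanup succeeded.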
Note:
There is also an autopkgtest/nova issue in the bad log:
5063s /home/ubuntu/
5063s /home/ubuntu/
But Bryce and I think this is an orthogonal issue.
Diving into the log a bit deeper made me find:
4: OVS-DPDK - ping vhost-user ports FAILED (system-
5: OVS-DPDK - ping vhost-user-client ports FAILED (system-
16: OVS-DPDK - MTU increase vport port FAILED (system-
17: OVS-DPDK - MTU decrease vport port FAILED (system-
20: OVS-DPDK - MTU upper bound vport port FAILED (system-
21: OVS-DPDK - MTU lower bound vport port FAILED (system-
The partial confusion is from the test running multiple suites:
=> openvswitch-
=> system-
=> system-
=> system-
Christian Ehrhardt (paelzer) wrote : | #3 |
After that one knows which sub-log to check, for ease of consumption
I've attached the subsection here that talks about the actually failing tests.
Fail #1
/tmpdir/
4. system-dpdk.at:102: 4. OVS-DPDK - ping vhost-user ports (system-
Fail #2
> 2024-05-
> 2024-05-
tests/system-
5. system-dpdk.at:158: 5. OVS-DPDK - ping vhost-user-client ports (system-
Fail #3
> 2024-05-
> 2024-05-
tests/system-
16. system-dpdk.at:577: 16. OVS-DPDK - MTU increase vport port (system-
Fail #4
> 2024-05-
> 2024-05-
tests/system-
17. system-dpdk.at:618: 17. OVS-DPDK - MTU decrease vport port (system-
Fail #5
> 2024-05-
> 2024-05-
tests/system-
20. system-dpdk.at:736: 20. OVS-DPDK - MTU upper bound vport port (system-
Fail #6
> 2024-05-
> 2024-05-
tests/system-
21. system-dpdk.at:778: 21. OVS-DPDK - MTU lower bound vport port (system-
They seem to all stumble over some cleanup.
The question now is if that is a symptom (something b...
Christian Ehrhardt (paelzer) wrote : | #4 |
Again, is that the root cause or a symptom of an earlier failure?
Not all tests are very clear in that regard, but there is one I liked because it fails BEFORE this cleanup phase.
The "MTU upper bound vport port" test is like this:
> 2024-05-
> 2024-05-
> 2024-05-
> 2024-05-
> 2024-05-
> 2024-05-
> 2024-05-
tests/system-
20. system-dpdk.at:736: 20. OVS-DPDK - MTU upper bound vport port (system-
But this is a red herring, as setting this value is meant to fail.
The test says:
"Set MTU value above upper bound and check for error"
So it might indeed be the cleanup that fails?
Christian Ehrhardt (paelzer) wrote : | #5 |
Trying to isolate the commands used in the failing test (works in a not too small Oracular VM):
0. Prep install and config
apt install openvswitch-
echo "NR_2M_PAGES=784" >> /etc/dpdk/dpdk.conf
systemctl restart dpdk
update-
ovs-vsctl --no-wait init
ovs-vsctl --no-wait set Open_vSwitch . other_config:
ovs-vsctl --no-wait set Open_vSwitch . other_config:
1. start OVS with DPDK support (this is the point where you'd restart new test iterations)
systemctl stop openvswitch-switch ovs-vswitchd
rm /var/log/
systemctl start openvswitch-switch ovs-vswitchd
ovs-vsctl show
2. create an internal DPDK bridge with a vhostuserclient port with defined MTU
ovs-vsctl add-br br10 -- set bridge br10 datapath_
OVS_RUNDIR=
rm -rf $OVS_RUNDIR
mkdir -p $OVS_RUNDIR
ovs-vsctl add-port br10 dpdkvhostusercl
ovs-vsctl show
3. connect a vhostuserclient server
# Do this in a shell of its own, it will stay up
dpdk-testpmd --in-memory --socket-mem=512 --single-
4. check if it is truly ready
ovs-vsctl show
grep "virtio is now ready for processing" /var/log/
# should have a happy dpdkvhostusercl
ovs-vsctl get Interface dpdkvhostusercl
# should say "up"
5. set happy MTU
ovs-vsctl set Interface dpdkvhostusercl
ovs-vsctl get Interface dpdkvhostusercl
# should deliver 9702
6. set the bigger MTU
ovs-vsctl set Interface dpdkvhostusercl
# the ovs-vswitchd.log will say (expected)
dpdkvhostuse
failed to set MTU for network device dpdkvhostusercl
ovs-vsctl get Interface dpdkvhostusercl
# will still deliver 9702
7. cleanup the port
ovs-vsctl del-port br10 dpdkvhostusercl
# Should work without bad RC, log in ovs-vswitchd.log will say:
bridge|
dpif_
dpdk(
dpdk|
netdev_
ovs-vsctl del-br br10
ovs-vsctl show
# should show an empty switch
^^ Here I expected the issue with the new code (OVS + DPDK being new).
But sadly I have proven that the "not found" on the path was a red herring as well.
It comes up in the good path as well.
The good case just does not push the log output to the test-output.
Hence it seemed to be an issue "only in the bad case" bu...
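Step 4 above checks the log by hand. When scripting the reproduction, a polling helper avoids grepping before the message has been written. A minimal sketch; wait_until is a hypothetical helper, not part of OVS or DPDK, and LOG in the usage note stands in for the vswitchd log path, which is truncated above.

```shell
# Hypothetical helper: wait_until TIMEOUT CMD [ARGS...]
# Re-run CMD once per second until it succeeds; give up after TIMEOUT seconds.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # command never succeeded within the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, wait_until 30 grep -q "virtio is now ready for processing" "$LOG" would block until the port is ready or 30 seconds have passed.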
Christian Ehrhardt (paelzer) wrote : | #6 |
From here I was capturing full logs of this with DPDK 23.11-1 and 23.11.1-1
23.11-1 console output:
root@o-dpdk-ovs:~# systemctl stop openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# rm /var/log/
root@o-dpdk-ovs:~# systemctl start openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-
ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl add-br br10 -- set bridge br10 datapath_
root@o-dpdk-ovs:~# OVS_RUNDIR=
root@o-dpdk-ovs:~# rm -rf $OVS_RUNDIR
root@o-dpdk-ovs:~# mkdir -p $OVS_RUNDIR
root@o-dpdk-ovs:~# ovs-vsctl add-port br10 dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-
Bridge br10
Port dpdkvhostusercl
Port br10
ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
up
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
9702
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
9702
root@o-dpdk-ovs:~# ovs-vsctl del-port br10 dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl del-br br10
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-
ovs_version: "3.3.0"
Christian Ehrhardt (paelzer) wrote : | #7 |
23.11.1-1 console output
root@o-dpdk-ovs:~# systemctl stop openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# rm /var/log/
root@o-dpdk-ovs:~# systemctl start openvswitch-switch ovs-vswitchd
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-
ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl add-br br10 -- set bridge br10 datapath_
root@o-dpdk-ovs:~# OVS_RUNDIR=
root@o-dpdk-ovs:~# rm -rf $OVS_RUNDIR
root@o-dpdk-ovs:~# mkdir -p $OVS_RUNDIR
root@o-dpdk-ovs:~# ovs-vsctl add-port br10 dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl show
9a966e25-
Bridge br10
Port dpdkvhostusercl
Port br10
ovs_version: "3.3.0"
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
up
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
9702
root@o-dpdk-ovs:~# ovs-vsctl set Interface dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl get Interface dpdkvhostusercl
9702
root@o-dpdk-ovs:~# ovs-vsctl del-port br10 dpdkvhostusercl
root@o-dpdk-ovs:~# ovs-vsctl del-br br10
Christian Ehrhardt (paelzer) wrote : | #8 |
The only unique entry I found in the logs of the DPDK-enabled
openvswitch in the bad case was:
00001|dpdk|
Not sure where to go with that. Is that from the switch, or even from the client?
Now that I have some pointers, I might need to reach out and ask whether
this rings a bell as a known issue.
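Finding entries that appear only in the bad log is easier once timestamps and sequence numbers are stripped. A minimal sketch, assuming the usual OVS log layout "timestamp|sequence|module|level|message"; both function names are made up for illustration.

```shell
# Strip the leading "<timestamp>|<sequence>|" from each log line and sort,
# so that two runs become comparable.
normalize_log() {
    sed 's/^[^|]*|[0-9]*|//' "$1" | sort -u
}

# Hypothetical helper: only_in_bad GOOD_LOG BAD_LOG
# Print normalized lines that occur in BAD_LOG but not in GOOD_LOG.
only_in_bad() {
    good_tmp=$(mktemp)
    bad_tmp=$(mktemp)
    normalize_log "$1" > "$good_tmp"
    normalize_log "$2" > "$bad_tmp"
    comm -13 "$good_tmp" "$bad_tmp"
    rm -f "$good_tmp" "$bad_tmp"
}
```

Messages that embed PIDs or temporary paths will still differ between runs and show up as noise, so the output is a starting point rather than a verdict.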
Christian Ehrhardt (paelzer) wrote : | #9 |
Christian Ehrhardt (paelzer) wrote : | #10 |
Christian Ehrhardt (paelzer) wrote : | #11 |
Christian Ehrhardt (paelzer) wrote : | #12 |
Frode mentioned that OVS 3.3.1 might get some bonus love in regard to DPDK 23.11.1 [1].
In the past that was only ever needed for new features, which the stable update does not have, but there is a chance that this is just a test bug I do not yet understand fully - so yeah, giving this a look with OVS 3.3.1 might be worth a try.
Worst case, there is a chance that DPDK 23.11.1 only works with OVS >=3.3.1, which we would then need to express in dependencies (which is odd, as DPDK does not depend on OVS, but vice versa).
Also adding a tag to show up in excuses, as this is what makes those stuck in proposed migration.
tags: | added: update-excuse |
Christian Ehrhardt (paelzer) wrote : | #13 |
The log message I have seen might be a red herring as well. [1] added it, but unless something insists on "non-ERR" output, that should not be the reason, as it does not change behavior.
But then, what else am I looking for? Frode confirmed that he will do some triage and debugging from the OVS POV, which might help to find where things actually break.
[1]: https:/
Christian Ehrhardt (paelzer) wrote (last edit ): | #14 |
Directly running just the failing test (of the 6 I picked the MTU upper bound that I analyzed above) works like this:
$ apt install ubuntu-dev-tools conntrack curl devscripts dpdk-dev ethtool equivs net-tools ncat python3-pyftpdlib tcpdump wget openvswitch-common openvswitch-doc openvswitch-ipsec openvswitch-pki openvswitch-source openvswitch-switch openvswitch-
# special needs of the OVS tests
$ apt-get remove --yes --purge netcat-openbsd
$ ln -sf /usr/bin/ncat /usr/bin/nc
# Get the source and build it for the tests
$ pull-lp-source openvswitch
$ cd openvswitch-3.3.0/
$ mk-build-deps --install --tool "apt-get -o Debug::
$ ./debian/rules build
# Hugepages for the tests
$ echo "NR_2M_PAGES=784" >> /etc/dpdk/dpdk.conf
$ systemctl restart dpdk
# Set dpdk enabled version in alternatives
$ update-alternatives --set ovs-vswitchd /usr/lib/
# Disable conflicting system services
$ systemctl stop openvswitch-ipsec ovs-vswitchd ovsdb-server
# shorten paths to not break tests
$ BIND_MOUNT_
$ mount --bind . "${BIND_MOUNT_DIR}"
$ cd "${BIND_MOUNT_DIR}"
# Run the full dpdk test suite (to ensure the case you want works with the usual context)
$ ./tests/
# run selected test - in our case 20
$ ./tests/
This way the good test works and can be re-run, setting up a bad case for comparison ...
Christian Ehrhardt (paelzer) wrote : | #15 |
After the setup outlined above I ran the set of tests that the autopkgtest identified (4-5 16-17 20-21). With that I was able to trigger the good and the bad case in a reproducible way.
In my repro environment, test 5 also fails in the good case, which might just be my test environment's fault. Hence I'm comparing 4 16-17 20-21 now and will attach the detailed test logs for them for the good and the bad case (which differ only in the DPDK version used, 23.11 vs 23.11.1).
But the good case leaves no details of the execution in the log, hence I'll also collect and attach the full output of _debian/
Hopefully we find a pattern comparing the testcases one by one in these ...
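When running the selected tests one at a time, for instance to collect a separate log per test, the range notation used above has to be expanded first. A minimal sketch in plain shell; expand_ranges is a hypothetical helper, not part of the OVS test suite.

```shell
# Hypothetical helper: expand "4 16-17 20-21" style arguments into single
# test numbers, e.g. "4 16 17 20 21".
expand_ranges() {
    out=""
    for spec in "$@"; do
        first=${spec%-*}    # part before the dash (or the whole number)
        last=${spec#*-}     # part after the dash (or the whole number)
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out $i"
            i=$((i + 1))
        done
    done
    echo "${out# }"
}
```

A loop such as for t in $(expand_ranges 4 16-17 20-21) can then invoke the test runner once per test number.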
Christian Ehrhardt (paelzer) wrote : | #16 |
Christian Ehrhardt (paelzer) wrote : | #17 |
Christian Ehrhardt (paelzer) wrote : | #18 |
Here are the differences I spot on a high level:
4. OVS-DPDK - ping vhost-user ports
- testpmd.log - not much difference
- ovsdb-server.log
ovsdb-server is killed (sig 15) in the bad case 7 seconds after starting
that kill is after the test log ends, so it could be cleanup
- ovs-vswitchd.log
dpdk|
in addition to all-same messages like the good case, the bad case shows
dpdk|
dpdk|
- system-
differs at the end, good case seems to enter cleanup while bad case does lots of things
bad case has more content as it reports detailed steps
can't cleanup testpmd (pid already gone)
16. OVS-DPDK - MTU increase vport port
17. OVS-DPDK - MTU decrease vport port (same)
20. OVS-DPDK - MTU upper bound vport port (same)
21. OVS-DPDK - MTU lower bound vport port (same)
- testpmd.log
good case sees ports going down and stopping
- ovsdb-server.log
ovsdb-server is killed (sig 15) in the bad case a few seconds after starting
that kill is after the test log ends, so it could be cleanup
- ovs-vswitchd.log
dpdk|
- system-
differs at the end, good case seems to enter cleanup while bad case does lots of things
bad case has more content as it reports detailed steps
can't cleanup testpmd (pid already gone)
So in case 4 it could be the xdp error, which is new.
In the other cases only the telemetry message is odd, but that seemed (see above) like a false positive.
Yet there is a chance that a component I have not looked at yet checks that no ERR messages appear.
Worth a rebuild with this reverted?
Christian Ehrhardt (paelzer) wrote : | #19 |
With what I know from comment #14 and comment #5, I tried to dig deeper to finally find what actually goes wrong, which still eludes me :-/
Env vars that we need:
OVS_RUNDIR=
abs_top_
1. db creation + starting ovsdb with --detach in background + checking if it has started
=> logs in /var/log/
../..
../..
../..
2. initialize and start vswitch in background
=> logs in /var/log/
../../tests/
../../tests/
../../tests/
../../tests/
3. add bridge and port
../../tests/
../../tests/
set Interface dpdkvhostusercl
4. check for log to be as expected
system-dpdk.at:745: waiting until grep "VHOST_CONFIG: ($OVS_RUNDIR/
2024-06-
system-dpdk.at:745: wait succeeded immediately
system-dpdk.at:745: waiting until grep "vHost User device 'dpdkvhostuserc
2024-06-
system-dpdk.at:745: wait succeeded immediately
system-dpdk.at:745: waiting until grep "VHOST_CONFIG: ($OVS_RUNDIR/
2024-06-
system-dpdk.at:745: wait succeeded immediately
=> so far, all good in both cases
5. start testpmd in other console
dpdk-testpmd --in-memory --socket-mem=512 --single-
Only diff so far, the known:
dpdk|
6. set valid MTU
../../tests/
../../tests/
7. set invalid MTU
../../tests/
../../tests/
Christian Ehrhardt (paelzer) wrote (last edit ): | #20 |
Thanks to m4 expansion I had to change it in tests/system-
The real fix would be in ./tests/
Adding the following to the rules made it work:
/TELEMETRY: Socket write base info to client failed/d
Now the question is: "is this a warning that is fine to have in the tests, or is this masking a legit error?"
I'll propose that to openvswitch, putting a set of people on CC that deal with those topics.
In that discussion we can hopefully settle the above question.
P.S. That was a long strange trip. I wish the test would say something clearer like "I found this .... in the log which I didn't expect", but since it does not (at least not clearly enough for me), that was a debugging rabbit hole.
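For readers hitting the same wall: as far as this analysis shows, the check that trips is a scan of the daemon log for unexpected warning or error lines, done after a list of sed delete rules has filtered out the known-harmless ones. A minimal sketch of that idea, assuming the OVS log layout "timestamp|sequence|module|LEVEL|message"; unexpected_log_lines is a hypothetical helper and not the actual macro from the OVS test suite.

```shell
# Hypothetical helper: unexpected_log_lines LOG [SED_RULES]
# Print log lines at WARN, ERR or EMER level that survive the allow-list.
# SED_RULES is an optional sed script of delete rules such as '/pattern/d'.
unexpected_log_lines() {
    log=$1
    rules=${2:-}
    if [ -n "$rules" ]; then
        sed -e "$rules" "$log"
    else
        cat "$log"
    fi | grep -E '\|(WARN|ERR|EMER)\|' || true
}
```

Called without rules, the helper lists every WARN, ERR and EMER line. Passing the rule quoted above hides the telemetry message, which is exactly why it matters whether that rule masks a legit error.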
Christian Ehrhardt (paelzer) wrote : | #21 |
Submitted for upstream discussion: https:/
Christian Ehrhardt (paelzer) wrote : | #22 |
We found a different solution that fixes even more.
Included as:
commit cc99622485062b9
Author: David Marchand <email address hidden>
Date: Thu Jun 6 15:11:11 2024 +0200
system-dpdk: Fix socket conflict when starting testpmd.
Which is in upstream stable update v3.3.1
Frode will as usual prepare v3.3.1 which will then in turn fix the tests.
There is no real issue which would imply needing a versioned dependency.
Christian Ehrhardt (paelzer) wrote : | #23 |
New OVS is in Oracular
https:/
Some tests were already happy in old runs; ppc64 and arm64 have not yet run the way they need to.
I've triggered them with the new DPDK and the new OVS together.
Unless some other, formerly masked, test issue affects that, it should be good now.
Christian Ehrhardt (paelzer) wrote : | #24 |
Fully complete in Oracular now and DPDK/OVS migrated together
https:/
@Frode - will you prepare an OVS 3.3.1 for Noble so we can do the DPDK stable update afterwards?
no longer affects: | dpdk (Ubuntu Noble) |
Changed in dpdk (Ubuntu): | |
status: | New → Invalid |
Changed in openvswitch (Ubuntu): | |
status: | New → Fix Released |
Changed in openvswitch (Ubuntu Noble): | |
status: | New → Confirmed |
assignee: | nobody → Frode Nordahl (fnordahl) |
Oracular OVS bad test with DPDK 23.11.1-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240531_100906_6f957@/log.gz
5047s 151 tests were successful.
5047s 36 tests were skipped.
5047s + update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch/ovs-vswitchd
5047s update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode
5047s /tmp/autopkgtest.62SGkT/build.KmH/src
5047s + dirs +1
5047s + popd
5047s + umount /UZ4
5047s + rmdir /UZ4
5047s + exit 1
Oracular OVS good test with DPDK 23.11-1
https://autopkgtest.ubuntu.com/results/autopkgtest-oracular/oracular/amd64/o/openvswitch/20240529_111623_6f8db@/log.gz
4957s 148 tests were successful.
4957s 49 tests were skipped.
4957s + cleanup
4957s + rc=0
4957s + set +e
4957s + '[' 0 -ne 0 ']'
4957s + '[' -L /usr/bin/nc ']'
4957s + rm -f /usr/bin/nc
4957s + '[' dpdk = dpdk ']'
4957s + mv /etc/dpdk/dpdk.conf.bak /etc/dpdk/dpdk.conf
4957s + systemctl restart dpdk
4957s + update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch/ovs-vswitchd
4957s update-alternatives: using /usr/lib/openvswitch-switch/ovs-vswitchd to provide /usr/sbin/ovs-vswitchd (ovs-vswitchd) in manual mode
4957s /tmp/autopkgtest.vQL7uJ/build.4Ev/src
4957s + dirs +1
4957s + popd
4957s + umount /S31
4957s + rmdir /S31
4957s + exit 0
The arches that passed do not contradict that, because there the dpdk test was skipped:
dpdk SKIP exit status 77 and marked as skippable
Theory: a legit issue with the new DPDK is breaking the OVS test case.
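The two log tails above can be reduced to one comparable line each. A minimal sketch, assuming the autopkgtest prefix "<seconds>s " and the summary wording seen in these logs; summarize_log is a hypothetical helper.

```shell
# Hypothetical helper: summarize_log FILE
# Print the last reported success count, skip count and exit status.
summarize_log() {
    ok=$(sed -n 's/^[0-9]*s \([0-9][0-9]*\) tests were successful\..*/\1/p' "$1" | tail -n 1)
    skipped=$(sed -n 's/^[0-9]*s \([0-9][0-9]*\) tests were skipped\..*/\1/p' "$1" | tail -n 1)
    rc=$(sed -n 's/^[0-9]*s + exit \([0-9][0-9]*\)$/\1/p' "$1" | tail -n 1)
    echo "successful=$ok skipped=$skipped exit=$rc"
}
```

Because the test runs several suites, only the last summary in the log is reported; a per-suite breakdown would need the suite names as well.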