Comment 12 for bug 1823836

Christian Ehrhardt  (paelzer) wrote :

> ------- Comment From <email address hidden> 2019-04-11 12:22 EDT-------
> Can we agree to limit the discussion to this bug?

As I said the same already: yes.
Thanks for focusing on it again - this calms my former concerns!

> Based on Dave's debug, there doesn't appear to be anything that is
> architecture specific.

That is good - the more architectures that are affected, the more people
will join the discussion.
We still might end up needing to report it to upstream(s).

> The current situation is that we have access to P9, but not x86. On the other hand, you are likely to have access to x86 with CX-5. To narrow down this problem, can you see if this can be recreated on x86?

Of the HW I can tap at the moment, my P9s have no Mellanox network; I
have an x86 box with a CX-4, but none with a CX-5 right now.
Trying on the CX-4 just in case ...

$ lspci
08:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
       Subsystem: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
       Physical Slot: 1
       Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr+ Stepping- SERR+ FastB2B- DisINTx+
       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
       Latency: 0, Cache Line Size: 64 bytes
       Interrupt: pin A routed to IRQ 16
       NUMA node: 0
       Region 0: Memory at 94000000 (64-bit, prefetchable) [size=32M]
       [virtual] Expansion ROM at 92b00000 [disabled] [size=1M]
       Capabilities: <access denied>
       Kernel driver in use: mlx5_core
       Kernel modules: mlx5_core

But the CX-4 probes with the mlx5 driver in DPDK as well.
Superficially one might think I hit the same error (-95 is ENOTSUP):
[...]
EAL: probe driver: 15b3:1015 net_mlx5
net_mlx5: port 0 verbs maximum priority: 8 expected 8/16
net_mlx5: probe of PCI device 0000:08:00.0 aborted after encountering
an error: Unknown error -95
[...]

I have never used this card for DPDK; I just found the box with it.
So the setup might be incomplete - and analysis shows that I'm stuck on
a different issue than you.

I installed the debug symbols from the archive and checked in GDB
which path it takes through mlx5_pci_probe.
It reaches the detection of case #2, then mlx5_dev_spawn throws the message
  net_mlx5: port 0 verbs maximum priority: 8 expected 8/16
and from there takes the path of failing to initialize.

In the initialization function it seems to take a normal route at first
(I would still have expected it to use the mlx4 PMD).
Single-stepping through mlx5_dev_spawn seems normal (it gets quite far).

Close to the end, this call:
  err = mlx5_flow_discover_priorities(eth_dev);
then yields the message we saw in my case:
  net_mlx5: port 0 verbs maximum priority: 8 expected 8/16
This sets err to -95, which eventually fails the probe.

This is interesting, as the code does:
  priority = vprio[i];
[...]
  switch (priority) {
  case 8: ...
  case 16: ...
  default: ...
It reaches default, but then reports vprio[i] as being 8, which means
it should not have reached the default: path.
It is the mlx5_glue->create_flow(drop->qp, &flow_attr.attr); call that
returns a null pointer, because the underlying "ibv_create_flow" fails.

As I said, this card was never used/set up - so the cause might be
anything (e.g. a bad FW level).
The reporting of the error is a bit misleading, as the real failure is
the flow creation, but that was fixed in
commit 4fb27c1d "net/mlx5: fix flow priorities probing error path".

Yep, it should be firmware in my case. I have a
08:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
which per https://doc.dpdk.org/guides/nics/mlx5.html needs at least FW level
   ConnectX-4 Lx: 14.21.1000 and above.
But I have:
$ sudo ibv_devinfo
hca_id: mlx5_0
       transport: InfiniBand (0)
       fw_ver: 14.12.1240
[...]
       board_id: MT_2430110032

You have -95 (ENOTSUP) as well, but not this message, so your case
fails while initializing something else.
That means my setup as-is might not be representative of your case :-/

... a not short holiday in firmware update land later ...

Now running with FW 14.24.1000 and working for me:

$ sudo /usr/bin/dpdk-testpmd -w 0000:08:00.0 -l 0-3 -n 4 -- -i -a
EAL: Detected 12 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: PCI device 0000:08:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1015 net_mlx5
Interactive-mode selected
Auto-start selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456,
size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
[...]

Ok, once my setup was fixed it worked fine.
By any chance, are your FW levels up to date?
For your card that should be:
  ConnectX-5: 16.21.1000 and above.
  ConnectX-5 Ex: 16.21.1000 and above.

You can run ibv_devinfo to get what you currently have.
  $ sudo ibv_devinfo

If it is not your FW, we are back to my suggestion to file this with
the infiniband/dpdk/mellanox folks to get their experience as well.

> The underlying assumption is that all the relevant package versions between x86 and ppc64le are identical. Is that assumption correct?

Yes, the code is the same across architectures.