Comment 10 for bug 1723127

Dan Streetman (ddstreet) wrote :

> I'd say we go with option #2. Please provide information on how to proceed, and how to
> undo any changes we test :)

OK, so first: these instructions may cause the card to hang, in which case the system may need to be rebooted or the driver reloaded. Any changes made here can be undone by resetting the card, i.e. by rebooting or reloading the driver.

Also please note these instructions are ONLY FOR i40e NICs!

The process here is to clear all the nic's hardware asserts, and then enable each of them one-by-one and try to reproduce the MDD event. That way, when it reproduces, we know exactly which hw assert triggered it.

First, find your nic's pci address, e.g. ethtool -i NIC | grep bus-info

Then (as root) cd to "/sys/kernel/debug/i40e/BUSID" (replace BUSID with your nic's actual pci addr). You should see a "command" file there.
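As a rough sketch of the two steps above, the helper below pulls the bus-info value out of `ethtool -i` output and builds the debugfs path from it. The function names and the interface name "eth0" are placeholders of mine, not anything the driver provides, and this assumes debugfs is mounted at /sys/kernel/debug.

```shell
# Reads `ethtool -i` output on stdin and prints the bus-info value
# (the NIC's PCI address).
bus_info_from_ethtool() {
    awk '$1 == "bus-info:" { print $2 }'
}

# Prints the i40e debugfs directory for a given PCI address ($1).
# Assumes debugfs is mounted at /sys/kernel/debug.
i40e_debugfs_dir() {
    printf '/sys/kernel/debug/i40e/%s\n' "$1"
}
```

Possible usage, as root, with "eth0" replaced by your actual interface: `cd "$(i40e_debugfs_dir "$(ethtool -i eth0 | bus_info_from_ethtool)")"`.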

Now zero out the registers:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command

Then, set a single bit; starting with 0x1 on the first register:

$ echo write 0xe648c 0x1 > command

Do normal testing. There are 3 possibilities at this step:

a) you test long enough to be sure the problem was avoided
b) your system and/or nic hangs due to an "uncaught" MDD event
c) you reproduce the problem, and see the TX error and PF reset

For either (a) or (b), that means this bit isn't the one we're looking for, so move to the next bit:

$ echo write 0xe648c 0 > command
$ echo write 0x442f4 0 > command
$ echo write 0xe648c 0x2 > command

Then retest. Replace "0x2" with the next bit each time you test. Note that this sets individual bits, so the sequence to test is (in hex) 1, 2, 4, 8, 10, 20, 40, 80, 100, etc. This is a 32-bit register, so the highest bit to test is 0x80000000. If you test all bits in register 0xe648c without reproducing the problem, move on to register 0x442f4 and test it bit by bit, again starting at 0x1. According to what I've been told by Intel, you should be able to reproduce the problem with one of the bits set in one of these two registers.
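To avoid typos while stepping through the bits, something like the following sketch may help. The function names are made up for illustration; the first one just prints the 32 single-bit values in order, and the second repeats the clear-then-set sequence shown above against a given "command" file.

```shell
# Print the 32 single-bit test values in order: 0x1, 0x2, 0x4, ... 0x80000000.
bit_values() {
    i=0
    while [ "$i" -lt 32 ]; do
        printf '0x%x\n' $((1 << i))
        i=$((i + 1))
    done
}

# Clear both registers, then set one bit in one register.
# $1 = path to the debugfs "command" file
# $2 = register to test (0xe648c or 0x442f4)
# $3 = bit value to set (one line of bit_values output)
set_single_bit() {
    echo "write 0xe648c 0" > "$1"
    echo "write 0x442f4 0" > "$1"
    echo "write $2 $3" > "$1"
}
```

For example, from inside the debugfs directory, `set_single_bit command 0xe648c 0x4` would set up the third test. It is deliberately not a loop over all bits, since each bit needs its own round of normal testing before moving on.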

As you set each bit, you should get output in your dmesg and/or syslog or kern.log, indicating the current value of the registers, e.g.:

write: 0xe648c = 0x1

You can also manually read the registers at any time with:

$ echo read 0xe648c > command
$ echo read 0x442f4 > command

You should see the results in dmesg/logs, e.g.:

read: 0xe648c = 0x1
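If the logs are noisy, a small filter along these lines can pick out just the register messages. This is only a sketch: it assumes the messages look like the examples above, possibly with a driver or timestamp prefix in front, and the function name is my own.

```shell
# Keep only lines that contain a register read/write message of the form
# "read: 0x<reg> = 0x<value>" or "write: 0x<reg> = 0x<value>".
filter_reg_lines() {
    grep -E '(read|write): 0x[0-9a-fA-F]+ = 0x[0-9a-fA-F]+'
}
```

Typical usage would be `dmesg | filter_reg_lines | tail`, or the same filter fed from syslog or kern.log.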

Once/if you do reproduce the problem, make note of the values of both registers (i.e. which bit was set), and report that back here. I'll check with Intel to find out what problem that specific bit indicates.

Thanks!