Comment 20 for bug 1681909

------- Comment From <email address hidden> 2018-03-06 11:15 EDT-------
(In reply to comment #45)
> Hi, Murilo.
>
> Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the
> noirqdistrib option might be related to the EEH issues.
>

Ok, I'll give it a try.

(In reply to comment #46)
> Looking at the log, I noticed the EEH is frozen right after finding the
> Broadcom card. Is that one the tg3?
>
> [ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
> [ 8.191135] EEH: Frozen PE#7 on PHB#21 detected
> [ 8.191280] EEH: PE location: S00210f, PHB location: N/A

Yes, correct, this is the tg3 device. But the EEH is seen on a different PHB than the one the adapter is in: the adapter is on PHB#01, whereas the EEH is seen on PHB#21.

>
> Also, the recovery problem seems to be caused by ast.
>
> [ 18.267005] EEH: 2100000 reads ignored for recovering device at
> location=S00210f driver=ast pci addr=0021:10:00.0
> [ 18.267334] EEH: Might be infinite loop in ast driver
>
> Looking at the upstream logs, one commit came up. Can you open a new bug for
> it?
>
> commit 298360af3dab45659810fdc51aba0c9f4097e4f6
> Author: Russell Currey <email address hidden>
> Date: Thu Dec 15 16:12:41 2016 +1100
>
> drivers/gpu/drm/ast: Fix infinite loop if read fails

Cascardo, the patch you mention is already in this kernel. The changelog for linux-image-4.4.0-116-generic shows:
* Xenial update to v4.4.41 stable release (LP: #1655041)
- drivers/gpu/drm/ast: Fix infinite loop if read fails

Also, ast is not the only device hitting the EEH: when I blacklisted the ast module, I still saw the EEH hitting the other slots behind the PLX switch.

I was able to collect a full dmesg output by adding the dmesg command to the KDUMP_FAIL_CMD option, but I still had no luck getting it to drop to a shell.
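
For anyone trying to reproduce this, the setting looks roughly like the fragment below. This is only a sketch: it assumes the stock /etc/default/kdump-tools file from the kdump-tools package, and the exact command string is illustrative, not copied from the test system.

```shell
# /etc/default/kdump-tools (illustrative fragment, not the exact line used here)
# When saving the vmcore fails: print the kernel log to the console,
# then try to start a shell, and reboot once that shell exits.
KDUMP_FAIL_CMD="dmesg; /bin/sh; reboot -f"
```

With something like this, the dmesg output should at least reach the console even if the shell never comes up, which matches what I am seeing.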