------- Comment From <email address hidden> 2018-03-06 11:15 EDT-------
(In reply to comment #45)
> Hi, Murilo.
>
> Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the
> noirqdistrib option might be related to the EEH issues.
>
Ok, I'll give it a try.
(In reply to comment #46)
> Looking at the log, I noticed the EEH is frozen right after finding the
> Broadcom card. Is that one the tg3?
>
> [ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
> [ 8.191135] EEH: Frozen PE#7 on PHB#21 detected
> [ 8.191280] EEH: PE location: S00210f, PHB location: N/A
Yes, correct, this is the tg3 device. But the EEH is seen on a different PHB than the one the adapter is on: the adapter is on PHB#01, whereas the EEH is seen on PHB#21.
>
> Also, the recovery problem seems to be caused by ast.
>
> [ 18.267005] EEH: 2100000 reads ignored for recovering device at
> location=S00210f driver=ast pci addr=0021:10:00.0
> [ 18.267334] EEH: Might be infinite loop in ast driver
>
> Looking at the upstream logs, one commit came up. Can you open a new bug for
> it?
>
> commit 298360af3dab45659810fdc51aba0c9f4097e4f6
> Author: Russell Currey <email address hidden>
> Date: Thu Dec 15 16:12:41 2016 +1100
>
> drivers/gpu/drm/ast: Fix infinite loop if read fails
Cascardo, the patch you mentioned is already in this kernel. The changelog for linux-image-4.4.0-116-generic shows:
* Xenial update to v4.4.41 stable release (LP: #1655041)
- drivers/gpu/drm/ast: Fix infinite loop if read fails
Also, ast is not the only device hitting the EEH: with the ast module blacklisted, I still see the EEH hitting the other slots behind the PLX switch.
I was able to collect a full dmesg output by adding the dmesg command to the KDUMP_FAIL_CMD option, but still had no luck getting it to drop to a shell.
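For reference, a minimal sketch (Python, not part of kdump-tools or the kernel) of how the collected dmesg output could be summarized to see which PHBs report frozen PEs. The function name and the sample text are assumptions for illustration only; the pattern is based solely on the "EEH: Frozen PE#... on PHB#... detected" line quoted above, and other EEH message formats are not handled.

```python
import re
from collections import defaultdict

# Matches lines such as: "[ 8.191135] EEH: Frozen PE#7 on PHB#21 detected"
# PE and PHB ids are kept as the strings printed in the log.
FROZEN_RE = re.compile(
    r"EEH: Frozen PE#([0-9a-fA-F]+) on PHB#([0-9a-fA-F]+) detected"
)


def frozen_pes_by_phb(dmesg_text):
    """Return {phb: sorted list of frozen PE ids} found in dmesg_text."""
    found = defaultdict(set)
    for line in dmesg_text.splitlines():
        match = FROZEN_RE.search(line)
        if match:
            pe, phb = match.groups()
            found[phb].add(pe)
    return {phb: sorted(pes) for phb, pes in found.items()}


# Sample taken from the log lines quoted in this comment.
sample = """\
[ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
[ 8.191135] EEH: Frozen PE#7 on PHB#21 detected
[ 8.191280] EEH: PE location: S00210f, PHB location: N/A
"""

print(frozen_pes_by_phb(sample))
```

Running it over the full dmesg, with and without ast blacklisted, would show whether the frozen PEs are all on PHB#21 or spread across other PHBs.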