Comment 8 for bug 1928393

Thiago Jung Bauermann (thiago-bauermann) wrote : Re: [Bug 1928393] Re: linux-firmware 1.197 causes kernel to report error "amdgpu: [gfxhub0] retry page fault"

Over the weekend I was finally able to revert to the previous versions
of the org.freedesktop.Platform and org.freedesktop.Platform.GL.default
flatpak runtimes. It turns out that the `flatpak history` command wasn't
necessary for the rollback.

On Friday, 14 May 2021, at 13:14:22 -03, Seth Forshee wrote:
> Before we revert we should see if newer firmware fixes the issue, and
> make sure we are only changing the specific firmware files for your
> hardware.
>
> I think your hardware is the "Picasso" series. Can you try the
> following? If you are unsure about any of the following steps, let me
> know and I can provide you with test packages to install instead.
>
> Save all files matching /lib/firmware/amdgpu/picasso* from linux-
> firmware 1.190.5. Reinstall 1.197, then overwrite the picasso firmware
> files with the ones you saved. Reboot, and confirm that the issues you
> see with 1.197 are fixed. If they are not fixed, then there's no need to
> proceed as we haven't found the correct firmware files which are causing
> your issues.

I did exactly that, and I was able to run for 4 days without any retry
page fault error. This makes me confident that the 1.190.5 firmware doesn't
have the bug, and also that the amdgpu/picasso* files are the relevant
ones.
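
For anyone repeating the test, the save-and-overwrite step can be sketched
roughly as below. This is only a minimal Python sketch, not the exact
commands I ran: the helper name `copy_matching` and the backup directory in
the usage note are hypothetical, and it assumes the firmware files are
plain files (or symlinks to files) directly under /lib/firmware/amdgpu.

```python
import glob
import os
import shutil


def copy_matching(src_dir, dst_dir, pattern="picasso*"):
    """Copy every regular file in src_dir whose name matches pattern into dst_dir.

    Returns the sorted list of file names that were copied.
    (Hypothetical helper, written for illustration only.)
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
        # Skip directories; symlinks to files are followed and copied as files.
        if os.path.isfile(path):
            shutil.copy2(path, dst_dir)
            copied.append(os.path.basename(path))
    return copied
```

With 1.190.5 installed, something like
`copy_matching("/lib/firmware/amdgpu", "/root/picasso-1.190.5")` would save
the files. After reinstalling 1.197, the same call with the two directories
swapped (run as root) would put them back, followed by a reboot.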

> Then please download the picasso firmware files from here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
>
> Use the "plain" link next to each file to download the file. Overwrite
> the files in /lib/firmware/amdgpu with these files, reboot, and see if
> you continue to have problems.

I also did that, using the files from commit 55d964905a2b. More recent
commits in that repo didn't touch the amdgpu directory so they're still the
most recent firmware files for my hardware.

Unfortunately I still saw the retry page fault message in dmesg with it,
and very soon after boot (IIRC it happened while running the sddm login
manager, before I logged in). On the bright side, it didn't have any adverse
effect on my computer and I only noticed hours later because I specifically
grepped for it. So perhaps the latest firmware has a less nasty version of
the bug?
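
Since the message can go unnoticed for hours, checking for it explicitly is
the only reliable way to tell whether a given firmware set is affected. A
rough Python equivalent of grepping the kernel log is sketched below. The
function names are hypothetical, the search string is taken from the bug
title, and it assumes `dmesg` is readable by the current user (on some
systems it needs root).

```python
import subprocess

# Substring taken from the bug title; the full kernel message carries more detail.
NEEDLE = "retry page fault"


def find_retry_page_faults(log_text, needle=NEEDLE):
    """Return the lines of log_text that contain needle, in their original order."""
    return [line for line in log_text.splitlines() if needle in line]


def read_dmesg():
    """Return the current kernel log as text (assumes dmesg is permitted)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `find_retry_page_faults(read_dmesg())` returns an empty list when
the message has not appeared since boot.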

And just to double-check the baseline reference, I also ran with pristine
linux-firmware 1.197, the version which made my machine so unstable. I had
a somewhat different experience this time. The bug still happened, but only
after 20h of uptime. And the symptom was "just" a visual glitch while
scrolling inside Firefox, not a complete freeze of the display and
keyboard, as I was experiencing originally. Perhaps if I rebooted and
kept using it I would experience worse effects. But I thought
that was enough to confirm that 1.197 is still bad.

So I'm not sure what to make of all this. I still wasn't able to pinpoint
exactly what triggers the worst manifestation of the bug. But of the three
versions of linux-firmware I used (1.190.5, 1.197 and upstream), 1.197 is
still the one where things are worst, so IMHO the picasso files need to come
from one of the other two versions. 1.190.5 is the rock-solid one, so I
think it's the safest bet. But perhaps the upstream version is not too bad?