Comment 5 for bug 1910562

munbi (gabriele) wrote :

So, after testing several different kernels and live distros to pinpoint this bug, I finally found the problem: it's an interaction between lm-sensors and the amdgpu driver on kernels newer than 5.4.0.

I found out by chance because I noticed the problem happened only after logging in with a graphical session.

This is what is happening:
- a GNOME extension that monitors sensors/temps calls the 'sensors' utility from the lm-sensors package every 10 seconds
- sensors 'hangs' for a couple of seconds when poking something related to the amdgpu driver
- the amdgpu driver spits some warnings/errors on the VT console and in dmesg
- the fans start spinning for one second
- then sensors continues normally, displaying the readouts from the other sensors
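
For anyone who wants to confirm which chip is the slow one without going through 'sensors', here is a rough sketch that times reads of the hwmon sysfs attributes per chip. It assumes the standard /sys/class/hwmon layout; the function name time_hwmon_reads is made up for this example, and the set of attributes it reads (*_input and *_average) is only an approximation of what 'sensors' actually touches.

```python
import glob
import os
import time


def time_hwmon_reads(base="/sys/class/hwmon"):
    """Return {chip_name: seconds spent reading its sensor attributes}.

    Hypothetical helper: chips that share a name (e.g. two USB-C ports)
    have their times added together.
    """
    timings = {}
    for chip_dir in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                name = f.read().strip()
        except OSError:
            name = os.path.basename(chip_dir)

        # Assumption: reading these files is close to what 'sensors' does.
        paths = sorted(
            glob.glob(os.path.join(chip_dir, "*_input"))
            + glob.glob(os.path.join(chip_dir, "*_average"))
        )
        start = time.monotonic()
        for path in paths:
            try:
                with open(path) as f:
                    f.read()
            except OSError:
                pass  # unreadable attributes are what 'sensors' shows as N/A
        elapsed = time.monotonic() - start
        timings[name] = timings.get(name, 0.0) + elapsed
    return timings


if __name__ == "__main__":
    results = time_hwmon_reads()
    for name, seconds in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {seconds:8.3f} s")
```

If the hang described above is in the driver, the amdgpu entry should stand out with a time in the range of seconds while the other chips stay near zero.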

This is the output of 'sensors', taken in a non-graphical console (Ctrl+Alt+F3) with kernel 5.8.0-41:

ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +0.00 A)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 5.00 V (min = +5.00 V, max = +5.00 V)
curr1: 0.00 A (max = +0.00 A)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +55.0°C

BAT0-acpi-0
Adapter: ACPI interface
in0: 8.48 V
curr1: 1000.00 uA

amdgpu-pci-0100
Adapter: PCI adapter
[ 112.780951] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.380939] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
vddgfx: 1.05 V
edge: +44.0°C (crit = +94.0°C, hyst = -273.1°C)
power1: 7.12 W (cap = 35.00 W)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +47.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +45.0°C (high = +100.0°C, crit = +100.0°C)

dell_smm-virtual-0
Adapter: Virtual device
fan1: 2480 RPM
fan2: 2471 RPM

nvme-pci-0200
Adapter: PCI adapter
Composite: +46.9°C (low = -273.1°C, high = +69.8°C)
                       (crit = +79.8°C)
Sensor 1: +46.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +47.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 5: +66.8°C (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +25.0°C (crit = +107.0°C)

This is the complete kernel log from amdgpu when this happens:

[ 111.572873] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 112.780951] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.380939] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.411556] [drm] UVD and UVD ENC initialized successfully.
[ 113.521534] [drm] VCE initialized successfully.

It seems that lm-sensors poking the amdgpu thermal sensor is triggering some sort of reset and/or causing the thermal infrastructure to spin up the fans.
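
To check how the kernel messages line up with the hang, here is a small sketch that pulls the timestamps out of dmesg-style lines. The function names are made up for this example and the sample text is just the log excerpt pasted above; it only assumes the usual "[ seconds] message" dmesg format.

```python
import re

# Matches lines such as "[ 112.780951] message text"
LOG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_log(log_text):
    """Return a list of (timestamp_seconds, message) for dmesg-style lines."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def amdgpu_error_times(log_text):
    """Timestamps of lines that mention amdgpu and are flagged *ERROR*."""
    return [
        ts for ts, msg in parse_log(log_text)
        if "amdgpu" in msg and "*ERROR*" in msg
    ]


def log_span(log_text):
    """Seconds between the first and last parsed line (0.0 if none)."""
    entries = parse_log(log_text)
    if not entries:
        return 0.0
    return entries[-1][0] - entries[0][0]


SAMPLE = """\
[ 111.572873] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 112.780951] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.380939] [drm:dce110_edp_wait_for_hpd_ready [amdgpu]] *ERROR* dce110_edp_wait_for_hpd_ready: wait timed out!
[ 113.411556] [drm] UVD and UVD ENC initialized successfully.
[ 113.521534] [drm] VCE initialized successfully.
"""

if __name__ == "__main__":
    print("error timestamps:", amdgpu_error_times(SAMPLE))
    print(f"whole sequence took {log_span(SAMPLE):.2f} s")
```

On the excerpt above the whole sequence spans a bit under two seconds, which matches the "couple of seconds" that sensors appears to hang.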

Note that this does not happen with kernel 5.4, where sensors reports this:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: N/A
edge: N/A (crit = +94.0°C, hyst = -273.1°C)
power1: N/A (cap = 35.00 W)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1: +54.0°C

BAT0-acpi-0
Adapter: ACPI interface
in0: 8.48 V
curr1: 1000.00 uA

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +44.0°C (high = +100.0°C, crit = +100.0°C)

dell_smm-virtual-0
Adapter: Virtual device
fan1: 0 RPM
fan2: 0 RPM

acpitz-acpi-0
Adapter: ACPI interface
temp1: +25.0°C (crit = +107.0°C)

Note the missing amdgpu readings (all N/A) and the absence of kernel warning messages on the console.

Disabling the GNOME sensor monitoring extension works around the problem for now, but there is definitely something going on here.

Please feel free to ask me for anything I can do/test to help solve this problem.