Acer Aspire A315 IOAPIC failure on Ubuntu 18.04, kernel hangs, can't load, kernel freeze (AMD Ryzen 5/Radeon/Raven) / AMDGPU Hybrid crash

Bug #1776563 reported by Richard Baka on 2018-06-12
Affects Status Importance Assigned to Milestone
Linux
Incomplete
Medium
amd
Undecided
Unassigned
linux (Ubuntu)
Medium
Unassigned
linux-firmware (Ubuntu)
Undecided
Unassigned

Bug Description

CPU: Ryzen 5 2500U
VGA: Radeon 535
Notebook: Acer Aspire A315

This is a brand new notebook on the market with Ryzen 5/Radeon.
The default Ubuntu 18.04 kernel hangs during boot with this message:

tsc: Refined TSC clocksource calibration: 1996.250 MHz
clocksource: tsc: mask: 0xffffffffffffffff max_cycles: (...), max_idle_ns: (...)
Soft lockup

With the pci=noacpi kernel parameter the kernel boots without any problem, but my notebook produces more heat than on Win10. As far as I know, Acer notebooks need ACPI for correct power management.

The same thing happens on mainline 4.17 and 4.18-rc1/rc2.
Upgrading the BIOS to the latest version (1.08) hasn't helped.

This problem has been reported upstream:
https://bugzilla.kernel.org/show_bug.cgi?id=200087

The last correctly working kernel was 4.13.*, but the heat problem was present with it too.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1776563

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic

apport-collect 1776563 can't be run because the kernel does not boot.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
summary: - Acer Aspire A315 (Ryzen5/Radeon/FHD) Ubuntu 18.04 kernel cant load
+ Ubuntu 18.04 kernel can't load kernel on Acer Aspire A315
+ (Ryzen5/Radeon/FHD)
summary: - Ubuntu 18.04 kernel can't load kernel on Acer Aspire A315
- (Ryzen5/Radeon/FHD)
+ Ubuntu 18.04 can't load kernel on Acer Aspire A315 (Ryzen5/Radeon/FHD)
no longer affects: bugzilla (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xserver-xorg-video-amdgpu (Ubuntu):
status: New → Confirmed
Freihut (freihut) wrote :

Had this on my A315 too, but I returned it to the vendor. Seems to be a UEFI bug, because it didn't happen with my Ryzen 2500U from HP. Could also be related to that Ryzen/Radeon 535 combination (Vega/GCN 3).

In the GRUB menu press E and add "pci=noacpi" as a kernel parameter (on the line where "quiet splash" normally is). Then continue booting by pressing F10.
Sometimes (XFCE) it was also necessary to add "nomodeset" to boot; GNOME, for example, didn't need it (AFAIK).
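
To confirm that a parameter typed at the GRUB prompt actually reached the kernel, the running command line can be read from /proc/cmdline. A minimal sketch; the helper name has_param is made up for this example:

```shell
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
# The quoted expansions are matched literally, so brackets in names such as
# ivrs_ioapic[4]=... are not treated as glob patterns.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running Linux system the kernel command line is exposed here.
if [ -r /proc/cmdline ]; then
    if has_param "$(cat /proc/cmdline)" "pci=noacpi"; then
        echo "pci=noacpi is active"
    else
        echo "pci=noacpi is not active"
    fi
fi
```

The same check works for "nomodeset" or any other parameter mentioned in this report.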

I remember I also needed to install AMD's pro driver (for 18.04) via amdgpu-pro-install to get rid of the "nomodeset". I was able to run amdgpu-pro-uninstall later and still didn't need "nomodeset". Could be specific to my system, but you may give it a try.
I was also using Kernel 4.17 (Mainline), which is available on http://kernel.ubuntu.com/~kernel-ppa/mainline/ or with UKUU https://www.omgubuntu.co.uk/2017/02/ukuu-easy-way-to-install-mainline-kernel-ubuntu

Richard Baka (bakarichard91) wrote :

Thanks Freihut, I will try this.

Richard Baka (bakarichard91) wrote :

It works but very slow. This could be an ACPI problem.

Richard Baka (bakarichard91) wrote :

I installed the new amdgpu pro driver and everything is very fast now. This bug should be reported to freedesktop, would you like somebody to do it? :D

Richard Baka (bakarichard91) wrote :

*Sorry correction: Who would like to do it? :D

Richard Baka (bakarichard91) wrote :

"The fact that ACPI was designed by a group of monkeys high on LSD, and is some of the worst designs in the industry obviously makes running it at any point pretty damn ugly."
Torvalds, Linus (2005-07-31). Message. linux-kernel mailing list. IU. Retrieved on 2006-08-28.

Richard Baka (bakarichard91) wrote :

Power management doesn't work well this way. The notebook ran a little hot. I've changed back to Win10. This should be fixed by kernel developers or with a downstream patch.

Created attachment 276583
dmesg after starting kernel with pci=noacpi

This is a brand new notebook on the market with Ryzen 5/Radeon. With ACPI disabled the kernel boots without any problem, but my notebook produces more heat than on Win10. The same heat also appears when the notebook is left on the BIOS screen for a while.

CPU: AMD Ryzen 5 2500U
GPU1: AMD Radeon Vega 8
GPU2: AMD Radeon 535

(I wrote to Acer to fix their bios problems but they said Linux is not supported. I don't think they are right but what can I do?)

Created attachment 276587
Soft lockup failure without noacpi

Nothing changes with the IOMMU disabled.

Created attachment 276589
dmesg after amd_iommu_dump=1

[ 0.000000] AMD-Vi: Using IVHD type 0x11
[ 0.000000] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: b0 info 0000
[ 0.000000] AMD-Vi: mmio-addr: 00000000fd900000
[ 0.000000] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:01.0 flags: 00
[ 0.000000] AMD-Vi: DEV_RANGE_END devid: ff:1f.6
[ 0.000000] AMD-Vi: DEV_ALIAS_RANGE devid: ff:00.0 flags: 00 devid_to: 00:14.4
[ 0.000000] AMD-Vi: DEV_RANGE_END devid: ff:1f.7
[ 0.000000] AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0
[ 0.000000] AMD-Vi: DEV_SPECIAL(IOAPIC[33]) devid: 00:14.0
[ 0.000000] AMD-Vi: DEV_SPECIAL(IOAPIC[34]) devid: 00:00.1
[ 0.000000] [Firmware Bug]: AMD-Vi: No southbridge IOAPIC found

no longer affects: xserver-xorg-video-amdgpu (Ubuntu)

Created attachment 276591
Error message before freezing (without quiet splash)

Please try booting with linux 4.18-rc1 or later. Also, please try 4.18-rc1+ with/without ACPI

Hi Erik,

Absolutely the same thing on 4.18rc1 and on rc2 too.

Fedora boots without any additional parameters (mysterious).

[ 0.000000] Switched APIC routing to physical flat.
[ 0.002000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.007000] tsc: Fast TSC calibration using PIT
[ 0.008000] tsc: Detected 1996.299 MHz processor
[ 0.008000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398d0c7513b, max_idle_ns: 881590744042 ns
[ 0.008000] Calibrating delay loop (skipped), value calculated using timer frequency.. 3992.59 BogoMIPS (lpj=1996299)

The heat problem may still be present, but I can't measure it because there are no temperature values in "sensors" (there are 5 values in Win10).

Created attachment 277069
Fedora loads without noacpi

summary: - Ubuntu 18.04 can't load kernel on Acer Aspire A315 (Ryzen5/Radeon/FHD)
+ Acer Aspire A315 ACPI failure on Ubuntu 18.04 (Ryzen5/Radeon/FHD)
summary: - Acer Aspire A315 ACPI failure on Ubuntu 18.04 (Ryzen5/Radeon/FHD)
+ Acer Aspire A315 ACPI failure on Ubuntu 18.04 (Ryzen5/Radeon)
summary: - Acer Aspire A315 ACPI failure on Ubuntu 18.04 (Ryzen5/Radeon)
+ Acer Aspire A315 ACPI failure on Ubuntu, kernel hangs, can't load 18.04
+ (Ryzen5/Radeon)
summary: - Acer Aspire A315 ACPI failure on Ubuntu, kernel hangs, can't load 18.04
+ Acer Aspire A315 ACPI failure on Ubuntu 18.04, kernel hangs, can't load
(Ryzen5/Radeon)
description: updated
summary: Acer Aspire A315 ACPI failure on Ubuntu 18.04, kernel hangs, can't load
- (Ryzen5/Radeon)
+ (AMD Ryzen 5/Radeon/Raven)
summary: - Acer Aspire A315 ACPI failure on Ubuntu 18.04, kernel hangs, can't load
- (AMD Ryzen 5/Radeon/Raven)
+ Acer Aspire A315 ACPI failure on Ubuntu 18.04, kernel hangs, can't load,
+ kernel freeze (AMD Ryzen 5/Radeon/Raven)

Erik, I think this is in connection with clocksource calibration but I'm not an expert.

This works:
[ 0.007000] tsc: Fast TSC calibration using PIT
[ 0.008000] tsc: Detected 1996.299 MHz processor
[ 0.008000] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398d0c7513b, max_idle_ns: 881590744042 ns

This doesn't:
[...] tsc: Refined tsc clocksource calibration: ...
[...] clocksource: tsc: mask: 0xfff...f (...)

Changed in linux:
importance: Unknown → Medium
status: Unknown → Incomplete

Hi, I tried other kernel parameters and noapic seems to work. There is no need to disable ACPI as a whole; however, I don't know how important the APIC is. On kernel 4.18 even the temperature sensors appear.
Power management is almost perfect if the CPU governor is set to powersave.

However, amdgpu now crashes, so the kernel doesn't start without nomodeset. Could this be an ACPI problem, or should I ask the kernel firmware developers?

Hi,
amdgpu doesn't crash on my a315-41g-r40x (BIOS V1.08) with
  linux-next-next-20180713 compiled with VGA_SWITCHEROO=N
and with
  kernel parameters: ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2

gg71, where have you been till now? :D
Thanks, I will try it.

gg71, it works almost perfectly, thanks again. I have been working on this for about a month. Please write me a mail if you have any new info.

The solution for Acer A315-41G-* notebooks: (USE AT YOUR OWN RISK - PLS be very careful)

1. Load kernel with these parameters: ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2 nomodeset
This is how it can be done (1. answer/first half 1-4): https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter

1/b. If Ubuntu is not installed yet, install it and boot the installed kernel again with the same parameters (see 1.)

2. Start a terminal and do these steps:
> cd ~
> mkdir kernelbuild
> cd kernelbuild
> wget -c https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.6.tar.xz
> tar -xvf linux-4.17.6.tar.xz
> cd linux-4.17.6
> sudo apt install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache bison flex
> make menuconfig
+> Save,OK,EXIT
> nano .config
+> ctrl+w and search for CONFIG_VGA_SWITCHEROO=y
+> replace y with n (this is not ideal and should be fixed later)
+> ctrl+o, enter
> make -j4 (this will take a while, be patient)
> sudo make modules_install
> sudo make install
> sudo nano /etc/default/grub
+> Edit the correct line and add the parameters: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2"
+> ctrl+o, enter
> sudo update-grub
+> reboot and select the newly built kernel in the GRUB menu
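
The manual nano edit of .config in step 2 can also be scripted. The following is a minimal sketch and not part of the original instructions; disable_kconfig is a hypothetical helper name, and it only rewrites the one matching line, so any dependent options are left for the kernel's own "make olddefconfig" to resolve.

```shell
# disable_kconfig FILE OPTION: turn a "CONFIG_OPTION=..." line in FILE into
# the "# CONFIG_OPTION is not set" form that Kconfig itself writes.
# All other lines are copied unchanged.
disable_kconfig() {
    sed "s/^CONFIG_$2=.*/# CONFIG_$2 is not set/" "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}
```

Inside the source tree this would be called as "disable_kconfig .config VGA_SWITCHEROO", followed by "make olddefconfig" before building. The kernel tree also ships scripts/config, whose --disable option does the same job.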

If you install xsensors (sudo apt install xsensors) and start it (xsensors) you can monitor the temperature values of your notebook. (Recommended)

Richard Baka (bakarichard91) wrote :

Dear Ubuntu maintainers,

couldn't this be fixed by an Ubuntu kernel patch? The hardest part is disabling GPU switching at kernel load time. I think the APIC-fixing parameters could be hardcoded for these models, or a smart script could search for the correct PCI controller.

This was a hell of an investigation, never again. Thanks to gg71, he/she is a lifesaver.

Hi Richard:

This issue is probably related to the buggy IVRS table in the BIOS.
The kernel panics when it finds no southbridge device ID.

Could you try booting the kernel with "amd_iommu_dump=1 amd_iommu=off" (and remove the other kernel parameters you tried for this issue)?

If it works, please attach the dmesg here.
I will try to make a kernel patch so that the kernel boots with IRQ remapping disabled instead of panicking.

Richard Baka (bakarichard91) wrote :

Hi AaronMa,

thanks for the response. I tried it but it didn't work. I don't think the IOMMU problem is the main reason for the kernel hang. Besides, the IOMMU can be disabled in the BIOS and nothing changes.

The main reason, as you can see in this picture (https://bugzilla.kernel.org/attachment.cgi?id=276587), is that IOAPIC[4] and IOAPIC[5] are not in the IVRS table, so we have to find the correct PCI controllers using lspci and pass them to the kernel.

Like this:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2"
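
The mismatch described above can be checked from a saved boot log. The sketch below assumes the log was captured with amd_iommu_dump=1 and that the kernel reports its IOAPICs in lines of the form "IOAPIC[0]: apic_id 4, ..."; that line format and the function names are assumptions of this example, not taken from the attachments. It lists the APIC IDs the kernel registered that have no DEV_SPECIAL(IOAPIC[...]) entry in the IVRS dump. Which PCI device belongs to each missing ID (00:14.0 and 00:00.2 here) still has to be worked out by hand.

```shell
# ivrs_ioapic_ids: read a boot log on stdin and print "ID BDF" for every
# DEV_SPECIAL(IOAPIC[ID]) entry of the IVRS dump.
ivrs_ioapic_ids() {
    sed -n 's/.*DEV_SPECIAL(IOAPIC\[\([0-9]*\)]).*devid: *\([0-9a-f:.]*\).*/\1 \2/p'
}

# kernel_ioapic_ids: read a boot log on stdin and print the apic_id of every
# IOAPIC the kernel registered (assumed format: "IOAPIC[n]: apic_id ID, ...").
kernel_ioapic_ids() {
    sed -n 's/.*IOAPIC\[[0-9]*]: apic_id \([0-9]*\),.*/\1/p'
}

# missing_ioapic_ids LOGFILE: print each registered APIC ID that the IVRS
# table does not describe, one per line.
missing_ioapic_ids() {
    ivrs=$(ivrs_ioapic_ids < "$1" | cut -d' ' -f1)
    for id in $(kernel_ioapic_ids < "$1"); do
        case " $(echo $ivrs) " in
            *" $id "*) ;;           # described by the IVRS table
            *) echo "$id" ;;        # candidate for an ivrs_ioapic[ID]= override
        esac
    done
}
```

With a log saved as dmesg.txt, "missing_ioapic_ids dmesg.txt" prints the IDs that would need an ivrs_ioapic[ID]=bus:dev.fn override.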

The kernel can be started even with noapic, but two sensors will be missing and the advanced touchpad functions will not work. This is the reason for the CONFIG_VGA_SWITCHEROO=n compile-time kernel option.

There is another problem: this notebook has two GPUs and amdgpu (or the kernel, I don't know) cannot handle this correctly, so GPU switching has to be disabled.

Richard Baka (bakarichard91) wrote :

Correction to my previous comment: the sentence about CONFIG_VGA_SWITCHEROO=n was in the wrong place. It does not belong after the noapic paragraph but after the paragraph about the two GPUs:

The kernel can be started even with noapic, but two sensors will be missing and the advanced touchpad functions will not work.

There is another problem: this notebook has two GPUs and amdgpu (or the kernel, I don't know) cannot handle this correctly, so GPU switching has to be disabled. This is the reason for the CONFIG_VGA_SWITCHEROO=n compile-time kernel option.

Richard Baka (bakarichard91) wrote :

AaronMa,

This is the iommu debug:

[ 0.000000] AMD-Vi: Using IVHD type 0x11
[ 0.000000] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: b0 info 0000
[ 0.000000] AMD-Vi: mmio-addr: 00000000fd900000
[ 0.000000] AMD-Vi: DEV_SELECT_RANGE_START devid: 00:01.0 flags: 00
[ 0.000000] AMD-Vi: DEV_RANGE_END devid: ff:1f.6
[ 0.000000] AMD-Vi: DEV_ALIAS_RANGE devid: ff:00.0 flags: 00 devid_to: 00:14.4
[ 0.000000] AMD-Vi: DEV_RANGE_END devid: ff:1f.7
[ 0.000000] AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0
[ 0.000000] AMD-Vi: DEV_SPECIAL(IOAPIC[33]) devid: 00:14.0
[ 0.000000] AMD-Vi: DEV_SPECIAL(IOAPIC[34]) devid: 00:00.1
[ 0.000000] [Firmware Bug]: AMD-Vi: No southbridge IOAPIC found

I will give you the correct iommu "addresses" after dinner :).

Richard Baka (bakarichard91) wrote :

HOT NEWS!!

CONFIG_VGA_SWITCHEROO=n can be avoided by using these kernel parameters: amdgpu.runpm=0 radeon.modeset=0
Further investigation is in progress...
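
Parameters like these only persist across reboots once they are in GRUB_CMDLINE_LINUX_DEFAULT. As a sketch of doing that without hand-editing, the hypothetical helper add_grub_param below appends one parameter to that line in a given file and does nothing if the parameter is already present. It assumes the value is a single double-quoted string on one line, as in the stock Ubuntu /etc/default/grub.

```shell
# add_grub_param FILE PARAM: append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in
# FILE unless it is already listed. Every other line is copied unchanged.
add_grub_param() {
    file=$1
    param=$2
    tmp="$file.new"
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            GRUB_CMDLINE_LINUX_DEFAULT=\"*\")
                # Strip the variable name and the surrounding quotes.
                value=${line#GRUB_CMDLINE_LINUX_DEFAULT=\"}
                value=${value%\"}
                case " $value " in
                    *" $param "*) ;;                        # already present
                    *) value="${value:+$value }$param" ;;   # append
                esac
                line="GRUB_CMDLINE_LINUX_DEFAULT=\"$value\""
                ;;
        esac
        printf '%s\n' "$line"
    done < "$file" > "$tmp" && mv "$tmp" "$file"
}
```

It is safer to run this on a copy first (for example "cp /etc/default/grub /tmp/grub" and then "add_grub_param /tmp/grub amdgpu.runpm=0"), review the result, copy it back with sudo and finish with "sudo update-grub".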

Richard Baka (bakarichard91) wrote :

This could be the better solution because the notebook heats up the least this way, but I'm not sure.

Richard Baka (bakarichard91) wrote :

Hi all,

After a bit of testing, power management seems to be better, but it is far from perfect. I don't see any anomaly in the temperature sensors (instead of ath10k_hwmon-pci(?!??)), but my notebook is definitely warm if I hold it on my lap.
This is much better on Win10, I don't know why.

mosomaci@pc:~$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +55.0°C (high = +70.0°C)
Tctl: +55.0°C

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx: +0.81 V
fan1: N/A
temp1: +50.0°C (crit = +104000.0°C, hyst = -273.1°C)
power1: 1.13 kW (cap = 28.00 W)

ath10k_hwmon-pci-0300
Adapter: PCI adapter
temp1: +91.0°C

amdgpu-pci-0400
Adapter: PCI adapter
vddgfx: N/A
vddnb: N/A
fan1: N/A
temp1: +55.0°C (crit = +80.0°C, hyst = +0.0°C)
power1: N/A
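
For logging instead of eyeballing, the hottest reading can be pulled out of this output with a short filter. This is only a sketch: max_temp is a hypothetical helper, and it assumes the temperature labels look like the ones above (tempN, Tdie, Tctl), which differs between machines.

```shell
# max_temp: read `sensors` output on stdin and print the highest temperature
# reading in degrees Celsius. Only the reading itself (second field) is used;
# the limits in parentheses (high/crit/hyst) are ignored.
max_temp() {
    awk '/^(temp[0-9]+|Tdie|Tctl):/ {
             t = $2 + 0              # "+55.0°C" converts to 55
             if (!seen || t > max) { max = t; seen = 1 }
         }
         END { if (seen) print max }'
}
```

Typical use would be "sensors | max_temp", for example once a minute from a loop while comparing kernel parameters.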

Could it be that our APIC fix is not a perfect solution for this problem? I know that the DSDT is totally broken:

[ 0.088280] ACPI: Added _OSI(Module Device)
[ 0.088280] ACPI: Added _OSI(Processor Device)
[ 0.088280] ACPI: Added _OSI(3.0 _SCP Extensions)
[ 0.088280] ACPI: Added _OSI(Processor Aggregator Device)
[ 0.088280] ACPI: Added _OSI(Linux-Dell-Video)
[ 0.092591] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[ 0.100296] ACPI BIOS Error (bug): Failure creating [\_SB.PCI0.LPC0.EC0._Q46], AE_ALREADY_EXISTS (20180531/dswload2-316)
[ 0.100309] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180531/psobject-221)
[ 0.100313] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100321] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.UX**], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100326] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100332] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.M000], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100336] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100343] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.M049], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100347] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100353] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.M280], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100357] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100364] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.M009], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100369] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100372] ACPI Error: Skipping While/If block (20180531/psloop-594)
[ 0.100378] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.LPC0.EC0.M000], AE_NOT_FOUND (20180531/psargs-330)
[ 0.100383] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100390] ACPI Error: Cannot release Mutex [QMUX], not acquired (20180531/exmutex-359)
[ 0.100394] ACPI Error: Ignore error and continue table load (20180531/psobject-604)
[ 0.100402] ACPI BIOS Error (bug): Could not resolve [\_SB.PCI0.GPP2.BCM5], AE_NOT_FOUND (20180531...


Richard Baka (bakarichard91) wrote :

*instead of ath10k_hwmon-pci(?!??) -> except for ath10k_hwmon-pci

Richard Baka (bakarichard91) wrote :

Here is a hiDPI scaling script for Gnome3:

#!/bin/bash
# Set the GNOME UI scale to 2x, then scale the panel output by 1.6 (3072x1728 virtual size).
gsettings set org.gnome.desktop.interface scaling-factor 2
sleep 1
xrandr --output eDP --scale 1.6x1.6 --panning 3072x1728

Richard Baka (bakarichard91) wrote :

Dear Ubuntu Maintainers,

here is the summary:

1. Kernel freeze can be resolved by using the mentioned kernel parameters:
> ivrs_ioapic[4]=00:14.0 ivrs_ioapic[5]=00:00.2

It would be best if the broken DSDT tables were fixed, but I think nobody will do it.
The workaround with the parameters seems to be a correct solution.

2. For the amdgpu crash there is a patch that works correctly. It will be merged upstream after testing.
https://bugzilla.kernel.org/show_bug.cgi?id=200517

Patch: https://bugzilla.kernel.org/attachment.cgi?id=277375&action=diff&collapsed=&headers=1&format=raw

summary: - Acer Aspire A315 ACPI failure on Ubuntu 18.04, kernel hangs, can't load,
- kernel freeze (AMD Ryzen 5/Radeon/Raven)
+ Acer Aspire A315 IOAPIC failure on Ubuntu 18.04, kernel hangs, can't
+ load, kernel freeze (AMD Ryzen 5/Radeon/Raven) / AMDGPU Hybrid crash
Richard Baka (bakarichard91) wrote :

---
 drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c | 1 +
 1 file changed, 1 insertion(+)
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
@@ -575,6 +575,7 @@ static const struct amdgpu_px_quirk amdgpu_px_quirk_list[] = {
  { 0x1002, 0x6900, 0x1002, 0x0124, AMDGPU_PX_QUIRK_FORCE_ATPX },
  { 0x1002, 0x6900, 0x1028, 0x0812, AMDGPU_PX_QUIRK_FORCE_ATPX },
  { 0x1002, 0x6900, 0x1028, 0x0813, AMDGPU_PX_QUIRK_FORCE_ATPX },
+ { 0x1002, 0x6900, 0x1025, 0x125A, AMDGPU_PX_QUIRK_FORCE_ATPX },
  { 0, 0, 0, 0, 0 },
 };

--

Richard Baka (bakarichard91) wrote :
tags: added: patch
Kai-Heng Feng (kaihengfeng) wrote :

Please send that patch to <email address hidden>

Richard Baka (bakarichard91) wrote :

Hi Kai-Heng Feng,

I received the patch from Alex Deucher. Is it really necessary to send it to that address? He said:

"Assuming it fixes the issue, I'll go ahead and apply it to upstream and stable kernels."

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Kai-Heng Feng (kaihengfeng) wrote :

Right, then let's wait for the commit to land in mainline.
