kernel >= 5.13 BUG: kernel NULL pointer dereference

Bug #1953261 reported by Aku
This bug affects 15 people
Affects: acpi-call (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

As described here https://blog.monosoul.dev/2021/10/16/ubuntu-21-10-and-acpi-call-dkms-bug/ and here https://github.com/linrunner/TLP/issues/599 loading acpi_call results in a kernel NULL pointer dereference.

For background and the fix, see:
https://github.com/nix-community/acpi_call/issues/15

As the original acpi_call project seems to have stalled, switching to the nix-community fork might be a good idea.

Aku (waldopepper)
description: updated
affects: tlp-upstream → tlp (Ubuntu)
no longer affects: tlp (Ubuntu)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in acpi-call (Ubuntu):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package acpi-call - 1.2.2-1

---------------
acpi-call (1.2.2-1) unstable; urgency=medium

  * [e4c61df] Remove '-guest' from Vcs-* URLs
  * [022a8aa] Use nix-community repository as new upstream (Closes: #989384)
  * [0776fac] New upstream version 1.2.2 (LP: #1953261, #1901452)
  * [6ddc66d] Remove all patches (already in new upstream)
  * [3eb35a3] Bump Standards-Version to 4.6.0
  * [2808ee6] Bump debhelper compat to 13
  * [2ea6bde] Drop unused lintian overrides
      - testsuite-autopkgtest-missing
      - spelling-error-in-changelog

 -- Raphaël Halimi <email address hidden> Wed, 02 Feb 2022 17:07:02 +0100

Changed in acpi-call (Ubuntu):
status: Confirmed → Fix Released
Cédric Cabessa (cedc) wrote :

Can it be backported to the latest LTS?

Using `tlp` to control the laptop charge triggers a reboot every time. I believe this is quite critical.

Thanks

Anton Stötzer (sttzr) wrote :

My brothers and I each own a ThinkPad x131e and were running Ubuntu 20.04 with tlp very smoothly. Then, after the kernel update to 5.13 three weeks ago, each of us encountered the error `Error: The Non-Volatile Variable Storage is About Full` on boot. At first the error could be bypassed by entering BIOS setup and saving the settings. But after a few restarts the machine got stuck at the error message, unable to enter BIOS setup or to boot. Unlike newer ThinkPads (e.g. X1), this model does not provide a firmware option to clear the NVRAM storage when it is full. Two of us could not resolve the issue before this happened, so our ThinkPads are now BRICKED!!!
At least one of us has been able to rescue his laptop so far by deleting the dump-* files in /sys/firmware/efi/efivars and not restarting the laptop, only using standby. We then found the fix at https://gist.github.com/roadkell/9e98db6656e28fbbf1bf51082040f67f, and manually installing the acpi-call module 1.2.2 from nix-community made the error on boot go away. So big thanks to them!

I know this bug is not the fault of the Ubuntu team, but nevertheless it's quite unacceptable that an update destroys our hardware. This hit me in the middle of my exams and a project I had to finish. I know some people would say this laptop model is quite old, buy a new one. But I chose Ubuntu because of sustainability, because it doesn't force you to buy new hardware every two years. My brothers and I had our ThinkPad x131e laptops perfectly up to speed with modern SSDs and RAM, doing audio editing and Blender animations. And now they just don't boot anymore because the NVRAM chip is full.

So I suggest getting the acpi-call package version 1.2.2 into every release of Ubuntu as soon as possible so no more hardware gets destroyed!

I hope I will find someone who can somehow physically reset the chip on the motherboards that are now unbootable. Removing the CMOS battery did not clear the NVRAM. Any thoughts on this? Any help appreciated!

Raphaël Halimi (raph) wrote :

Bricked how?

Can you still access BIOS setup and/or boot from a USB key?

Anton Stötzer (sttzr) wrote :

Sadly no. When turning the laptop on, the info screen with the BIOS version etc. is visible for two seconds. Then the error message appears: "Error: The non-volatile variable storage is about full. Press F1 to enter Setup." At first it was possible to enter BIOS setup and then exit the setup to continue booting. But after two restarts, pressing F1 just doesn't work anymore. The error screen just stays. The only thing I can do is power down and start up again to arrive back at the error screen. I removed the CMOS battery for half an hour. The only thing that changed is that the error page now additionally shows the line "0271: Real Time Clock Error – Check Date and Time settings." above the line with the non-volatile storage error. Still no way to enter BIOS setup or boot from USB. As mentioned above, this happened to two different ThinkPad x131e laptops in different cities, mine and my brother's. My second brother could be warned in time to apply the fix from nix-community while booting was still possible.

Anton Stötzer (sttzr) wrote :

I could imagine that this error potentially also affects other Thinkpad models. Of course if they have a larger NVRAM it will take longer for the dump files to fill it up. Maybe even weeks if only a few dump files are generated with each restart. At least some newer models like the X1 Yoga seem to have a firmware option to clear the storage if full: See Hardware Maintenance Manual page 45 https://download.lenovo.com/pccbbs/mobiles_pdf/tp_x1_carbon-yoga_hmm_en.pdf#page=45&zoom=auto,-195,488

Raphaël Halimi (raph) wrote :

Did you try to press F1 **before** the message appears?

I'm really surprised that Lenovo didn't foresee any method to access BIOS when the NVRAM is full.

Anton Stötzer (sttzr) wrote :

Thanks for your reply raph!
Yes I tried pressing F1 repeatedly after pressing the power button. I also tried using an external USB keyboard. Also tried F12 for boot menu. But the error seems to block startup before it checks if F1/F12 are pressed. As described earlier, the first two/three times or so the error appeared it was actually possible to enter bios setup with F1. Sadly I didn't find the cause of the error soon enough. So now the NVRAM seems to be so full, it can't even enter BIOS or listen to keys pressed :(

I brought my laptop to a local repair store. They couldn't fix it themselves but will send it to a store in Berlin that specializes in motherboard repairs and firmware flashing. If I understood correctly, clearing the NVRAM completely could also make the laptop unbootable. I hope they can somehow clear and reinstall the original firmware image using a special hardware interface. I will report any progress. If you have any other ideas, I can do tests with my brother's broken laptop. Thanks a lot for your interest!

Raphaël Halimi (raph) wrote :

Sadly, if you can't access BIOS, I have no other idea. My "plan" was to try to enter the BIOS somehow, then disable UEFI boot (to boot plain old BIOS compatibility mode) and boot Linux or a live Linux USB key, and see if you can clear the dump files from there (which may or may not be possible if you boot in BIOS compatibility mode - maybe /sys/firmware/efi is accessible only when booted in UEFI mode, I don't know, I didn't test that).

Being the maintainer of the Debian acpi-call package and having delayed the new version for so long, I feel kind of responsible for your problem. Little did I know that a manufacturer, let alone a well-established one like Lenovo, wouldn't foresee this problem and let their product be bricked by a full NVRAM. I know that the X131e is a low-end model, but still, it hardly makes sense.

Another thing that I don't understand is that there is a failsafe in the Linux kernel that prevents writing to the NVRAM when it's more than 50% full; I don't know why Ubuntu kept filling your NVRAM with these dumps until it was completely full. Did you add "efi_no_storage_paranoia" to the kernel command line in the GRUB configuration? Or maybe you were using an older kernel dating from before this failsafe was implemented?

Anton Stötzer (sttzr) wrote :

Thanks! Unfortunately, it's not possible to switch to legacy boot like this. Also, I think I've read somewhere that efivars can only be accessed from a UEFI boot session.

No, I have not set anything like "efi_no_storage_paranoia". It was a standard Ubuntu 20.04 installation with tlp installed from the official package sources. Same on my two brothers' laptops. From my brother's dpkg.log:

`2022-01-20 09:55:50 install linux-image-5.13.0-27-generic:amd64 <keine> 5.13.0-27.29~20.04.1`

This was when the error message started to appear.

On my laptop, the day before I was locked out, I made a copy of the /sys/firmware/efi/efivars folder. Would that help you? The only problem is that when copying these files using Nautilus, the dump-* files could not be copied. They became empty files, although the originals were about 10-20kB. The other files are 19.4kB. Do you know a way to find out the total available size of the NVRAM?
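
For anyone wanting to see how much space the dump-* entries occupy, here is a minimal Python sketch. It assumes efivarfs is mounted at /sys/firmware/efi/efivars (the path used in this thread); the function name `summarize_efivars` is hypothetical. It only reads file sizes as reported by the filesystem and does not answer the question of total NVRAM capacity, which this directory does not report.

```python
from pathlib import Path


def summarize_efivars(root="/sys/firmware/efi/efivars"):
    """Return (dump_count, dump_bytes, total_bytes) for the files in root.

    Read-only: nothing is modified or deleted.
    """
    dump_count = dump_bytes = total_bytes = 0
    for entry in Path(root).iterdir():
        if not entry.is_file():
            continue
        size = entry.stat().st_size
        total_bytes += size
        # The crash dumps discussed in this thread are named dump-*.
        if entry.name.startswith("dump-"):
            dump_count += 1
            dump_bytes += size
    return dump_count, dump_bytes, total_bytes


if __name__ == "__main__":
    root = Path("/sys/firmware/efi/efivars")  # assumed mount point
    if root.is_dir():
        count, dump_bytes, total = summarize_efivars(root)
        print(f"{count} dump-* files, {dump_bytes} bytes of {total} bytes in use")
    else:
        print("efivarfs not found (not booted in UEFI mode?)")
```

Reading the sizes may require root on some systems, and the sum is only a lower bound on what the firmware actually stores.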

Raphaël Halimi (raph) wrote :

Weird, 5.13 is a fairly recent version, it should have the failsafe.

About the contents of /sys/firmware/efi/efivars, I wouldn't know what to do with it, I'm just the package maintainer, I didn't write the kernel module itself, and I'm not a specialist of NVRAM.

I really hope that the store will be able to bring your computer back to life.

Matthew Bradley (mwbradley) wrote :

Hello,

Thanks for taking the time to update this package with the new upstream. I'm also being bitten by this bug. (T430 thinkpad)

Is there an ETA for when this will land in the regular update channels?

Thanks,
    -Matt

Matthew Bradley (mwbradley) wrote :

(I'm on 20.04, btw)

Matthew Bradley (mwbradley) wrote :

Is this not going to be backported?

It appears to have only been applied to Jammy Jellyfish (22.04) but I'm still experiencing this problem, with null pointer dereferences and everything, on 20.04(.4) after moving to the new 5.13 kernel from 5.11, presumably after a HWE update.

Matthew Bradley (mwbradley) wrote :

Examining things in detail:

Impish Indri (21.10) and Focal Fossa (20.04) have both been upgraded from kernel 5.11 to 5.13, but this package has only been updated for Jammy Jellyfish (22.04). [1] shows acpi-call for Jammy at 1.2.2-1 while the other two remain at 1.1.0-6 and 1.1.0-5 respectively.

[2] shows that Impish has been updated to 5.13 and my own personal machine running Focal Fossa 20.04 has both 5.11 and 5.13 kernels, which were installed automatically as part of normal updates. Consequently, I'm still experiencing the null pointer dereference issues on Focal and I think it's safe to assume anyone using tlp on Impish is still exposed as well.

Could you please backport this change from 22.04 to 21.10 and 20.04 to correct the persisting null pointer dereferences there?

[1] https://launchpad.net/ubuntu/+source/acpi-call

[2] https://packages.ubuntu.com/impish-updates/kernel/linux-image-5.13.0-30-generic
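
The combination described in this report (kernel 5.13 or newer together with an acpi-call older than 1.2.2) can be expressed as a rough check. This is only a heuristic sketch: the function names are hypothetical, the version strings are expected in the forms seen in this thread (e.g. `5.13.0-27-generic`, `1.1.0-6`), and a distribution could ship an older version number with the fix patched in, which this check cannot detect.

```python
import re


def _numeric_prefix(version, parts):
    """Return the first `parts` integer components of a version string."""
    numbers = re.findall(r"\d+", version)[:parts]
    return tuple(int(n) for n in numbers)


def looks_affected(kernel_release, acpi_call_version):
    """Heuristic: kernel >= 5.13 combined with acpi-call < 1.2.2."""
    kernel_is_new = _numeric_prefix(kernel_release, 2) >= (5, 13)
    module_is_old = _numeric_prefix(acpi_call_version, 3) < (1, 2, 2)
    return kernel_is_new and module_is_old


if __name__ == "__main__":
    # Version pairs mentioned in this thread.
    print(looks_affected("5.13.0-27-generic", "1.1.0-5"))
    print(looks_affected("5.13.0-30-generic", "1.2.2-1"))
```

The kernel string can be taken from `uname -r` and the package version from `dpkg -s acpi-call-dkms`.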

DiagonalArg (diagonalarg) wrote :

I've just been hit with this on a Thinkpad W530 running Ubuntu 20.04.4.

After freaking out, I did succeed in finding my way to a reboot by entering and saving the BIOS. Then I immediately removed tlp - which is clearly not a long-term solution. As far as I can tell nothing else is using acpi-call-dkms, so I'm thinking I may remove that too, so as to avoid mistakenly bricking my machine.

If I am to remove the /sys/firmware/efi/efivars/dump-* files, am I to also remove the associated /sys/firmware/efi/vars/dump-* directories? Each of those directories looks like this (choosing just one):

$ ls -l efivars/dump*
-rw-r--r-- 1 root root 644 Feb 27 00:12 efivars/dump-type0-10-1-1645912016-C-cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0

$ ls -l vars/dump*
vars/dump-type0-10-1-1645912016-C-cfc8fc79-be2e-4ddc-97f0-9f98bfe298a0:
total 0
-r-------- 1 root root 4096 Feb 27 00:48 attributes
-r-------- 1 root root 4096 Feb 27 00:48 data
-r-------- 1 root root 4096 Feb 27 00:48 guid
-rw------- 1 root root 4096 Feb 27 00:48 raw_var
-r-------- 1 root root 4096 Feb 27 00:48 size

DiagonalArg (diagonalarg) wrote :

Can someone let us know if we're going to get a backport to 20.04? If so, I'll just go without tlp until then. Otherwise I'll install the newer acpi-call-dkms by hand.

DiagonalArg (diagonalarg) wrote :

I am seeing in the Arch UEFI documentation:

    UEFI Runtime Variables Support (efivarfs filesystem - /sys/firmware/efi/efivars). This option is important as this is required to manipulate UEFI runtime variables using tools like /usr/bin/efibootmgr. The configuration option below has been added in kernel 3.10 and later.

    CONFIG_EFIVAR_FS=y

    UEFI Runtime Variables Support (old efivars sysfs interface - /sys/firmware/efi/vars). This option should be disabled to prevent any potential issues with both efivarfs and sysfs-efivars enabled.

    CONFIG_EFI_VARS=n

Unfortunately, in Ubuntu 20.04, both of these are set to y.
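
A minimal Python sketch for checking these two options on a running system. It assumes the kernel configuration is available as /boot/config-<release>, as is usual on Ubuntu; the function name `read_kernel_options` is hypothetical.

```python
import platform
from pathlib import Path


def read_kernel_options(config_text,
                        names=("CONFIG_EFIVAR_FS", "CONFIG_EFI_VARS")):
    """Return {option: value} for the requested options.

    Options that are commented out or absent are reported as "not set".
    """
    found = {name: "not set" for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key in found:
            found[key] = value
    return found


if __name__ == "__main__":
    # Assumed location of the config for the running kernel.
    config_path = Path(f"/boot/config-{platform.release()}")
    if config_path.is_file():
        for key, value in read_kernel_options(config_path.read_text()).items():
            print(f"{key}={value}")
    else:
        print(f"{config_path} not found")
```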

Is this an error? Should I be reporting this as a kernel bug? And what are the implications? Should we be deleting dump-* in both?

Edit: I tried deleting ../efivars/dump*. Even after a reboot, the ../vars/dump* directories remain. Worse, the ../efivars/dump* files are back!

So, I broke this issue out as a kernel bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1962572

Edit2: I repeated the deletion of ../efivars/dump* and again rebooted. Now those files are gone, along with the ../vars/dump* directories. If I had not removed tlp and acpi-call-dkms before the first reboot, then rebooting may well have filled the NVRAM and bricked the machine!

IT Kaufmann GmbH (itkfm) wrote :

Meanwhile, tlp is partially broken because of this on 20.04 with HWE kernel.

linrunner (linrunner) wrote :

Hi,

Uninstalling TLP is unnecessary; it is not broken either. The best workaround is to uninstall acpi-call-dkms or use version 1.2.2 from the TLP PPA.

The charge thresholds will continue to work without acpi-call, because they have been built into the kernel since 4.19. You only lose tlp recalibrate. But even that will be possible soon (kernel 5.17, TLP 1.5) without acpi-call-dkms.
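
To confirm that the kernel itself exposes the thresholds without acpi-call, the sysfs attributes can be read directly. The sketch below is an illustration under assumptions: the battery is taken to be `BAT0`, the attribute names are those commonly exposed for ThinkPads (the `charge_control_*` pair on newer kernels, the `charge_start`/`charge_stop` pair on older ones), and the function name is hypothetical. It only reads values.

```python
from pathlib import Path

# Assumed attribute names; which ones exist depends on kernel and hardware.
THRESHOLD_FILES = (
    "charge_control_start_threshold",
    "charge_control_end_threshold",
    "charge_start_threshold",
    "charge_stop_threshold",
)


def read_charge_thresholds(battery_dir="/sys/class/power_supply/BAT0"):
    """Return {attribute: percent} for the threshold files that are readable."""
    values = {}
    for name in THRESHOLD_FILES:
        path = Path(battery_dir) / name
        try:
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing, unreadable, or not a number: skip it.
            continue
    return values


if __name__ == "__main__":
    thresholds = read_charge_thresholds()
    if thresholds:
        for name, value in thresholds.items():
            print(f"{name}: {value}%")
    else:
        print("No charge threshold attributes found for BAT0")
```

An empty result suggests that the kernel or hardware does not provide native thresholds under these names.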

Anton Stötzer (sttzr) wrote :

@linrunner: Thanks for explaining that tlp will mostly work without acpi-call-dkms now, and completely without it in the future! The warning from the TLP documentation might also help some people here:

"Warning: On Ubuntu 21.10 and 20.04.4 the acpi-call-dkms packages in the official repositories are incompatible with the provided kernel 5.13 and may cause TLP battery care malfunction, system freezes and reboots. Solution: use acpi-call-dkms version 1.2.2 from the TLP PPA or download from Ubuntu 22.04 and install manually." – https://linrunner.de/tlp/installation/ubuntu.html#thinkpads-only

@raph: This package is still not fixed for Ubuntu 20.04 and could render more laptops unbootable. The bug status "Fix Released" is not true for 20.04. Could someone maybe adjust the bug status? Could you provide some insight into what's holding back the fix for Ubuntu 20.04?

Raphaël Halimi (raph) wrote :

@sttzr:

I'm the Debian package maintainer, I don't maintain the package for Ubuntu, nor do I have Ubuntu upload rights.

I watch the bug reports for Ubuntu packages and fix them "upstream" (Debian is Ubuntu's upstream), but that's it.

If you want to learn more about the relationship between Debian and Ubuntu, please read this:

https://wiki.ubuntu.com/Ubuntu/ForDebianDevelopers

Unfortunately, I don't plan to become an Ubuntu developer or uploader, I don't have time for this.

However, a couple of weeks ago, when I saw @mwbradley's messages and yours, I tried to help by searching for how to request that a package be backported to older Ubuntu releases.

I'm pretty sure that there used to be some wiki page explaining how a "mere" user could request that a particular package be backported, but after spending nearly an hour crawling the Ubuntu wiki, I couldn't find it. All I found was the backports page, which explains that one has to prepare a backported package, open a bug, find a sponsor, etc. I have all my tools to prepare packages for Debian; creating such a set of tools for Ubuntu, even only for two particular releases, would take me hours, let alone opening two bugs, finding a sponsor, requesting the backports, etc. I don't have that time. At that point, I was just fed up, and gave up without even leaving a note here to explain it.

@linrunner was cleverer than me and directly e-mailed a well-known Ubuntu developer who previously patched acpi-call in Ubuntu for this kernel incompatibility when it first appeared. We're still waiting for an answer.

If you want this backport that badly, try to find the procedure to request it; if not, just download version 1.2.2 from Debian unstable or Ubuntu devel, and install it on your machine.

Sorry, but I won't spend any more time on this bug.

Regards,

--
Raphaël

Johannes Rohr (jorohr) wrote :

@sttzr: I also had the problem that at some point, when the message "Error: The non-volatile variable storage is about full. Press F1 to enter Setup." was displayed, the F1 key did not work any more.

What saved my day was connecting an external keyboard (USB) and pressing F1 on it.

I haven't found a way to make the root issue disappear, though, it has stuck with me, and I don't use that TLP thing, I didn't even know about it.

tags: added: focal kernel-bug