kernel BUG at /build/buildd/linux-3.11.0/fs/buffer.c:1268!; RIP: 0010:[<ffffffff816e3efd>] [<ffffffff816e3efd>] check_irqs_on.part.11+0x4/0x6

Bug #1265841 reported by Maarten Baert on 2014-01-03
This bug affects 4 people
Affects          Importance   Assigned to
eCryptfs         High         Tyler Hicks
linux (Ubuntu)   Medium       Unassigned

Bug Description

This only happens when aesni_intel is loaded.

In my attempts to find an easy way to reproduce this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1265684
I actually found a very simple way to trigger a bug that is similar but not identical. I suspect that both bugs have the same cause: something in the kernel is disabling IRQs, and the ext4 code crashes when this happens. The stack trace for this bug is different from the other one. This one appears to be less severe: the system is still usable after the crash, and only the process that caused the crash hangs (uninterruptible sleep). This bug is 100% reproducible on both Ubuntu 13.10 with kernel 3.11.0 and Arch Linux with kernel 3.12.6.

The steps to reproduce the bug are based on this:
http://www.spinics.net/lists/linux-ext4/msg38949.html

* Set up an ecryptfs 'Private' folder in your home directory.
* In that directory, create a file called 'crashme.c' with the following code in it:
#include <assert.h>
int main() { assert(0); }

* Compile the program:
gcc -Wall crashme.c -o crashme

* Change the core dump pattern so core dumps are saved in the current directory:
echo "coredump-%p" | sudo tee /proc/sys/kernel/core_pattern

* Enable core dumps:
ulimit -c unlimited

* Make sure that you have a second terminal open to run dmesg, because you may not be able to do so later.
* Run 'crashme' - this will hang and trigger the bug:
./crashme
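
For convenience, the steps above could be collected into one script. This is only a sketch: the function names (write_crashme_source, reproduce) are made up for illustration, and the default directory "$HOME/Private" is an assumption about where the ecryptfs folder is mounted. Nothing runs until reproduce is called explicitly.

```shell
#!/bin/sh
# Sketch of the reproduction steps, wrapped in functions so that nothing
# happens until "reproduce" is called by hand.
# Assumption: the ecryptfs directory is mounted at "$HOME/Private".

# write_crashme_source DIR: create crashme.c in DIR.
write_crashme_source() {
    cat > "$1/crashme.c" <<'EOF'
#include <assert.h>
int main() { assert(0); }
EOF
}

# reproduce [DIR]: build and run crashme inside DIR.
reproduce() {
    dir="${1:-$HOME/Private}"
    write_crashme_source "$dir" || return 1
    (
        cd "$dir" || exit 1
        gcc -Wall crashme.c -o crashme || exit 1
        # Save core dumps in the current directory.
        echo "coredump-%p" | sudo tee /proc/sys/kernel/core_pattern
        # Enable core dumps, for this subshell only.
        ulimit -c unlimited
        # On an affected kernel this is expected to hang.
        ./crashme
    )
}
```

Keep a second terminal open for dmesg before calling reproduce, since the calling shell may not come back.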

ProblemType: Bug
DistroRelease: Ubuntu 13.10
Package: linux-image-3.11.0-15-generic 3.11.0-15.23
ProcVersionSignature: Ubuntu 3.11.0-15.23-generic 3.11.10
Uname: Linux 3.11.0-15-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.12.5-0ubuntu2.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: maarten 1666 F.... lxpanel
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Fri Jan 3 15:58:24 2014
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=bc17e234-da75-457f-b17c-22d9c0e27dd8
InstallationDate: Installed on 2013-12-28 (6 days ago)
InstallationMedia: Lubuntu 13.10 "Saucy Salamander" - Release amd64 (20131016.1)
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
MachineType: Gigabyte Technology Co., Ltd. Z87X-D3H
MarkForUpload: True
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.11.0-15-generic.efi.signed root=UUID=5a8ae1fc-91bf-4ce0-8dea-a519976fd56b ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.11.0-15-generic N/A
 linux-backports-modules-3.11.0-15-generic N/A
 linux-firmware 1.116
RfKill:

SourcePackage: linux
StagingDrivers: zram
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/02/2013
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: F7
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: Z87X-D3H-CF
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Gigabyte Technology Co., Ltd.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrF7:bd08/02/2013:svnGigabyteTechnologyCo.,Ltd.:pnZ87X-D3H:pvrTobefilledbyO.E.M.:rvnGigabyteTechnologyCo.,Ltd.:rnZ87X-D3H-CF:rvrx.x:cvnGigabyteTechnologyCo.,Ltd.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.name: Z87X-D3H
dmi.product.version: To be filled by O.E.M.
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Maarten Baert (maarten-baert) wrote :
description: updated
tags: added: latest-bios-f7 needs-upstream-testing regression-potential

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Blacklisting the aesni_intel module appears to fix this crash (but maybe not the other one, I don't know). I will try to get the latest kernel and do more testing.
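
For anyone who wants to check their own machine, here is a rough sketch of how to see whether the module is loaded. The helper name module_loaded is made up for illustration, and the blacklist file name in the trailing comment is an assumption; the comment above does not say how the blacklisting was done.

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if NAME appears in the module list.
# FILE defaults to /proc/modules; the optional argument exists only so the
# function can be tried against a saved copy of the list.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}

# The module list uses underscores, so the name to look for is aesni_intel.
if module_loaded aesni_intel; then
    echo "aesni_intel is loaded"
else
    echo "aesni_intel is not loaded"
fi

# One conventional way to keep it from being auto-loaded is a modprobe.d
# entry, e.g. in /etc/modprobe.d/blacklist-aesni.conf (hypothetical name):
#   blacklist aesni_intel
```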

Maarten Baert, thank you for reporting this and helping make Ubuntu better. Could you please confirm this issue exists with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ . If the issue remains, please just make a comment to this.

If reproducible, could you also please test the latest upstream kernel available (not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.13-rc6

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Maarten Baert (maarten-baert) wrote :

Still reproducible on upstream kernel 3.13.0-rc6. As with the other kernels, it only happens when aesni_intel is loaded.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.13-rc6
removed: needs-upstream-testing
description: updated

Maarten Baert, thank you for providing the requested information. Could you please provide the information following https://help.ubuntu.com/community/DebuggingIRQProblems and advise if it changes anything?

summary: - Kernel crash in ext4/ecryptfs after core dump in encrypted folder
+ kernel BUG at /build/buildd/linux-3.11.0/fs/buffer.c:1268!; RIP:
+ 0010:[<ffffffff816e3efd>] [<ffffffff816e3efd>]
+ check_irqs_on.part.11+0x4/0x6
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Maarten Baert (maarten-baert) wrote :

I tried kernel 3.13.0-rc6 with:
* noapic
* pci=routeirq
* pci=noacpi
* acpi=off
In all cases I can still reproduce the bug, nothing changes.

tags: added: kernel-da-key
Maarten Baert (maarten-baert) wrote :

I don't know, this is a new computer. But on my older computer I had an identical setup that I have used for months with kernels 3.9 up to 3.12 with no issues at all. On the new computer it happened twice in just one week. So I suspect it is hardware-related somehow.

Maarten Baert (maarten-baert) wrote :

Oh, and for what it's worth, I have run memtest86+ and mprime to eliminate the most obvious hardware problems, and neither test found any errors.

Maarten Baert (maarten-baert) wrote :

I tried to install 12.04 but I ran into some issues. The installer seems buggy: it used the wrong EFI partition and failed to install grub, but I was still able to boot it by re-using the 13.10 grub install, and I think the install is okay. Unfortunately 12.04 doesn't seem to have drivers for my ethernet controller, so I couldn't connect to the internet and update (this was already the case with the live CD as well). I did the test anyway based on the packages that came with the CD (that was the very first 12.04 I believe, not 12.04.2).

It turns out ecryptfs on 12.04 does not load the aesni_intel module (I suppose it didn't support it yet). Because of that it does not crash. I tried loading the aesni_intel module manually, but this does not seem to make a difference (ecryptfs probably doesn't use it).

So technically this is a regression from 12.04, but only because 12.04 didn't use aesni_intel yet.

My previous installation *did* use aesni_intel without problems (but with different hardware).

Maarten Baert, thank you for performing the requested test. Would Quantal allow a proper test via http://releases.ubuntu.com/quantal/ ?

Maarten Baert (maarten-baert) wrote :

I can do it if you really want, but I think I already know the result. According to this:
http://www.phoronix.com/scan.php?page=news_item&px=MTM2OTg
... ecryptfs AES-NI support was only added in kernel 3.10. Both quantal and raring use older kernels, so they will not load the aesni_intel module and won't be affected, just like 12.04.

I will try to install some older kernels on saucy; that should let me find out whether the issue started with kernel 3.10 or later.

Maarten Baert (maarten-baert) wrote :

The tests I did confirmed what I suspected:

* kernel 3.13 is affected (if aesni_intel is loaded)
* kernel 3.11 is affected (if aesni_intel is loaded)
* kernel 3.10 is affected (if aesni_intel is loaded)
* kernel 3.9 is NOT affected (ecryptfs doesn't use aesni_intel so it doesn't matter whether it's loaded)

Maarten Baert, the issue you are reporting is an upstream one. Could you please report this problem through the appropriate channel by following the instructions _verbatim_ at https://wiki.ubuntu.com/Bugs/Upstream/kernel ?

Please provide a direct URL to your e-mail to the mailing list once you have made it so that it may be tracked.

Thank you for your understanding.

tags: removed: regression-potential
Changed in linux (Ubuntu):
status: Incomplete → Triaged

I am able to reproduce this bug as well, using Ubuntu saucy (13.10) with kernel 3.11.0-15-generic #23-Ubuntu SMP. The machine is an HP Folio13 laptop. I am happy to provide more info and/or test patches.

I actually encountered it independently. When you run latex from emacs, it spawns evince to view your generated pdf. When you re-latex your source file, emacs kills evince with SIGQUIT (actually it "types" the QUIT character to it in a pseudo-tty; not so friendly, huh?), triggering a core dump and this bug. I could only reproduce it intermittently at first, which I think was because I had apport running, which intercepted the core dump. With apport disabled it is reproducible.
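
A quick way to tell whether a handler such as apport will intercept the dump is to look at the core pattern: when it starts with '|', the kernel pipes the core to a program instead of writing a file. A minimal sketch, with a made-up helper name (core_pattern_is_piped):

```shell
#!/bin/sh
# core_pattern_is_piped [FILE]: succeed if the core pattern starts with '|',
# meaning core dumps are piped to a helper program rather than written out
# as files. FILE defaults to /proc/sys/kernel/core_pattern; the argument
# exists only so the check can be run against a saved copy.
core_pattern_is_piped() {
    pattern=$(head -n 1 "${1:-/proc/sys/kernel/core_pattern}" 2>/dev/null)
    case "$pattern" in
        '|'*) return 0 ;;
        *)    return 1 ;;
    esac
}

if core_pattern_is_piped; then
    echo "core dumps are piped to a handler; the crash may not reproduce"
else
    echo "core dumps are written as files (or the pattern could not be read)"
fi
```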

A question: Maarten, it looks like you sent this to the ecryptfs mailing list - are you sure that's the right place, given that it looks like aesni_intel may be the culprit?

Maarten Baert (maarten-baert) wrote :

I don't know. The cause is apparently that IRQs are disabled, but I don't know whether that's considered an error. The aesni_intel module is needed to trigger the bug, but that doesn't imply that it contains the bug. It's possible that ecryptfs is using it wrong, or that the fallback code (which is used when aesni_intel is not available) happens to contain something that avoids the bug. Since ecryptfs is the module that binds all these other modules together, I thought that was the right place to ask.

The ecryptfs mailing list doesn't seem very active though. Is it normal that this takes so long? Should I send it directly to one of the maintainers?

For what it's worth, here is the backtrace from when I reproduced the bug using emacs/evince. Maybe it is helpful to look for similarities in the code path, though it certainly sounds like the crypto code in ecryptfs is the place to begin. I may try putting in lots of WARN_ON(irqs_disabled()).

Just as a note, from disassembly it doesn't appear that the aesni_intel module contains the cli instruction, so interrupts must get disabled somewhere else. As a wild guess, somewhere there may be an irq_enable/irq_disable pair with the possibility of erroneously jumping out from the middle, and something about using the aesni_intel module makes that happen. Maybe in the generic crypto code that only calls aesni_intel if it's available?
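
The disassembly check could be repeated roughly like this. It is only a sketch: count_cli is a made-up helper, and the module path in the usage comment is an assumption that varies by kernel and distribution. Finding no cli is suggestive rather than conclusive.

```shell
#!/bin/sh
# count_cli: read objdump-style disassembly on stdin and print how many
# lines carry the cli mnemonic.
count_cli() {
    # objdump -d separates address, opcode bytes and mnemonic with tabs,
    # so the mnemonic (plus operands) is the third tab-separated field.
    awk -F'\t' '$3 ~ /^cli[ \t]*$/ { n++ } END { print n + 0 }'
}

# Hypothetical usage (the module path is an assumption):
#   objdump -d "/lib/modules/$(uname -r)/kernel/arch/x86/crypto/aesni-intel.ko" | count_cli
```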

[ 322.435871] ------------[ cut here ]------------
[ 322.435925] kernel BUG at /build/buildd/linux-3.11.0/fs/buffer.c:1268!
[ 322.435979] invalid opcode: 0000 [#1] SMP
[ 322.436017] Modules linked in: xt_recent michael_mic arc4 dm_crypt joydev ip6t_REJECT xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_comment xt_LOG parport_pc ppdev lp parport xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd hp_wmi sparse_keymap snd_hda_codec_hdmi snd_hda_codec_idt binfmt_misc uvcvideo videobuf2_vmalloc snd_hda_intel snd_hda_codec videobuf2_memops snd_hwdep videobuf2_core videodev snd_pcm lib80211_crypt_tkip snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq microcode wl(POF) snd_seq_device snd_timer lib80211 psmouse serio_raw cfg80211 rtsx_pci_ms snd memstick lpc_ich mei_me soundcore mei mac_hid rtsx_pci_sdmmc i915 i2c_algo_bit drm_kms_helper sdhci_pci sdhci drm ahci r8169 rtsx_pci mii libahci wmi video
[ 322.437181] CPU: 3 PID: 3174 Comm: evince Tainted: PF O 3.11.0-15-generic #23-Ubuntu
[ 322.437266] Hardware name: Hewlett-Packard HP Folio 13 Notebook PC/17F8, BIOS F.0B 01/23/2013
[ 322.437353] task: ffff880146af2ee0 ti: ffff880144152000 task.ti: ffff880144152000
[ 322.437511] RIP: 0010:[<ffffffff816e3efd>] [<ffffffff816e3efd>] check_irqs_on.part.11+0x4/0x6
[ 322.437699] RSP: 0018:ffff8801441534c8 EFLAGS: 00010046
[ 322.437805] RAX: 0000000000000086 RBX: 0000000000001000 RCX: ffff880144955800
[ 322.437937] RDX: 0000000000001000 RSI: 0000000000000554 RDI: ffff88014934a3c0
[ 322.438069] RBP: ffff8801441534c8 R08: 0000000000000000 R09: 0000000000000000
[ 322.438186] R10: ffff880144955800 R11: 0000000000001000 R12: ffff880144153650
[ 322.438262] R13: ffff8801438b9000 R14: ffff88014f8a8000 R15: ffff88014934a3c0
[ 322.438338] FS: 00007feab6487a00(0000) GS:ffff88014fac0000(0000) knlGS:0000000000000000
[ 322.438425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 322.438486] CR2: 00000000021aa1b8 CR3: 0000000001c0e000 CR4: 00000000000407e0
[ 322.438561] Stack:
[ 322.438584] ffff880144153538 ffff...

Tyler Hicks (tyhicks) wrote :

Thanks for the great bug report, Maarten. I've linked it into the upstream eCryptfs tracker (also hosted on Launchpad) and I'll start looking into it.

It initially feels like a bug in the aesni module, but that's just an early guess.

Changed in ecryptfs:
assignee: nobody → Tyler Hicks (tyhicks)
importance: Undecided → High

Tyler, before you spend any time on this: I've already investigated and
think it may actually be a bug in the core kernel FPU code. I reported it
on LKML with a tentative patch: https://lkml.org/lkml/2014/1/19/86
I haven't yet received any feedback or discussion from LKML. When I have a
chance, I was thinking of writing it up as a formal patch submission, in
the hope that it would get more attention that way. If you have any other
ideas (about the bug itself or how to get it fixed), that would be great!

Thanks!

Nate

On Mon, 27 Jan 2014, Tyler Hicks wrote:

> Thanks for the great bug report, Maarten. I've linked it into the
> upstream eCryptfs tracker (also hosted on Launchpad) and I'll start
> looking into it.
>
> It initially feels like a bug in the aesni module, but that's just an
> early guess.

Stefan Bader (smb) wrote :

Just happened to stumble across a lengthy thread on LKML/stable related to this. To me it looks to be settling down to have two proposed patches:

* Re: [PATCH] Make math_state_restore() save and restore the interrupt flag
 - x86, fpu: remove the logic of non-eager fpu mem allocation at the first usage
 - x86, fpu: check tsk_used_math() in kernel_fpu_end() for eager fpu

which would need confirmation (at least the second one)
