ext4_mb_generate_buddy:756: group N, block bitmap and bg descriptor inconsistent: X vs Y

Bug #1423672 reported by Simon Déziel on 2015-02-19
This bug affects 16 people
Affects                     Status        Importance  Assigned to    Milestone
Linux                       Unknown       Unknown
linux (Debian)              Fix Released  Unknown
linux (Ubuntu)                            High        Unassigned
  Trusty                                  High        Chris J Arges
linux-lts-utopic (Ubuntu)                 Undecided   Unassigned
  Trusty                                  Undecided   Chris J Arges

Bug Description

 SRU Justification:

    Impact: Users of VMs running 3.13/3.16 and ext4 can experience data corruption in the guest.
    Fix: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424
    Testcase: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502#22

--

I noticed that one of my VMs had this "dmesg -T" output:

[Tue Feb 17 09:53:27 2015] systemd-udevd[5433]: starting version 204
[Thu Feb 19 06:25:08 2015] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 16446 vs 16445 free clusters
[Thu Feb 19 06:25:09 2015] Aborting journal on device vda1-8.
[Thu Feb 19 06:25:09 2015] EXT4-fs (vda1): Remounting filesystem read-only
[Thu Feb 19 06:25:09 2015] ------------[ cut here ]------------
[Thu Feb 19 06:25:09 2015] WARNING: CPU: 0 PID: 9946 at /build/buildd/linux-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a2/0x1c0()
[Thu Feb 19 06:25:09 2015] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_filter ip_tables x_tables serio_raw psmouse floppy
[Thu Feb 19 06:25:09 2015] CPU: 0 PID: 9946 Comm: logrotate Not tainted 3.13.0-45-generic #74-Ubuntu
[Thu Feb 19 06:25:09 2015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[Thu Feb 19 06:25:09 2015] 0000000000000009 ffff880003a11aa0 ffffffff81720eb6 0000000000000000
[Thu Feb 19 06:25:09 2015] ffff880003a11ad8 ffffffff810677cd ffff880000c41340 0000000000000000
[Thu Feb 19 06:25:09 2015] ffff88000a58e000 ffffffff81835280 0000000000001302 ffff880003a11ae8
[Thu Feb 19 06:25:09 2015] Call Trace:
[Thu Feb 19 06:25:09 2015] [<ffffffff81720eb6>] dump_stack+0x45/0x56
[Thu Feb 19 06:25:09 2015] [<ffffffff810677cd>] warn_slowpath_common+0x7d/0xa0
[Thu Feb 19 06:25:09 2015] [<ffffffff810678aa>] warn_slowpath_null+0x1a/0x20
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e862>] __ext4_handle_dirty_metadata+0x1a2/0x1c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81246a5a>] ? ext4_dirty_inode+0x2a/0x60
[Thu Feb 19 06:25:09 2015] [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[Thu Feb 19 06:25:09 2015] [<ffffffff810aacc5>] ? wake_up_bit+0x25/0x30
[Thu Feb 19 06:25:09 2015] [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[Thu Feb 19 06:25:09 2015] [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[Thu Feb 19 06:25:09 2015] [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d8f60>] evict+0xb0/0x1b0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d9775>] iput+0xf5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4698>] __dentry_kill+0x1a8/0x200
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4795>] dput+0xa5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf7e6>] __fput+0x176/0x260
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf91e>] ____fput+0xe/0x10
[Thu Feb 19 06:25:09 2015] [<ffffffff810882f7>] task_work_run+0xa7/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81013ed7>] do_notify_resume+0x97/0xb0
[Thu Feb 19 06:25:09 2015] [<ffffffff81731c2a>] int_signal+0x12/0x17
[Thu Feb 19 06:25:09 2015] ---[ end trace ebff9843d81b5c41 ]---
[Thu Feb 19 06:25:09 2015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[Thu Feb 19 06:25:09 2015] IP: [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] PGD 167067 PUD 161067 PMD 0
[Thu Feb 19 06:25:09 2015] Oops: 0000 [#1] SMP
[Thu Feb 19 06:25:09 2015] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_filter ip_tables x_tables serio_raw psmouse floppy
[Thu Feb 19 06:25:09 2015] CPU: 0 PID: 9946 Comm: logrotate Tainted: G W 3.13.0-45-generic #74-Ubuntu
[Thu Feb 19 06:25:09 2015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[Thu Feb 19 06:25:09 2015] task: ffff880009ac4800 ti: ffff880003a10000 task.ti: ffff880003a10000
[Thu Feb 19 06:25:09 2015] RIP: 0010:[<ffffffff8125d4c1>] [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] RSP: 0000:ffff880003a11a58 EFLAGS: 00010292
[Thu Feb 19 06:25:09 2015] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000086
[Thu Feb 19 06:25:09 2015] RDX: 0000000000001302 RSI: ffffffff81a6e81f RDI: 0000000000000000
[Thu Feb 19 06:25:09 2015] RBP: ffff880003a11ae8 R08: ffffffff81a78568 R09: 0000000000000005
[Thu Feb 19 06:25:09 2015] R10: 00000000ffffffe2 R11: ffff880003a117ce R12: 0000000000000086
[Thu Feb 19 06:25:09 2015] R13: ffffffff81835280 R14: 0000000000001302 R15: ffffffff81a78568
[Thu Feb 19 06:25:09 2015] FS: 00007f74eaca4840(0000) GS:ffff88000b800000(0000) knlGS:0000000000000000
[Thu Feb 19 06:25:09 2015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Thu Feb 19 06:25:09 2015] CR2: 0000000001de4000 CR3: 0000000009b45000 CR4: 00000000000006f0
[Thu Feb 19 06:25:09 2015] Stack:
[Thu Feb 19 06:25:09 2015] ffff880003a11a60 0000000000000103 ebff9843d81b5c41 000000000000321e
[Thu Feb 19 06:25:09 2015] 00000000000014d8 0000000000000092 000000000000020e ffff88000a58e000
[Thu Feb 19 06:25:09 2015] ffff880003a11ae8 ffffffff8126e372 ffffffff810677df ffff880000c41340
[Thu Feb 19 06:25:09 2015] Call Trace:
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e372>] ? ext4_journal_abort_handle+0x42/0xc0
[Thu Feb 19 06:25:09 2015] [<ffffffff810677df>] ? warn_slowpath_common+0x8f/0xa0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e7cf>] __ext4_handle_dirty_metadata+0x10f/0x1c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[Thu Feb 19 06:25:09 2015] [<ffffffff810aacc5>] ? wake_up_bit+0x25/0x30
[Thu Feb 19 06:25:09 2015] [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[Thu Feb 19 06:25:09 2015] [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[Thu Feb 19 06:25:09 2015] [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d8f60>] evict+0xb0/0x1b0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d9775>] iput+0xf5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4698>] __dentry_kill+0x1a8/0x200
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4795>] dput+0xa5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf7e6>] __fput+0x176/0x260
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf91e>] ____fput+0xe/0x10
[Thu Feb 19 06:25:09 2015] [<ffffffff810882f7>] task_work_run+0xa7/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81013ed7>] do_notify_resume+0x97/0xb0
[Thu Feb 19 06:25:09 2015] [<ffffffff81731c2a>] int_signal+0x12/0x17
[Thu Feb 19 06:25:09 2015] Code: 48 89 e5 41 57 4d 89 c7 41 56 41 89 d6 41 55 49 89 f5 48 c7 c6 1f e8 a6 81 41 54 49 89 cc 53 48 89 fb 48 83 ec 68 4c 89 4c 24 60 <48> 8b 47 28 48 8b 57 40 48 8b 80 f8 02 00 00 48 8b 40 68 89 90
[Thu Feb 19 06:25:09 2015] RIP [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] RSP <ffff880003a11a58>
[Thu Feb 19 06:25:09 2015] CR2: 0000000000000028
[Thu Feb 19 06:25:10 2015] ---[ end trace ebff9843d81b5c42 ]---

cron.daily jobs fired at 6:25:01 apparently:

# tail -n2 /var/log/syslog
Feb 19 06:17:01 git CRON[9848]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 19 06:25:01 git CRON[9853]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))

# run-parts --test /etc/cron.daily
/etc/cron.daily/apt
/etc/cron.daily/autoremove
/etc/cron.daily/dpkg
/etc/cron.daily/hdd-backup
/etc/cron.daily/logrotate
/etc/cron.daily/passwd
/etc/cron.daily/upstart

It seems like all the jobs ran and the upstart one somehow triggered the crash:

# ls -alt /var/log/upstart/ | head
total 272
drwxrwxr-x 5 root syslog 4096 Feb 19 06:25 ..
drwxr-xr-x 2 root root 4096 Feb 14 06:25 .
-rw-r----- 1 root root 591 Feb 12 12:26 ureadahead.log.1.gz
-rw-r----- 1 root root 178 Feb 12 12:25 mountall.log.1.gz

Now that I have collected some information (sorry, I don't have ubuntu-bug installed on the VM) I'll reboot it and see how it goes.

More information on the VM:

# lsb_release -rd
Description: Ubuntu 14.04.2 LTS
Release: 14.04
# apt-cache policy linux-image-3.13.0-45-generic
linux-image-3.13.0-45-generic:
  Installed: 3.13.0-45.74
  Candidate: 3.13.0-45.74
  Version table:
 *** 3.13.0-45.74 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 19 14:34 seq
 crw-rw---- 1 root audio 116, 33 Feb 19 14:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: [Errno 2] No such file or directory
CRDA: Error: [Errno 2] No such file or directory
CurrentDmesg:
 [ 13.891047] init: console-font main process (855) terminated with status 71
 [ 13.952825] init: plymouth-splash main process (870) terminated with status 1
 [ 217.853139] random: nonblocking pool is initialized
DistroRelease: Ubuntu 14.04
IwConfig: Error: [Errno 2] No such file or directory
Lspci: Error: [Errno 2] No such file or directory
Lsusb: Error: [Errno 2] No such file or directory
MachineType: QEMU Standard PC (i440FX + PIIX, 1996)
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=cb9cbdad-c668-4503-85db-fcf9b02f3495 ro console=tty0 console=ttyS0,38400
ProcVersionSignature: Ubuntu 3.13.0-45.74-generic 3.13.11-ckt13
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-45-generic N/A
 linux-backports-modules-3.13.0-45-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-45-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 01/01/2011
dmi.bios.vendor: Bochs
dmi.bios.version: Bochs
dmi.chassis.type: 1
dmi.chassis.vendor: Bochs
dmi.modalias: dmi:bvnBochs:bvrBochs:bd01/01/2011:svnQEMU:pnStandardPC(i440FX+PIIX,1996):pvrpc-i440fx-2.0:cvnBochs:ct1:cvr:
dmi.product.name: Standard PC (i440FX + PIIX, 1996)
dmi.product.version: pc-i440fx-2.0
dmi.sys.vendor: QEMU

Simon Déziel (sdeziel) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1423672

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty

apport information

tags: added: apport-collected
description: updated

Do you have a way to reproduce this bug, or was it a one time event?

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Simon Déziel (sdeziel) wrote :

It was a one time event. I'll report here if/when this happens again.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Simon Déziel (sdeziel) wrote :

This problem just occurred on *another* VM. Again, right after the cron.daily run.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.0 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-rc1-vivid/

Simon Déziel (sdeziel) on 2015-03-04
summary: - ext4 turned read-only following a cronjob run
+ ext4 turned read-only during logrotate daily run

This occurred on another VM. I guess I will have to roll out the 4.0-rc2 kernel everywhere.

Joseph Salisbury (jsalisbury) wrote :

It would be good to know if v4.0-rc2 fixes this. If it does, we can look at the git logs to see what may be the fix, or perform a "Reverse" bisect.

Joseph Salisbury (jsalisbury) wrote :

Also, did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of a regression, and when this regression was introduced. If this is a regression, we could also perform a kernel bisect to identify the commit that introduced the problem, if it's not fixed in 4.0.

Simon Déziel (sdeziel) wrote :

The first occurrence of this problem goes back to Feb 17th. While I keep all the VMs up to date on a daily basis, I now have doubts about the hypervisor's RAM (not ECC by the way). I brought all the VMs offline to have their FSes checked (all ext4). Several of them needed fixing by fsck. Since they are all part of a mdadm RAID1 array with 2 devices, maybe there was some bit flipping on one of the drives.

Before wasting anyone's time on this issue, I'll first make sure the hypervisor's RAM is sane. So I'll get back after running memtest86+, then I'll look at the 4.0 kernel.

Thanks Joseph

Mark Deneen (mdeneen) wrote :

This happened to me here as well. Also in a VM.

[ 42.042806] ------------[ cut here ]------------
[ 42.042812] WARNING: CPU: 0 PID: 617 at /build/buildd/linux-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a2/0x1c0()
[ 42.042813] Modules linked in: kvm_intel kvm cirrus snd_hda_intel ttm snd_hda_codec snd_hwdep drm_kms_helper serio_raw snd_pcm lp parport snd_page_alloc snd_timer drm snd soundcore syscopyarea sysfillrect sysimgblt i2c_piix4 mac_hid psmouse floppy pata_acpi
[ 42.042826] CPU: 0 PID: 617 Comm: mv Not tainted 3.13.0-45-generic #74-Ubuntu
[ 42.042828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 42.042829] 0000000000000009 ffff88007acd3ad0 ffffffff81720eb6 0000000000000000
[ 42.042831] ffff88007acd3b08 ffffffff810677cd ffff88007bf53d68 0000000000000000
[ 42.042833] ffff88007b5c4000 ffffffff81835280 0000000000001302 ffff88007acd3b18
[ 42.042835] Call Trace:
[ 42.042839] [<ffffffff81720eb6>] dump_stack+0x45/0x56
[ 42.042842] [<ffffffff810677cd>] warn_slowpath_common+0x7d/0xa0
[ 42.042844] [<ffffffff810678aa>] warn_slowpath_null+0x1a/0x20
[ 42.042846] [<ffffffff8126e862>] __ext4_handle_dirty_metadata+0x1a2/0x1c0
[ 42.042850] [<ffffffff81246a5a>] ? ext4_dirty_inode+0x2a/0x60
[ 42.042853] [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[ 42.042855] [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[ 42.042857] [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[ 42.042859] [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[ 42.042861] [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[ 42.042863] [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[ 42.042865] [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[ 42.042867] [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[ 42.042870] [<ffffffff811d8f60>] evict+0xb0/0x1b0
[ 42.042872] [<ffffffff811d9775>] iput+0xf5/0x180
[ 42.042875] [<ffffffff811ce09e>] do_unlinkat+0x18e/0x2b0
[ 42.042878] [<ffffffff8172d2ca>] ? do_page_fault+0x1a/0x70
[ 42.042880] [<ffffffff8172c949>] ? do_async_page_fault+0x29/0xe0
[ 42.042882] [<ffffffff811cf006>] SyS_unlink+0x16/0x20
[ 42.042884] [<ffffffff8173196d>] system_call_fastpath+0x1a/0x1f
[ 42.042885] ---[ end trace 18d6f1c79bfdd3f4 ]---
[ 42.042889] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[ 42.042930] IP: [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[ 42.042961] PGD 7c351067 PUD 79b43067 PMD 0
[ 42.042987] Oops: 0000 [#1] SMP
[ 42.043714] Modules linked in: kvm_intel kvm cirrus snd_hda_intel ttm snd_hda_codec snd_hwdep drm_kms_helper serio_raw snd_pcm lp parport snd_page_alloc snd_timer drm snd soundcore syscopyarea sysfillrect sysimgblt i2c_piix4 mac_hid psmouse floppy pata_acpi
[ 42.045134] CPU: 0 PID: 617 Comm: mv Tainted: G W 3.13.0-45-generic #74-Ubuntu
[ 42.045134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 42.045134] task: ffff88007accb000 ti: ffff88007acd2000 task.ti: ffff88007acd2000
[ 42.045134] RIP: 0010:[<ffffffff8125d4c1>] [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[ 42.045134] RSP: 0018:...

On 03/04/2015 05:07 PM, Simon Déziel wrote:
> Before wasting anyone's time on this issue, I'll first make sure the
> hypervisor's RAM is sane. So I'll get back after some memtest86+ then
> I'll look at the 4.0 kernel.

After 15+ hours of memtest86+ (8 full passes), I don't think the RAM is
at fault. I ran a fsck on every VM and many of them had issues. I'm
still using the default Ubuntu kernel for now.

One weird thing I noticed is that apparently all the affected VMs have
trouble with the /etc/cron.daily/logrotate job:

 error: error creating output file /var/log/syslog.1.gz: File exists
 run-parts: /etc/cron.daily/logrotate exited with return code 1

On those, there was a very old /var/log/syslog.1 and a more recent .1.gz
one too. I removed both and will see if it helps.

Changed in linux (Ubuntu):
importance: Medium → High
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
status: New → Confirmed

FWIW, I've seen this happen, too, basically since I upgraded to Trusty, I believe. Only ever happens on VMs, never on physical machines. All my virtual disks use Virtio, cache mode 'none' (yeah, I know), I/O mode 'native', with raw LVM volumes as backend.

On 03/09/2015 04:18 PM, Oliver Brakmann wrote:
> Only ever happens on VMs, never on physical machines. All my
> virtual disks use Virtio, cache mode 'none' (yeah, I know), I/O mode
> 'native', with raw LVM volumes as backend.

Exact same setup here. What's wrong with cache mode set to none? This
avoids double page caching.

It's been a while since there was any activity on this bug. Does this issue still happen with the latest updates?

Changed in linux (Ubuntu Trusty):
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Oliver Brakmann (obrakmann) wrote :

Indeed, I haven't seen this error in a few weeks; the last time was maybe six to eight weeks ago. But we all know that absence of evidence isn't evidence of absence :)

Simon Déziel (sdeziel) wrote :

Things seem to have stabilized recently as the last occurrence I witnessed was on May 13th.

I searched all my 2014 and 2015 logs for "ext4_mb_generate_buddy" and here is what I got:

Nov 24 02:54:03 smb kernel: [842091.081319] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 8198 vs 8197 free clusters
Jan 18 06:25:17 ns0 kernel: [330912.633362] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 3, block bitmap and bg descriptor inconsistent: 17617 vs 17616 free clusters
Jan 19 06:30:38 ns0 kernel: [417633.256202] EXT4-fs (vda1): initial error at time 1421580317: ext4_mb_generate_buddy:756
Mar 1 06:25:08 ns0 kernel: [154713.220446] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 4, block bitmap and bg descriptor inconsistent: 17270 vs 17268 free clusters
Mar 2 06:26:44 ns0 kernel: [241210.340779] EXT4-fs (vda1): initial error at time 1425209107: ext4_mb_generate_buddy:756
Mar 4 06:25:09 pm kernel: [75552.905227] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 6, block bitmap and bg descriptor inconsistent: 9361 vs 9359 free clusters
Mar 15 06:25:12 smb kernel: [231369.706448] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 3, block bitmap and bg descriptor inconsistent: 17537 vs 17536 free clusters
Mar 23 16:24:25 apt kernel: [ 38.398673] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 33, block bitmap and bg descriptor inconsistent: 22354 vs 22352 free clusters
Mar 26 06:25:18 pm kernel: [223295.593946] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 6, block bitmap and bg descriptor inconsistent: 11499 vs 11497 free clusters
Apr 13 17:42:38 smb kernel: [422214.146392] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:756: group 256, block bitmap and bg descriptor inconsistent: 672 vs 671 free clusters
Apr 14 09:31:22 smtp kernel: [479136.955401] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 7688 vs 7686 free clusters
Apr 26 06:25:07 ns0 kernel: [767851.910454] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 0, block bitmap and bg descriptor inconsistent: 22322 vs 22321 free clusters
Apr 27 06:27:00 ns0 kernel: [854365.159204] EXT4-fs (vda1): initial error at time 1430043907: ext4_mb_generate_buddy:756
Apr 27 06:27:00 ns0 kernel: [854365.169903] EXT4-fs (vda1): last error at time 1430043907: ext4_mb_generate_buddy:756
May 7 08:37:08 log kernel: [ 33.482706] EXT4-fs error (device vdc): ext4_mb_generate_buddy:756: group 2, block bitmap and bg descriptor inconsistent: 28613 vs 28611 free clusters
May 8 08:39:00 log kernel: [86545.376048] EXT4-fs (vdc): initial error at time 1431002228: ext4_mb_generate_buddy:756
May 8 08:39:00 log kernel: [86545.376051] EXT4-fs (vdc): last error at time 1431002228: ext4_mb_generate_buddy:756
May 13 06:25:13 git kernel: [131633.710249] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 14076 vs 14074 free clusters
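
The error lines above all share one format, so the group number and the two free-cluster counts can be extracted mechanically when searching more logs. A minimal sketch, assuming syslog-style lines like the ones quoted in this report; the regular expression and the name `parse_buddy_errors` are illustrative and not part of any kernel or Ubuntu tool:

```python
import re

# Matches the ext4 message quoted in this report, for example:
# "EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5,
#  block bitmap and bg descriptor inconsistent: 16446 vs 16445 free clusters"
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"ext4_mb_generate_buddy:\d+: group (?P<group>\d+), "
    r"block bitmap and bg descriptor inconsistent: "
    r"(?P<first>\d+) vs (?P<second>\d+) free clusters"
)


def parse_buddy_errors(lines):
    """Return one dict per matching line; non-matching lines are skipped."""
    results = []
    for line in lines:
        match = BUDDY_RE.search(line)
        if match is None:
            continue
        first = int(match["first"])
        second = int(match["second"])
        results.append({
            "device": match["device"],
            "group": int(match["group"]),
            "first": first,    # count printed before "vs"
            "second": second,  # count printed after "vs"
            "delta": first - second,
        })
    return results


# Two lines copied from the log excerpt above; only the first is an
# "inconsistent" error, the second is a follow-up reminder line.
sample = [
    "Mar  4 06:25:09 pm kernel: [75552.905227] EXT4-fs error (device vda1): "
    "ext4_mb_generate_buddy:756: group 6, block bitmap and bg descriptor "
    "inconsistent: 9361 vs 9359 free clusters",
    "Mar  2 06:26:44 ns0 kernel: [241210.340779] EXT4-fs (vda1): initial error "
    "at time 1425209107: ext4_mb_generate_buddy:756",
]
for entry in parse_buddy_errors(sample):
    print(entry)
```

In the lines quoted above the two counts differ by only 1 or 2 clusters, which the `delta` field makes easy to tabulate across hosts.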

Oliver Brakmann (obrakmann) wrote :

No, it isn't gone after all, it just happened again:

Jun 13 15:20:37 oberon kernel: [117379.592167] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 19382 vs 19381 free clusters
Jun 13 15:20:37 oberon kernel: [117379.592903] Aborting journal on device dm-1-8.
Jun 13 15:20:37 oberon kernel: [117379.593659] EXT4-fs (dm-1): Remounting filesystem read-only

$ uname -a
Linux oberon 3.13.0-54-generic #91-Ubuntu SMP Tue May 26 19:15:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Simon Déziel (sdeziel) on 2015-06-15
Changed in linux (Ubuntu Trusty):
status: Incomplete → Confirmed
summary: - ext4 turned read-only during logrotate daily run
+ ext4_mb_generate_buddy:756: group N, block bitmap and bg descriptor
+ inconsistent: X vs Y
Simon Déziel (sdeziel) wrote :

And yet again:

root@rproxy:~# dmesg -T | tail
[Thu Jun 11 15:28:56 2015] type=1400 audit(1434050938.364:24): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nc" pid=658 comm="apparmor_parser"
[Thu Jun 11 15:28:59 2015] init: plymouth-upstart-bridge main process ended, respawning
[Thu Jun 11 15:29:02 2015] init: console-font main process (757) terminated with status 71
[Thu Jun 11 15:29:03 2015] init: plymouth-splash main process (815) terminated with status 1
[Thu Jun 11 18:18:40 2015] random: nonblocking pool is initialized
[Mon Jun 15 16:53:35 2015] EXT4-fs (vda1): pa ffff8800061e1000: logic 3293, phys. 469944, len 424
[Mon Jun 15 16:53:35 2015] EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3753: group 14, free 13, pa_free 11
[Mon Jun 15 16:53:35 2015] Aborting journal on device vda1-8.
[Mon Jun 15 16:53:36 2015] EXT4-fs (vda1): Remounting filesystem read-only
[Mon Jun 15 16:53:36 2015] EXT4-fs error (device vda1): ext4_journal_check_start:56: Detected aborted journal

root@rproxy:~# uname -a
Linux rproxy.dmz.sdeziel.info 3.13.0-54-generic #91-Ubuntu SMP Tue May 26 19:15:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

gschoenberger (schoeni-georg) wrote :

Seems we are also confronted with this error in one of our VMware guests:
[45780.021968] EXT4-fs error (device dm-9): ext4_mb_generate_buddy:756: group 1411, block bitmap and bg descriptor inconsistent: 32757 vs 32768 free clusters
[45780.022029] Aborting journal on device dm-9-8.
[45780.055657] EXT4-fs (dm-9): Remounting filesystem read-only
[45780.055733] EXT4-fs error (device dm-9) in ext4_orphan_add:2609: Journal has aborted

It happened four times in the last three weeks, which is pretty annoying. We have tried to "harden" ext4 somewhat by using:
* ext4 rw,relatime,nodelalloc,errors=remount-ro,commit=2
as mount options. We have also increased hung_task_timeout_secs, following some hints we found in a Red Hat KB article.

Any ideas on how to fix that ext4 error in the guest?

Peter (peter-hirn) wrote :

I'm having this exact issue with Debian Jessie. Virtual machines with LVM storage fail unpredictably after cron.daily.

[680406.357587] EXT4-fs error (device vda1): ext4_mb_generate_buddy:757: group 49, block bitmap and bg descriptor inconsistent: 18356 vs 18355 free clusters
[680406.361696] Aborting journal on device vda1-8.
[680406.453757] EXT4-fs error (device vda1): ext4_journal_check_start:56:
[680406.453999] EXT4-fs (vda1): Remounting filesystem read-only
[680406.457551] Detected aborted journal

[766809.056039] EXT4-fs (vda1): error count since last fsck: 2
[766809.056069] EXT4-fs (vda1): initial error at time 1441383430: ext4_mb_generate_buddy:757
[766809.056082] EXT4-fs (vda1): last error at time 1441383430: ext4_journal_check_start:56
[853316.576033] EXT4-fs (vda1): error count since last fsck: 2
[853316.576052] EXT4-fs (vda1): initial error at time 1441383430: ext4_mb_generate_buddy:757
[853316.576066] EXT4-fs (vda1): last error at time 1441383430: ext4_journal_check_start:56

root@collect1 ~ # uname -a
Linux collect1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux
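
The "initial error at time" and "last error at time" values in the log above are seconds since the Unix epoch. A small sketch for turning them into readable UTC timestamps, which helps when matching them against cron schedules; `error_time_utc` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def error_time_utc(epoch_seconds):
    """Convert an ext4 'error at time N' value to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Value taken from the "initial error at time" line quoted above.
print(error_time_utc(1441383430))  # 2015-09-04T16:17:10+00:00
```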

maxnuv (massimo-n) wrote :

Same error here: an Ubuntu 14.04 virtual machine running on KVM, on a 14.04 server.

No hardware issue, the entire server was replaced after the second stop.

No hard disk issue either: the server uses hardware RAID.

The server is running more than one virtual server, and only one is affected.

davidak (davidak) wrote :

We have the same error on a 14.04 VM running on a XenServer 6.2 (paravirtualized).

EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 36, block bitmap and bg descriptor inconsistent: 30504 vs 30149 free clusters

Linux xxx 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I created a new VM, restored a backup, and ran the daily cron jobs to try to reproduce it, but it didn't happen again.

run-parts -v --exit-on-error /etc/cron.daily

output of dmesg -T

...
[Mon Sep 14 14:47:00 2015] random: lvm urandom read with 64 bits of entropy available
[Mon Sep 14 14:47:00 2015] bio: create slab <bio-1> at 1
[Mon Sep 14 14:47:00 2015] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[Mon Sep 14 14:47:00 2015] EXT4-fs (dm-1): write access will be enabled during recovery
[Mon Sep 14 14:47:01 2015] random: nonblocking pool is initialized
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): orphan cleanup on readonly fs
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): 16 orphan inodes deleted
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): recovery complete
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
...

That seems to be OK. I also tried with cron.weekly...

LouieGosselin (0-ubunbu-d) wrote :

I'm on Debian, but it's happening to me as well.
KVM with virtual disks backed by LVM volumes on the host.

Both the VM and the host are running
Linux version 3.16.0-4-amd64 (<email address hidden>) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)

First occurred on August 24th. Manual fsck required, lots of files converted to lost-inodes.
[1992712.418275] EXT4-fs error (device vda2): ext4_mb_generate_buddy:757: group 96, block bitmap and bg descriptor inconsistent: 24017 vs 24015 free clusters
[1992712.513438] Aborting journal on device vda2-8.
[1992712.514007] EXT4-fs (vda2): Remounting filesystem read-only
[1992712.514205] EXT4-fs error (device vda2) in ext4_evict_inode:243: Journal has aborted

Happened again today September 14th in the same VM.
[1489393.753098] EXT4-fs error (device vda2): ext4_mb_generate_buddy:757: group 144, block bitmap and bg descriptor inconsistent: 23914 vs 23913 free clusters
[1489393.803865] Aborting journal on device vda2-8.
[1489393.804439] EXT4-fs (vda2): Remounting filesystem read-only

This is the first syslog activity since I rebooted in August, no block IO errors on the guest or host.

Manual fsck required, files were lost, but everything is running again.

It has not happened in other VMs running older kernels. It also has not happened on the host, however there's very little file system activity on the host. The fact that it hasn't happened on other VMs leads me to believe the bug is inside the guest rather than with KVM - perhaps ext4 or the virtual disk driver.

maxnuv (massimo-n) wrote :

Same error, on a single VM, virtio driver, ext4:

EXT4-fs (vda1): pa ffff880010c3daf8: logic 512, phys. 6750208, len 512
EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3773: group 206, free 143, pa_free 142
Aborting journal on device vda1-8
Disk quota active

No memory error (checked)
No disk error (checked too)

After resetting the system, the filesystem needed a check and repair. The quota data was corrupted, and quotacheck took a long time to complete.

The error is strictly correlated with filesystem activity: when activity rises, the error becomes more frequent.

Ubuntu 14.04, fully updated, kernel 3.16.0-49-generic

Emmanuel Lacour (elacour) wrote :

Same problem here on 2 VMs, appeared 4 times in one week.

VMs are running Debian Jessie, on Debian wheezy hypervisors running ceph giant, libvirt 1.2.9-9~bpo70+1, qemu 1:2.1+dfsg-12~bpo70+1.

FS is ext4, using virtio and rbd with writeback.

After first crash we upgraded kernel to 4.1.6-1~bpo8+1. But it crashed again next day :(

It seems to be related to memory usage, at least for the last crash.

Attached the trace.

We added more memory to see if that helps to avoid triggering the bug.

I'm getting this same issue every day

3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u4 x86_64 GNU/Linux

Guest is running under proxmox 3.4 virtio

Emmanuel Lacour (elacour) wrote :

Adding memory did not help, but removing the fstab option errors=remount-ro from the mountpoints having problems seems to have fixed the issue. That option is usually only set on the root fs, and we have no problem on the root fs, but it had been added by mistake to other mountpoints and triggered the bug.

No more read-only remounts for a week now, and no ext4 messages in the logs.
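
To see which mountpoints carry that option, the fstab can be scanned for ext4 entries with errors=remount-ro. A minimal sketch, assuming the standard whitespace-separated fstab layout; the function name and the sample entries are made up for illustration, and the script only reports where the option is set without judging whether it should be removed:

```python
def ext4_mounts_with_remount_ro(fstab_text):
    """Return mountpoints of ext4 entries whose options include errors=remount-ro."""
    hits = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # too few fields to carry an options column
        _device, mountpoint, fstype, options = fields[:4]
        if fstype == "ext4" and "errors=remount-ro" in options.split(","):
            hits.append(mountpoint)
    return hits


# Hypothetical fstab content, for illustration only.
example = """\
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/vda1  /         ext4  errors=remount-ro           0 1
/dev/vdb1  /srv      ext4  defaults,errors=remount-ro  0 2
/dev/vdc1  /var/log  ext4  defaults                    0 2
"""
print(ext4_mounts_with_remount_ro(example))  # ['/', '/srv']
```

On a real system the text would come from reading /etc/fstab; any entry other than "/" in the result corresponds to the situation described above.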

Jan Wagner (waja) wrote :

I also got hit by this problem on a Debian Jessie VM [virtio] (KVM host is Jessie as well).

The following bug might be related: https://bugzilla.kernel.org/show_bug.cgi?id=104571

I'm now trying "cache=none" for this VM and will see how that turns out.

On 10/13/2015 11:50 AM, Jan Wagner wrote:
> I got also hit with the problem on Debian Jessie VM [virtio] (KVM host
> is Jessie as well).
>
> The following bug might be related:
> https://bugzilla.kernel.org/show_bug.cgi?id=104571
>
> I'm trying now '"cache=none"' for this VM and see how that turns out.

FYI, all my VMs use cache=none and have small HDDs (LVs of 2-4G), yet they
regularly trip on this bug.

Simon

LouieGosselin (0-ubunbu-d) wrote :

It happened here again.

8/24 ext4 corruption
9/14 ext4 corruption
9/29 update/reboot
10/16 ext4 corruption

This time the corruption was severe. 1743 files from multiple directories got moved into lost+found.
It took me almost 2 hours this morning to verify & fix everything. Fortunately every time this has happened, all the files were dated prior to the daily backup and "diff -qr ..." shows exactly what was lost.

As far as I can tell this is not a memory issue and the ext4 FS is using 10G out of 100G.

Every time the corruption has been in /var/mail. However the VM is mostly used for mail so it may not be significant. The /var/mail branch itself is 4.3G

3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux

I'm holding back the kernels on other VM's so this is the only VM with the problem.

Is anyone able to reproduce this on demand?

I really need to do something because this is causing downtime during normal business hours. I'll probably try one of the following:
1. Rebuild the FS from scratch and see if ext4 corruption continues.
2. Use ext3 or something else.

Jan Wagner (waja) wrote :

On 10/13/2015 06:00 PM, Simon Déziel wrote:
> On 10/13/2015 11:50 AM, Jan Wagner wrote:
>> I'm trying now '"cache=none"' for this VM and see how that turns out.
>
> FYI, all my VMs use cache=none and have small HDDs (lv of 2-4G) yet the
> regularly trip on this bug.

Looks like this didn't fix the problem for me.

LouieGosselin (0-ubunbu-d) wrote :

It's happened again. I've spent several hours on this and I've been able to recreate the failure under some synthetic conditions with a sacrificial VM.

The filebench defaults do not cause an ext4 crash for me, but the following do:

load workloads/fileserver
set $dir=/tmp/
set $nfiles=200000
set $meandirwidth=30000
run 120

The ext4 error never happens in filebench's init phase, only 50s or so into the 50-thread run phase. Less extreme settings won't produce a consistent crash.

Reducing the amount of free memory makes the errors much more likely.

This is before running filebench:
             total   used   free  shared  buffers  cached
Mem:          482M    99M   382M    300K      27M     20M
-/+ buffers/cache:    52M   429M
Swap:         1.9G    94M   1.8G

This is while running filebench one second before the crash:
             total   used   free  shared  buffers  cached
Mem:          482M   476M   5.6M    284K      27M     18M
-/+ buffers/cache:   430M    51M
Swap:         1.9G   253M   1.6G
2769.63

The error is reproducible in cloned VMs.

Moving swap to another disk changes nothing.

As far as I can tell, the error never happens with ext4 filesystems other than the root FS where executables are running from.

I've tried bonnie, stress-ng, and simple scripts, but I have not been able to get these to crash ext4.

The sacrificial VM has not crashed after adding an extra 500MB to it.

Although production was never under such heavy loads, I've added 500MB to the production VM to see if it helps anyways.

LouieGosselin (0-ubunbu-d) wrote :

I'm posting again to add that I conducted some more tests and ext3 does not encounter corruption under the same conditions. I hope this information is helpful to others, if anyone needs more information let me know and I'll see what I can do. I'll probably switch my own VMs to ext3 so I don't have to worry about these FS crashes.

Chris J Arges (arges) wrote :

Louie,

Can you post some additional information that may help in debugging this issue?
1) Can you post the output of 'virsh dumpxml <vm_domain>' of the affected VM?
2) Can you post the /boot/config-`uname -r` of the affected VM's kernel?
3) What type of partitioning layout does your VM have?

It seems that it has been reproducible in 3.13 to 3.16 in this report, and the upstream report shows 3.19.

I tried this in a local VM with 32GB disk, 2 cpu, 512MB memory, swapfile and could not trivially reproduce with the filebench test case you mentioned above.

--chris

Jan Wagner (waja) wrote :

On 29.10.15 at 03:29, LouieGosselin wrote:
> I'm posting again to add that I conducted some more tests and ext3 does
> not encounter corruption under the same conditions. I hope this
> information is helpful to others, if anyone needs more information let
> me know and I'll see what I can do. I'll probably switch my own VMs to
> ext3 so I don't have to worry about these FS crashes.

I created a new LVM volume and a new ext3 FS. I copied the whole FS from
ext4 to ext3 and am now running the VM on ext3.

Today the ext4_mb_generate_buddy bug hit me again, with the ext3 FS and
with the VM running with the "cache=none" setting.

I'm running the LoCo mirror for my country and this has been affecting our file server for over a week already.
All our official file services are offline now.
We're also providing free VM/hosting for open source projects; I can't imagine how bad the disaster could become.

Any suggestions?

All VMs are VMware guests.

root@ncnu-ftp:/ubuntu/mirror# LANG=C free -m
             total   used   free  shared  buffers  cached
Mem:          7984   1453   6531       0       66     352
-/+ buffers/cache:   1033   6950
Swap:         4091      0   4091

root@ncnu-ftp:/ubuntu/mirror# LANG=C touch /lala
touch: cannot touch '/lala': Read-only file system

root@ncnu-ftp:/ubuntu/mirror# LANG=C mount -o remount,rw /
mount: cannot remount block device /dev/mapper/ub--ftp-root read-write, is write-protected

root@ncnu-ftp:/ubuntu/mirror# dumpe2fs /dev/mapper/ub--ftp-root /dev/|grep -i Filesystem\ state
dumpe2fs 1.42.9 (4-Feb-2014)
Filesystem state: clean with errors

root@ncnu-ftp:/ubuntu/mirror# LC_ALL=C dmesg -T |grep error
[Thu Nov 5 00:35:46 2015] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[Thu Nov 5 00:35:55 2015] EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 371, block bitmap and bg descriptor inconsistent: 25478 vs 25477 free clusters
[Thu Nov 5 00:35:55 2015] IP: [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov 5 00:35:55 2015] RIP: 0010:[<ffffffff8125e511>] [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov 5 00:35:55 2015] RIP [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov 5 00:35:55 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov 5 00:40:39 2015] EXT4-fs (dm-0): error count since last fsck: 15
[Thu Nov 5 00:40:39 2015] EXT4-fs (dm-0): initial error at time 1445904785: ext4_reserve_inode_write:4928
[Thu Nov 5 00:40:39 2015] EXT4-fs (dm-0): last error at time 1446654956: ext4_remount:4816
[Thu Nov 5 01:18:38 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov 5 01:18:59 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov 5 01:28:59 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov 5 01:58:08 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov 5 02:20:41 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user

root@ncnu-ftp:/ubuntu/mirror# uname -a
Linux ncnu-ftp 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

root@ncnu-ftp:/ubuntu/mirror# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
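
The `initial error at time` and `last error at time` values in the dmesg output above are Unix epoch seconds. A minimal Python sketch (the function name `ext4_error_time` is made up for illustration) converts them to UTC so they can be compared with the `dmesg -T` timestamps:

```python
from datetime import datetime, timezone


def ext4_error_time(epoch_seconds: int) -> str:
    """Render an ext4 'error at time N' value (Unix epoch seconds) as UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


# Values copied from the dmesg output quoted above.
print(ext4_error_time(1445904785))  # initial error: 2015-10-27 00:13:05 UTC
print(ext4_error_time(1446654956))  # last error:    2015-11-04 16:35:56 UTC
```

The two values are a little under nine days apart, which fits the remark that the file server had been affected for over a week.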

Jan Wagner (waja) on 2015-11-12
Changed in linux (Debian):
importance: Undecided → Unknown
status: New → Unknown
affects: linux-kernel-headers → linux
Jan Wagner (waja) wrote :
Changed in linux (Debian):
status: Unknown → Confirmed
LouieGosselin (0-ubunbu-d) wrote :

chris,

Here is what you asked for, sorry for not getting it earlier.

I don't use virsh. This is how I started KVM to trigger the problem interactively (curses interface):

kvm -drive file=/dev/raid/shared,media=disk,if=none,cache=none,aio=native,format=raw,id=hd0 -device virtio-blk-pci,drive=hd0 -smp 2 -m 1000 -netdev tap,ifname=vm_shared,script=no,downscript=no,id=eth0 -device virtio-net-pci,netdev=eth0,mac=52:54:00:12:34:58 -name shared -runas shared -curses

fdisk -l
Disk /dev/vda: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: AE4BDF3E-0B83-4C17-B104-A5139722F263

Device       Start       End   Sectors  Size Type
/dev/vda1     2048   3905535   3903488  1.9G Linux swap
/dev/vda2  3905536 209713151 205807616 98.1G Linux filesystem

It hasn't happened in this particular VM since upping the RAM so the VM doesn't swap.

My intention was to reproduce on non-production hardware, and then try different kernels, rule out LVM, virtio, etc. But I'm in the middle of a new assignment, I probably won't have time to do this myself before December.

LouieGosselin (0-ubunbu-d) wrote :

Oops...the above kvm command line is correct but it did not crash with -m 1000, that's what production is using now.
It was crashing consistently with -m 512 about a minute into the synthetic FS load.

Udo Giacomozzi (udo-launchpad) wrote :

FYI, the same errors happened to me on real hardware (a Raspberry Pi 2 B), based on a custom Debian Jessie image.

The image uses an ext2 filesystem created on an x86 Boot2docker host (Kernel "Linux fbd0c1340061 4.1.12-boot2docker #1 SMP Tue Nov 3 06:03:36 UTC 2015 x86_64 GNU/Linux").

While remounting the filesystem as r-w on the Raspberry and writing to it I got:

[53406.370524] EXT4-fs (mmcblk0p3): mounting ext2 file system using the ext4 subsystem
[53406.575883] EXT4-fs (mmcblk0p3): mounted filesystem without journal. Opts: (null)
[53416.394833] EXT4-fs error (device mmcblk0p3): ext4_mb_generate_buddy:757: group 1, block bitmap and bg descriptor inconsistent: 8953 vs 8990 free clusters
[53435.245967] EXT4-fs error (device mmcblk0p3): ext4_lookup:1417: inode #2: comm rsync: deleted inode referenced: 46849
[53481.805006] EXT4-fs error (device mmcblk0p3): ext4_lookup:1417: inode #2: comm ls: deleted inode referenced: 46849

The Raspi isn't running the most current kernel yet (it's "Linux intermodul 3.19.3-v7 #1 SMP PREEMPT Mon Nov 30 08:37:00 UTC 2015 armv7l GNU/Linux"), but perhaps this helps in analyzing this bug as it's not a VM...
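
For anyone comparing reports, the `ext4_mb_generate_buddy` lines quoted in this thread can be pulled apart with a short Python sketch. The regular expression is based only on the messages pasted here, and the names `BUDDY_RE` and `parse_buddy_error` are illustrative. It assumes the two numbers follow the order of the message text (block bitmap count first, group descriptor count second).

```python
import re

# Pattern derived from the log lines quoted in this report; other kernel
# versions may word the message differently.
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), block bitmap and bg descriptor inconsistent: "
    r"(?P<bitmap>\d+) vs (?P<desc>\d+) free clusters"
)


def parse_buddy_error(line: str):
    """Return the fields of a buddy-mismatch line, or None if it does not match."""
    m = BUDDY_RE.search(line)
    if m is None:
        return None
    bitmap = int(m.group("bitmap"))
    desc = int(m.group("desc"))
    return {
        "device": m.group("dev"),
        "group": int(m.group("group")),
        "bitmap_free": bitmap,      # assumed: count derived from the block bitmap
        "descriptor_free": desc,    # assumed: count stored in the group descriptor
        "delta": bitmap - desc,
    }


line = ("[53416.394833] EXT4-fs error (device mmcblk0p3): ext4_mb_generate_buddy:757: "
        "group 1, block bitmap and bg descriptor inconsistent: 8953 vs 8990 free clusters")
print(parse_buddy_error(line))
```

The Raspberry Pi line above is off by 37 clusters, whereas the VM reports elsewhere in this bug are off by exactly one.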

nunyaz info (project1750) wrote :

I am able to reproduce this bug every single time I suspend and then resume one of my laptops, which is running Xubuntu 15.10 with the 4.2.0-16-generic kernel on a Lenovo R400. I suspect it is possibly a hardware problem, as this only happens on 1 out of the 3 R400s that I have. I'll report back if I remember.

This bug is probably what I describe in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502

Bug in Debian 3.16 kernel (kernel from current stable Jessie) used as KVM hypervisor on older Intel CPU.

Simon Déziel (sdeziel) on 2016-03-21
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Simon Déziel (sdeziel) wrote :

Thanks Václav, your conclusion about older Intel CPU seems to match my setup since this only happens on a Xeon E3110 (which is in fact a re-branded Core2 Duo E8400).

Thanks for bisecting this and figuring out that the fix was:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424
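
The Trusty changelog lists this commit as "KVM: x86: bit-ops emulation ignores offset on 64-bit". For x86 bit-test instructions (BT, BTS, BTR, BTC) with a memory destination and a register bit offset, the offset can select a word other than the one at the base address. The Python sketch below models only that address arithmetic. It is not the kernel's emulator code, the function names are illustrative, and the "offset ignored" variant is a hypothetical model of the behaviour the commit title describes. How a missed offset leads to the specific ext4 errors in this report is not established in this thread.

```python
def bit_operand(base_addr: int, bit_offset: int, operand_bytes: int = 8):
    """Return (effective_address, bit_index) for a bit-op on a memory operand.

    bit_offset is signed. Python's floor division and modulo round toward
    negative infinity, which is what negative offsets require.
    """
    bits = operand_bytes * 8
    word = bit_offset // bits
    return base_addr + word * operand_bytes, bit_offset % bits


def bit_operand_offset_ignored(base_addr: int, bit_offset: int, operand_bytes: int = 8):
    """Hypothetical model of an emulator that drops the word part of the offset."""
    return base_addr, bit_offset % (operand_bytes * 8)


print(bit_operand(0x1000, 70))                 # (4104, 6): bit 6 of the second 64-bit word
print(bit_operand_offset_ignored(0x1000, 70))  # (4096, 6): same bit index, wrong word
```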

Emmanuel Lacour (elacour) wrote :

Here we have this problem on two servers running "Intel(R) Xeon(R) CPU X5460 @ 3.16GHz" (with the Debian 7.0 backport kernel 3.16 on the host and 4.x on the VM). We don't seem to have this problem on other nodes running:

Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
Intel(R) Xeon(R) CPU X3470 @ 2.93GHz

We are going to try an upgrade to see if it solves the problem. If it does, that's pretty good news!

LouieGosselin (0-ubunbu-d) wrote :

Good work.

We also have PE2950III systems running "Intel(R) Xeon(R) CPU X5460 @ 3.16GHz".

If this is indeed the fix, I'm confused why it would only affect certain cpus?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424

I'll have to come up with a plan to replace debian's stable/jessie kernel with an unmanaged one on the host. I'm not keen on doing that as the DRAC units on these are not very reliable...

Chris J Arges (arges) wrote :

Here's a test build for trusty 3.13 with https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 applied:

http://people.canonical.com/~arges/lp1423672/

Can someone verify this does fix the issue so this can be SRU'ed into 3.13?

Leonardo Borda (lborda) wrote :

Hi Chris,

This is also seen on kernel 3.16.0-51-generic. Could you get us a test kernel build for 3.16 as well?

Thank you
Leo

Chris J Arges (arges) wrote :

Leo,
Also uploaded a lts-utopic build here:
http://people.canonical.com/~arges/lp1423672/

It will be all the debs with '3.16' in the name.

Changed in linux-lts-utopic (Ubuntu):
status: New → Invalid
Simon Déziel (sdeziel) wrote :

Thanks Chris. So far, no regression with the Trusty kernel:

$ uname -a
Linux xeon 3.13.0-86-generic #130~lp1423672v201604200743 SMP Wed Apr 20 12:44:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Since the issue only happens rarely, more people testing it would be welcome.

Chris J Arges (arges) on 2016-04-20
Changed in linux (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
Changed in linux-lts-utopic (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
Chris J Arges (arges) on 2016-04-21
description: updated
Tim Gardner (timg-tpi) on 2016-04-21
Changed in linux (Ubuntu Trusty):
status: Confirmed → Fix Committed
Changed in linux-lts-utopic (Ubuntu Trusty):
status: New → Fix Committed
Simon Déziel (sdeziel) wrote :

I recently reinstalled the affected host to run Xenial so I can no longer test the proposed fix for the 3.13 kernel.

Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Kamal Mostafa (kamalmostafa) wrote :

Anyone affected by this issue: We're looking for a positive verification that the Trusty kernel version currently available in -proposed (3.13.0-87.132) fixes this problem. If you can provide that confirmation, please do!

LouieGosselin (0-ubunbu-d) wrote :

I have a "PE 2950III Intel(R) Xeon(R) CPU X5460 @ 3.16GHz" server here and I've been trying to test this out. I'm using an "rsync" copy of an original server exhibiting the problem. So far though I've been unable to reproduce the original error at all.

It would seem that using the exact same OS/kernel/binaries, the error doesn't happen on a fresh filesystem, I guess there must have been something about the filesystem image itself that triggered the fault. So my dilemma is that I don't know how to reproduce this fault on a fresh install. So while I can test this update, I'm not sure how valid the test will be on an installation that isn't faulting.

Does anyone have a suggestion or have an idea about how to reproduce the conditions?

Simon Déziel (sdeziel) wrote :

Louie, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502#22 provides a simple way to reproduce. If you could give it a try that would be appreciated.

FYI, when my hypervisor was running Trusty (3.13), the problem was reproducible on fresh VMs with brand new ext4 FSes, so hopefully that will be easy to reproduce for you.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-87.133

---------------
linux (3.13.0-87.133) trusty; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1585315

  [ Upstream Kernel Changes ]

  * Revert "usb: hub: do not clear BOS field during reset device"
    - LP: #1582864

linux (3.13.0-87.132) trusty; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1582398

  [ Kamal Mostafa ]

  * [Config] Drop ozwpan from the ABI

  [ Luis Henriques ]

  * [Config] CONFIG_USB_WPAN_HCD=n
    - LP: #1463740
    - CVE-2015-4004

  [ Prarit Bhargava ]

  * SAUCE: (no-up) ACPICA: Dispatcher: Update thread ID for recursive
    method calls
    - LP: #1577898

  [ Upstream Kernel Changes ]

  * usbnet: cleanup after bind() in probe()
    - LP: #1567191
    - CVE-2016-3951
  * KVM: x86: bit-ops emulation ignores offset on 64-bit
    - LP: #1423672
  * USB: usbip: fix potential out-of-bounds write
    - LP: #1572666
    - CVE-2016-3955
  * x86/mm/32: Enable full randomization on i386 and X86_32
    - LP: #1568523
    - CVE-2016-3672
  * Input: gtco - fix crash on detecting device without endpoints
    - LP: #1575706
    - CVE-2016-2187
  * atl2: Disable unimplemented scatter/gather feature
    - LP: #1561403
    - CVE-2016-2117
  * ALSA: usb-audio: Skip volume controls triggers hangup on Dell USB Dock
    - LP: #1577905
  * fs/pnode.c: treat zero mnt_group_id-s as unequal
    - LP: #1572316
  * propogate_mnt: Handle the first propogated copy being a slave
    - LP: #1572316
  * drm: Balance error path for GEM handle allocation
    - LP: #1579610
  * x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
    - LP: #1538429
    - CVE-2016-2069
  * x86/mm: Improve switch_mm() barrier comments
    - LP: #1538429
    - CVE-2016-2069
  * net: fix infoleak in llc
    - LP: #1578496
    - CVE-2016-4485
  * net: fix infoleak in rtnetlink
    - LP: #1578497
    - CVE-2016-4486

 -- Kamal Mostafa <email address hidden> Tue, 24 May 2016 11:04:30 -0700

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-lts-utopic - 3.16.0-73.95~14.04.1

---------------
linux-lts-utopic (3.16.0-73.95~14.04.1) trusty; urgency=low

  [ Kamal Mostafa ]

  * CVE-2016-1583 (LP: #1588871)
    - ecryptfs: fix handling of directory opening
    - SAUCE: proc: prevent stacking filesystems on top
    - SAUCE: ecryptfs: forbid opening files without mmap handler

 -- Andy Whitcroft <email address hidden> Thu, 09 Jun 2016 08:46:24 +0100

Changed in linux-lts-utopic (Ubuntu Trusty):
status: Fix Committed → Fix Released
Changed in linux (Debian):
status: Confirmed → Fix Released
tags: added: verification-done-trusty
removed: verification-needed-trusty
chenyuchai (chenyuchai) wrote :

Hi,all:
  I need your confirmation. Is the fix https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 what solves this issue? If not, please show the patch. Thank you very much!

Simon Déziel (sdeziel) wrote :

Chenyuchai, yes, https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 is the fix. It was integrated in the Trusty kernel 3.13.0-87.133
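
A quick way to tell whether a Trusty 3.13 kernel is at or past that release is to compare the ABI number in `uname -r`. The following Python sketch assumes Ubuntu-style release strings such as `3.13.0-86-generic`; `kernel_abi` and `trusty_has_fix` are illustrative names, and the threshold comes only from the version quoted above. Since the fix is a KVM change, the check presumably matters on the hypervisor rather than in the guest.

```python
import re

# ABI taken from "3.13.0-87.133" as quoted in this report.
FIXED_TRUSTY = (3, 13, 0, 87)


def kernel_abi(release: str):
    """Parse an Ubuntu-style `uname -r` string into (major, minor, patch, abi)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def trusty_has_fix(release: str) -> bool:
    """True if a Trusty 3.13 kernel release is at or past the fixed ABI."""
    version = kernel_abi(release)
    if version[:3] != (3, 13, 0):
        raise ValueError("only the Trusty 3.13 series is covered by this check")
    return version >= FIXED_TRUSTY


print(trusty_has_fix("3.13.0-86-generic"))  # False
print(trusty_has_fix("3.13.0-87-generic"))  # True
```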
