ext4_mb_generate_buddy:756: group N, block bitmap and bg descriptor inconsistent: X vs Y

Bug #1423672 reported by Simon Déziel on 2015-02-19

This bug affects 16 people

Affects	Status	Importance	Assigned to
Linux	Confirmed	Medium	linux-kernel-bugs #104571
linux (Debian)	Fix Released	Unknown	debbugs #772848
linux (Ubuntu)	Confirmed	High	Unassigned
Trusty	Fix Released	High	Chris J Arges
linux-lts-utopic (Ubuntu)	Invalid	Undecided	Unassigned
Trusty	Fix Released	Undecided	Chris J Arges

Bug Description

SRU Justification:

    Impact: Users of VMs running 3.13/3.16 and ext4 can experience data corruption in the guest.
    Fix: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424
    Testcase: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502#22

I noticed that one of my VMs had this "dmesg -T" output:

[Tue Feb 17 09:53:27 2015] systemd-udevd[5433]: starting version 204
[Thu Feb 19 06:25:08 2015] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 16446 vs 16445 free clusters
[Thu Feb 19 06:25:09 2015] Aborting journal on device vda1-8.
[Thu Feb 19 06:25:09 2015] EXT4-fs (vda1): Remounting filesystem read-only
[Thu Feb 19 06:25:09 2015] ------------[ cut here ]------------
[Thu Feb 19 06:25:09 2015] WARNING: CPU: 0 PID: 9946 at /build/buildd/linux-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a2/0x1c0()
[Thu Feb 19 06:25:09 2015] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_filter ip_tables x_tables serio_raw psmouse floppy
[Thu Feb 19 06:25:09 2015] CPU: 0 PID: 9946 Comm: logrotate Not tainted 3.13.0-45-generic #74-Ubuntu
[Thu Feb 19 06:25:09 2015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[Thu Feb 19 06:25:09 2015] 0000000000000009 ffff880003a11aa0 ffffffff81720eb6 0000000000000000
[Thu Feb 19 06:25:09 2015] ffff880003a11ad8 ffffffff810677cd ffff880000c41340 0000000000000000
[Thu Feb 19 06:25:09 2015] ffff88000a58e000 ffffffff81835280 0000000000001302 ffff880003a11ae8
[Thu Feb 19 06:25:09 2015] Call Trace:
[Thu Feb 19 06:25:09 2015] [<ffffffff81720eb6>] dump_stack+0x45/0x56
[Thu Feb 19 06:25:09 2015] [<ffffffff810677cd>] warn_slowpath_common+0x7d/0xa0
[Thu Feb 19 06:25:09 2015] [<ffffffff810678aa>] warn_slowpath_null+0x1a/0x20
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e862>] __ext4_handle_dirty_metadata+0x1a2/0x1c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81246a5a>] ? ext4_dirty_inode+0x2a/0x60
[Thu Feb 19 06:25:09 2015] [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[Thu Feb 19 06:25:09 2015] [<ffffffff810aacc5>] ? wake_up_bit+0x25/0x30
[Thu Feb 19 06:25:09 2015] [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[Thu Feb 19 06:25:09 2015] [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[Thu Feb 19 06:25:09 2015] [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d8f60>] evict+0xb0/0x1b0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d9775>] iput+0xf5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4698>] __dentry_kill+0x1a8/0x200
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4795>] dput+0xa5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf7e6>] __fput+0x176/0x260
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf91e>] ____fput+0xe/0x10
[Thu Feb 19 06:25:09 2015] [<ffffffff810882f7>] task_work_run+0xa7/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81013ed7>] do_notify_resume+0x97/0xb0
[Thu Feb 19 06:25:09 2015] [<ffffffff81731c2a>] int_signal+0x12/0x17
[Thu Feb 19 06:25:09 2015] ---[ end trace ebff9843d81b5c41 ]---
[Thu Feb 19 06:25:09 2015] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[Thu Feb 19 06:25:09 2015] IP: [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] PGD 167067 PUD 161067 PMD 0
[Thu Feb 19 06:25:09 2015] Oops: 0000 [#1] SMP
[Thu Feb 19 06:25:09 2015] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_filter ip_tables x_tables serio_raw psmouse floppy
[Thu Feb 19 06:25:09 2015] CPU: 0 PID: 9946 Comm: logrotate Tainted: G W 3.13.0-45-generic #74-Ubuntu
[Thu Feb 19 06:25:09 2015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[Thu Feb 19 06:25:09 2015] task: ffff880009ac4800 ti: ffff880003a10000 task.ti: ffff880003a10000
[Thu Feb 19 06:25:09 2015] RIP: 0010:[<ffffffff8125d4c1>] [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] RSP: 0000:ffff880003a11a58 EFLAGS: 00010292
[Thu Feb 19 06:25:09 2015] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000086
[Thu Feb 19 06:25:09 2015] RDX: 0000000000001302 RSI: ffffffff81a6e81f RDI: 0000000000000000
[Thu Feb 19 06:25:09 2015] RBP: ffff880003a11ae8 R08: ffffffff81a78568 R09: 0000000000000005
[Thu Feb 19 06:25:09 2015] R10: 00000000ffffffe2 R11: ffff880003a117ce R12: 0000000000000086
[Thu Feb 19 06:25:09 2015] R13: ffffffff81835280 R14: 0000000000001302 R15: ffffffff81a78568
[Thu Feb 19 06:25:09 2015] FS: 00007f74eaca4840(0000) GS:ffff88000b800000(0000) knlGS:0000000000000000
[Thu Feb 19 06:25:09 2015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Thu Feb 19 06:25:09 2015] CR2: 0000000001de4000 CR3: 0000000009b45000 CR4: 00000000000006f0
[Thu Feb 19 06:25:09 2015] Stack:
[Thu Feb 19 06:25:09 2015] ffff880003a11a60 0000000000000103 ebff9843d81b5c41 000000000000321e
[Thu Feb 19 06:25:09 2015] 00000000000014d8 0000000000000092 000000000000020e ffff88000a58e000
[Thu Feb 19 06:25:09 2015] ffff880003a11ae8 ffffffff8126e372 ffffffff810677df ffff880000c41340
[Thu Feb 19 06:25:09 2015] Call Trace:
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e372>] ? ext4_journal_abort_handle+0x42/0xc0
[Thu Feb 19 06:25:09 2015] [<ffffffff810677df>] ? warn_slowpath_common+0x8f/0xa0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126e7cf>] __ext4_handle_dirty_metadata+0x10f/0x1c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[Thu Feb 19 06:25:09 2015] [<ffffffff810aacc5>] ? wake_up_bit+0x25/0x30
[Thu Feb 19 06:25:09 2015] [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[Thu Feb 19 06:25:09 2015] [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[Thu Feb 19 06:25:09 2015] [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[Thu Feb 19 06:25:09 2015] [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[Thu Feb 19 06:25:09 2015] [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d8f60>] evict+0xb0/0x1b0
[Thu Feb 19 06:25:09 2015] [<ffffffff811d9775>] iput+0xf5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4698>] __dentry_kill+0x1a8/0x200
[Thu Feb 19 06:25:09 2015] [<ffffffff811d4795>] dput+0xa5/0x180
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf7e6>] __fput+0x176/0x260
[Thu Feb 19 06:25:09 2015] [<ffffffff811bf91e>] ____fput+0xe/0x10
[Thu Feb 19 06:25:09 2015] [<ffffffff810882f7>] task_work_run+0xa7/0xe0
[Thu Feb 19 06:25:09 2015] [<ffffffff81013ed7>] do_notify_resume+0x97/0xb0
[Thu Feb 19 06:25:09 2015] [<ffffffff81731c2a>] int_signal+0x12/0x17
[Thu Feb 19 06:25:09 2015] Code: 48 89 e5 41 57 4d 89 c7 41 56 41 89 d6 41 55 49 89 f5 48 c7 c6 1f e8 a6 81 41 54 49 89 cc 53 48 89 fb 48 83 ec 68 4c 89 4c 24 60 <48> 8b 47 28 48 8b 57 40 48 8b 80 f8 02 00 00 48 8b 40 68 89 90
[Thu Feb 19 06:25:09 2015] RIP [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[Thu Feb 19 06:25:09 2015] RSP <ffff880003a11a58>
[Thu Feb 19 06:25:09 2015] CR2: 0000000000000028
[Thu Feb 19 06:25:10 2015] ---[ end trace ebff9843d81b5c42 ]---

cron.daily jobs fired at 6:25:01 apparently:

# tail -n2 /var/log/syslog
Feb 19 06:17:01 git CRON[9848]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 19 06:25:01 git CRON[9853]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))

# run-parts --test /etc/cron.daily
/etc/cron.daily/apt
/etc/cron.daily/autoremove
/etc/cron.daily/dpkg
/etc/cron.daily/hdd-backup
/etc/cron.daily/logrotate
/etc/cron.daily/passwd
/etc/cron.daily/upstart

It seems like all the jobs ran and the upstart one somehow triggered the crash:

# ls -alt /var/log/upstart/ | head
total 272
drwxrwxr-x 5 root syslog 4096 Feb 19 06:25 ..
drwxr-xr-x 2 root root 4096 Feb 14 06:25 .
-rw-r----- 1 root root 591 Feb 12 12:26 ureadahead.log.1.gz
-rw-r----- 1 root root 178 Feb 12 12:25 mountall.log.1.gz

Now that I have collected some information (sorry, I don't have ubuntu-bug installed on the VM) I'll reboot it and see how it goes.

More information on the VM:

# lsb_release -rd
Description: Ubuntu 14.04.2 LTS
Release: 14.04
# apt-cache policy linux-image-3.13.0-45-generic
linux-image-3.13.0-45-generic:
  Installed: 3.13.0-45.74
  Candidate: 3.13.0-45.74
  Version table:
*** 3.13.0-45.74 0
        500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Feb 19 14:34 seq
crw-rw---- 1 root audio 116, 33 Feb 19 14:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: [Errno 2] No such file or directory
CRDA: Error: [Errno 2] No such file or directory
CurrentDmesg:
[ 13.891047] init: console-font main process (855) terminated with status 71
[ 13.952825] init: plymouth-splash main process (870) terminated with status 1
[ 217.853139] random: nonblocking pool is initialized
DistroRelease: Ubuntu 14.04
IwConfig: Error: [Errno 2] No such file or directory
Lspci: Error: [Errno 2] No such file or directory
Lsusb: Error: [Errno 2] No such file or directory
MachineType: QEMU Standard PC (i440FX + PIIX, 1996)
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
TERM=xterm
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=cb9cbdad-c668-4503-85db-fcf9b02f3495 ro console=tty0 console=ttyS0,38400
ProcVersionSignature: Ubuntu 3.13.0-45.74-generic 3.13.11-ckt13
RelatedPackageVersions:
linux-restricted-modules-3.13.0-45-generic N/A
linux-backports-modules-3.13.0-45-generic N/A
linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-45-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 01/01/2011
dmi.bios.vendor: Bochs
dmi.bios.version: Bochs
dmi.chassis.type: 1
dmi.chassis.vendor: Bochs
dmi.modalias: dmi:bvnBochs:bvrBochs:bd01/01/2011:svnQEMU:pnStandardPC(i440FX+PIIX,1996):pvrpc-i440fx-2.0:cvnBochs:ct1:cvr:
dmi.product.name: Standard PC (i440FX + PIIX, 1996)
dmi.product.version: pc-i440fx-2.0
dmi.sys.vendor: QEMU

Simon Déziel (sdeziel) wrote on 2015-02-19:

dmesg -T (44.1 KiB, text/plain)

Simon Déziel (sdeziel) wrote on 2015-02-19:

/proc/version_signature (42 bytes, text/plain)

Brad Figg (brad-figg) wrote on 2015-02-19: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1423672

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete
tags:	added: trusty

Simon Déziel (sdeziel) wrote on 2015-02-19: BootDmesg.txt

BootDmesg.txt (28.8 KiB, text/plain)

apport information

tags:	added: apport-collected
description:	updated

Simon Déziel (sdeziel) wrote on 2015-02-19: ProcCpuinfo.txt

ProcCpuinfo.txt (662 bytes, text/plain)

apport information

Simon Déziel (sdeziel) wrote on 2015-02-19: ProcInterrupts.txt

ProcInterrupts.txt (1.4 KiB, text/plain)

apport information

Simon Déziel (sdeziel) wrote on 2015-02-19: ProcModules.txt

ProcModules.txt (937 bytes, text/plain)

apport information

Simon Déziel (sdeziel) wrote on 2015-02-19: UdevDb.txt

UdevDb.txt (53.5 KiB, text/plain)

apport information

Simon Déziel (sdeziel) wrote on 2015-02-19: UdevLog.txt

UdevLog.txt (130.9 KiB, text/plain)

apport information

Simon Déziel (sdeziel) wrote on 2015-02-19: WifiSyslog.txt

#10

WifiSyslog.txt (42.2 KiB, text/plain)

apport information

Joseph Salisbury (jsalisbury) wrote on 2015-02-19: Re: ext4 turned read-only following a cronjob run

#11

Do you have a way to reproduce this bug, or was it a one time event?

Changed in linux (Ubuntu):
importance:	Undecided → Medium
tags:	added: kernel-da-key

Simon Déziel (sdeziel) wrote on 2015-02-19:

#12

It was a one time event. I'll report here if/when this happens again.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Simon Déziel (sdeziel) wrote on 2015-03-02:

#13

This problem just occurred on *another* VM. Again, right after the cron.daily run.

Joseph Salisbury (jsalisbury) wrote on 2015-03-02:

#14

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.0 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-rc1-vivid/

Simon Déziel (sdeziel) on 2015-03-04

summary:

- ext4 turned read-only following a cronjob run
+ ext4 turned read-only during logrotate daily run

Simon Déziel (sdeziel) wrote on 2015-03-04: Re: ext4 turned read-only during logrotate daily run

#15

This occurred on another VM. I guess I will have to roll out the 4.0-rc2 kernel everywhere.

Joseph Salisbury (jsalisbury) wrote on 2015-03-04:

#16

It would be good to know if v4.0-rc2 fixes this. If it does, we can look at the git logs to see what may be the fix, or perform a "Reverse" bisect.

Joseph Salisbury (jsalisbury) wrote on 2015-03-04:

#17

Also, did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of a regression, and when this regression was introduced. If this is a regression, we could also perform a kernel bisect to identify the commit that introduced the problem, if it's not fixed in 4.0.

Simon Déziel (sdeziel) wrote on 2015-03-04:

#18

The first occurrence of this problem goes back to Feb 17th. While I keep all the VMs up to date on a daily basis, I now have doubts about the hypervisor's RAM (not ECC by the way). I brought all the VMs offline to have their FSes checked (all ext4). Several of them needed fixing by fsck. Since they are all part of a mdadm RAID1 array with 2 devices, maybe there was some bit flipping on one of the drives.

Before losing anyone's time on this issue, I'll first make sure the hypervisor's RAM is sane. So I'll get back after some memtest86+ then I'll look at the 4.0 kernel.

Thanks Joseph

Mark Deneen (mdeneen) wrote on 2015-03-09:

#19

This happened to me here as well. Also in a VM.

[   42.042806] ------------[ cut here ]------------
[   42.042812] WARNING: CPU: 0 PID: 617 at /build/buildd/linux-3.13.0/fs/ext4/ext4_jbd2.c:259 __ext4_handle_dirty_metadata+0x1a2/0x1c0()
[   42.042813] Modules linked in: kvm_intel kvm cirrus snd_hda_intel ttm snd_hda_codec snd_hwdep drm_kms_helper serio_raw snd_pcm lp parport snd_page_alloc snd_timer drm snd soundcore syscopyarea sysfillrect sysimgblt i2c_piix4 mac_hid psmouse floppy pata_acpi
[   42.042826] CPU: 0 PID: 617 Comm: mv Not tainted 3.13.0-45-generic #74-Ubuntu
[   42.042828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   42.042829]  0000000000000009 ffff88007acd3ad0 ffffffff81720eb6 0000000000000000
[   42.042831]  ffff88007acd3b08 ffffffff810677cd ffff88007bf53d68 0000000000000000
[   42.042833]  ffff88007b5c4000 ffffffff81835280 0000000000001302 ffff88007acd3b18
[   42.042835] Call Trace:
[   42.042839]  [<ffffffff81720eb6>] dump_stack+0x45/0x56
[   42.042842]  [<ffffffff810677cd>] warn_slowpath_common+0x7d/0xa0
[   42.042844]  [<ffffffff810678aa>] warn_slowpath_null+0x1a/0x20
[   42.042846]  [<ffffffff8126e862>] __ext4_handle_dirty_metadata+0x1a2/0x1c0
[   42.042850]  [<ffffffff81246a5a>] ? ext4_dirty_inode+0x2a/0x60
[   42.042853]  [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[   42.042855]  [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[   42.042857]  [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[   42.042859]  [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[   42.042861]  [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[   42.042863]  [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[   42.042865]  [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[   42.042867]  [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[   42.042870]  [<ffffffff811d8f60>] evict+0xb0/0x1b0
[   42.042872]  [<ffffffff811d9775>] iput+0xf5/0x180
[   42.042875]  [<ffffffff811ce09e>] do_unlinkat+0x18e/0x2b0
[   42.042878]  [<ffffffff8172d2ca>] ? do_page_fault+0x1a/0x70
[   42.042880]  [<ffffffff8172c949>] ? do_async_page_fault+0x29/0xe0
[   42.042882]  [<ffffffff811cf006>] SyS_unlink+0x16/0x20
[   42.042884]  [<ffffffff8173196d>] system_call_fastpath+0x1a/0x1f
[   42.042885] ---[ end trace 18d6f1c79bfdd3f4 ]---
[   42.042889] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[   42.042930] IP: [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[   42.042961] PGD 7c351067 PUD 79b43067 PMD 0 
[   42.042987] Oops: 0000 [#1] SMP 
[   42.043714] Modules linked in: kvm_intel kvm cirrus snd_hda_intel ttm snd_hda_codec snd_hwdep drm_kms_helper serio_raw snd_pcm lp parport snd_page_alloc snd_timer drm snd soundcore syscopyarea sysfillrect sysimgblt i2c_piix4 mac_hid psmouse floppy pata_acpi
[   42.045134] CPU: 0 PID: 617 Comm: mv Tainted: G        W     3.13.0-45-generic #74-Ubuntu
[   42.045134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   42.045134] task: ffff88007accb000 ti: ffff88007acd2000 task.ti: ffff88007acd2000
[   42.045134] RIP: 0010:[<ffffffff8125d4c1>]  [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[   42.045134] RSP: 0018:ffff88007acd3a88  EFLAGS: 00010296
[   42.045134] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000100002
[   42.045134] RDX: 0000000000001302 RSI: ffffffff81a6e81f RDI: 0000000000000000
[   42.045134] RBP: ffff88007acd3b18 R08: ffffffff81a78568 R09: 0000000000000005
[   42.052627] init: failsafe main process (586) killed by TERM signal
[   42.053861] R10: 00000000ffffffe2 R11: ffff88007acd37fe R12: 0000000000100002
[   42.053861] R13: ffffffff81835280 R14: 0000000000001302 R15: ffffffff81a78568
[   42.053861] FS:  00007f1189ff9840(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[   42.053861] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   42.053861] CR2: 0000000000000028 CR3: 00000000789d8000 CR4: 00000000000006f0
[   42.053861] Stack:
[   42.053861]  ffff88007acd3a90 0000000000000103 18d6f1c79bfdd3f4 0000000000002818
[   42.053861]  000000000000142a 0000000000000092 0000000000000216 ffff88007b5c4000
[   42.053861]  ffff88007acd3b18 ffffffff8126e372 ffffffff810677df ffff88007bf53d68
[   42.053861] Call Trace:
[   42.053861]  [<ffffffff8126e372>] ? ext4_journal_abort_handle+0x42/0xc0
[   42.053861]  [<ffffffff810677df>] ? warn_slowpath_common+0x8f/0xa0
[   42.053861]  [<ffffffff8126e7cf>] __ext4_handle_dirty_metadata+0x10f/0x1c0
[   42.053861]  [<ffffffff81277086>] ext4_free_blocks+0x646/0xbf0
[   42.053861]  [<ffffffff812685b5>] ext4_ext_rm_leaf+0x505/0x8f0
[   42.053861]  [<ffffffff81267527>] ? __ext4_ext_check+0x197/0x370
[   42.053861]  [<ffffffff8126ad00>] ? ext4_ext_remove_space+0xc0/0x7e0
[   42.053861]  [<ffffffff8126af5c>] ext4_ext_remove_space+0x31c/0x7e0
[   42.053861]  [<ffffffff8126d300>] ext4_ext_truncate+0xb0/0xe0
[   42.053861]  [<ffffffff81244eb9>] ext4_truncate+0x379/0x3c0
[   42.053861]  [<ffffffff81245a18>] ext4_evict_inode+0x408/0x4d0
[   42.053861]  [<ffffffff811d8f60>] evict+0xb0/0x1b0
[   42.053861]  [<ffffffff811d9775>] iput+0xf5/0x180
[   42.053861]  [<ffffffff811ce09e>] do_unlinkat+0x18e/0x2b0
[   42.053861]  [<ffffffff8172d2ca>] ? do_page_fault+0x1a/0x70
[   42.053861]  [<ffffffff8172c949>] ? do_async_page_fault+0x29/0xe0
[   42.053861]  [<ffffffff811cf006>] SyS_unlink+0x16/0x20
[   42.053861]  [<ffffffff8173196d>] system_call_fastpath+0x1a/0x1f
[   42.053861] Code: 48 89 e5 41 57 4d 89 c7 41 56 41 89 d6 41 55 49 89 f5 48 c7 c6 1f e8 a6 81 41 54 49 89 cc 53 48 89 fb 48 83 ec 68 4c 89 4c 24 60 <48> 8b 47 28 48 8b 57 40 48 8b 80 f8 02 00 00 48 8b 40 68 89 90 
[   42.053861] RIP  [<ffffffff8125d4c1>] __ext4_error_inode+0x31/0x160
[   42.053861]  RSP <ffff88007acd3a88>
[   42.053861] CR2: 0000000000000028
[   42.054171] ---[ end trace 18d6f1c79bfdd3f5 ]---
[   42.099720] init: flush-early-job-log main process (590) terminated with status 1
[   42.372030] type=1400 audit(1425311935.380:8): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/sbin/dhclient" pid=697 comm="apparmor_parser"
[   42.373936] type=1400 audit(1425311935.380:9): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=697 comm="apparmor_parser"
[   42.375930] type=1400 audit(1425311935.380:10): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=697 comm="apparmor_parser"
[   42.378329] type=1400 audit(1425311935.384:11): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=697 comm="apparmor_parser"
[   42.722323] init: dmesg main process (734) terminated with status 7
[   44.410227] init: plymouth-upstart-bridge main process ended, respawning
[   91.135333] random: nonblocking pool is initialized
[  302.048091] EXT4-fs (vda1): error count since last fsck: 4
[  302.048210] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[  302.048312] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[86807.520203] EXT4-fs (vda1): error count since last fsck: 4
[86807.520466] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[86807.520633] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[173315.040128] EXT4-fs (vda1): error count since last fsck: 4
[173315.040240] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[173315.040408] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[259822.560129] EXT4-fs (vda1): error count since last fsck: 4
[259822.560234] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[259822.560408] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[346330.080128] EXT4-fs (vda1): error count since last fsck: 4
[346330.080241] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[346330.080415] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[432837.600131] EXT4-fs (vda1): error count since last fsck: 4
[432837.600244] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[432837.600423] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756
[519345.120107] EXT4-fs (vda1): error count since last fsck: 4
[519345.120189] EXT4-fs (vda1): initial error at time 1425309788: ext4_journal_check_start:56
[519345.120287] EXT4-fs (vda1): last error at time 1425311935: ext4_mb_generate_buddy:756

Linux hostname 3.13.0-45-generic #74-Ubuntu SMP Tue Jan 13 19:36:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
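
The `initial error at time` and `last error at time` values in the log above are seconds since the Unix epoch. A minimal Python sketch (not part of the original comment) that converts the two values quoted above to UTC; the helper name `epoch_to_utc` is made up for illustration:

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Format a Unix timestamp (seconds) as a UTC date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )


# Values copied from the "EXT4-fs (vda1)" lines above.
initial_error = 1425309788
last_error = 1425311935

initial_str = epoch_to_utc(initial_error)  # 2015-03-02 15:23:08 UTC
last_str = epoch_to_utc(last_error)        # 2015-03-02 15:58:55 UTC
gap_seconds = last_error - initial_error   # 2147

print("initial error:", initial_str)
print("last error:   ", last_str)
print("gap (seconds):", gap_seconds)
```

Both errors fall on 2015-03-02, roughly 36 minutes apart, and the `last error` value matches the `audit(1425311935...)` timestamps logged at about 42 seconds of uptime in the same dmesg.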

Simon Déziel (sdeziel) wrote on 2015-03-09: Re: [Bug 1423672] Re: ext4 turned read-only during logrotate daily run

#20

On 03/04/2015 05:07 PM, Simon Déziel wrote:
> Before losing anyone's time on this issue, I'll first make sure the
> hypervisor's RAM is sane. So I'll get back after some memtest86+ then
> I'll look at the 4.0 kernel.

After 15+ hours of memtest86+ (8 full passes), I don't think the RAM is
at fault. I ran fsck on every VM and many of them had issues. I'm
still using the default Ubuntu kernel for now.

One weird thing I noticed is that apparently all the affected VMs have
trouble with the /etc/cron.daily/logrotate job:

error: error creating output file /var/log/syslog.1.gz: File exists
run-parts: /etc/cron.daily/logrotate exited with return code 1

On those, there was a very old /var/log/syslog.1 and a more recent .1.gz
one too. I removed both and will see if it helps.
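
The condition described above (a rotated log present both uncompressed and as `.gz`, which makes logrotate fail with "File exists") can be checked for with a short script. This is only a sketch under the assumption that rotated logs follow the `name.N` / `name.N.gz` naming seen in the error message; `find_rotation_conflicts` is a hypothetical helper, and the demo runs against a temporary directory rather than a real `/var/log`:

```python
import tempfile
from pathlib import Path


def find_rotation_conflicts(log_dir):
    """Return names of rotated logs that exist both plain and gzip-compressed."""
    conflicts = []
    for gz in sorted(Path(log_dir).glob("*.gz")):
        plain = gz.with_suffix("")  # e.g. syslog.1.gz -> syslog.1
        if plain.is_file():
            conflicts.append(plain.name)
    return conflicts


# Demo on a throwaway directory that mimics the situation in this comment.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("syslog", "syslog.1", "syslog.1.gz", "auth.log.1.gz"):
        (Path(tmp) / name).write_text("")
    conflicts = find_rotation_conflicts(tmp)

print(conflicts)
```

On a real system the function would be pointed at `/var/log`; it only reports candidates and removes nothing.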

Joseph Salisbury (jsalisbury) on 2015-03-09

Changed in linux (Ubuntu):
importance:	Medium → High
Changed in linux (Ubuntu Trusty):
importance:	Undecided → High
status:	New → Confirmed

Oliver Brakmann (obrakmann) wrote on 2015-03-09: Re: ext4 turned read-only during logrotate daily run

#21

FWIW, I've seen this happen, too, basically since I upgraded to Trusty, I believe. Only ever happens on VMs, never on physical machines. All my virtual disks use Virtio, cache mode 'none' (yeah, I know), I/O mode 'native', with raw LVM volumes as backend.

Simon Déziel (sdeziel) wrote on 2015-03-09: Re: [Bug 1423672] Re: ext4 turned read-only during logrotate daily run

#22

On 03/09/2015 04:18 PM, Oliver Brakmann wrote:
> Only ever happens on VMs, never on physical machines. All my
> virtual disks use Virtio, cache mode 'none' (yeah, I know), I/O mode
> 'native', with raw LVM volumes as backend.

Exact same setup here. What's wrong with cache mode set to none? This
avoids double page caching.

Joseph Salisbury (jsalisbury) wrote on 2015-06-03: Re: ext4 turned read-only during logrotate daily run

#23

It's been a while since there was any activity on this bug. Does this issue still happen with the latest updates?

Changed in linux (Ubuntu Trusty):
status:	Confirmed → Incomplete
Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Oliver Brakmann (obrakmann) wrote on 2015-06-09:

#24

Indeed, I haven't seen this error in a few weeks; the last time was maybe six to eight weeks ago. But we all know that absence of evidence isn't evidence of absence :)

Simon Déziel (sdeziel) wrote on 2015-06-10:

#25

Things seem to have stabilized recently as the last occurrence I witnessed was on May 13th.

I searched all my 2014 and 2015 logs for "ext4_mb_generate_buddy" and here is what I got:

Nov 24 02:54:03 smb kernel: [842091.081319] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 8198 vs 8197 free clusters
Jan 18 06:25:17 ns0 kernel: [330912.633362] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 3, block bitmap and bg descriptor inconsistent: 17617 vs 17616 free clusters
Jan 19 06:30:38 ns0 kernel: [417633.256202] EXT4-fs (vda1): initial error at time 1421580317: ext4_mb_generate_buddy:756
Mar 1 06:25:08 ns0 kernel: [154713.220446] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 4, block bitmap and bg descriptor inconsistent: 17270 vs 17268 free clusters
Mar 2 06:26:44 ns0 kernel: [241210.340779] EXT4-fs (vda1): initial error at time 1425209107: ext4_mb_generate_buddy:756
Mar 4 06:25:09 pm kernel: [75552.905227] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 6, block bitmap and bg descriptor inconsistent: 9361 vs 9359 free clusters
Mar 15 06:25:12 smb kernel: [231369.706448] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 3, block bitmap and bg descriptor inconsistent: 17537 vs 17536 free clusters
Mar 23 16:24:25 apt kernel: [ 38.398673] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 33, block bitmap and bg descriptor inconsistent: 22354 vs 22352 free clusters
Mar 26 06:25:18 pm kernel: [223295.593946] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 6, block bitmap and bg descriptor inconsistent: 11499 vs 11497 free clusters
Apr 13 17:42:38 smb kernel: [422214.146392] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:756: group 256, block bitmap and bg descriptor inconsistent: 672 vs 671 free clusters
Apr 14 09:31:22 smtp kernel: [479136.955401] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 7688 vs 7686 free clusters
Apr 26 06:25:07 ns0 kernel: [767851.910454] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 0, block bitmap and bg descriptor inconsistent: 22322 vs 22321 free clusters
Apr 27 06:27:00 ns0 kernel: [854365.159204] EXT4-fs (vda1): initial error at time 1430043907: ext4_mb_generate_buddy:756
Apr 27 06:27:00 ns0 kernel: [854365.169903] EXT4-fs (vda1): last error at time 1430043907: ext4_mb_generate_buddy:756
May 7 08:37:08 log kernel: [ 33.482706] EXT4-fs error (device vdc): ext4_mb_generate_buddy:756: group 2, block bitmap and bg descriptor inconsistent: 28613 vs 28611 free clusters
May 8 08:39:00 log kernel: [86545.376048] EXT4-fs (vdc): initial error at time 1431002228: ext4_mb_generate_buddy:756
May 8 08:39:00 log kernel: [86545.376051] EXT4-fs (vdc): last error at time 1431002228: ext4_mb_generate_buddy:756
May 13 06:25:13 git kernel: [131633.710249] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 14076 vs 14074 free clusters
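
For anyone wanting to summarize similar log excerpts, here is a rough Python sketch (not from the original comment) that extracts the device, group number, and the two free-cluster counts from `ext4_mb_generate_buddy` error lines. The regular expression is written against the message format quoted above; the function name and field names are illustrative assumptions:

```python
import re

# Matches the "EXT4-fs error" lines quoted in this report.
PATTERN = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), block bitmap and bg descriptor inconsistent: "
    r"(?P<first>\d+) vs (?P<second>\d+) free clusters"
)


def parse_buddy_errors(lines):
    """Return (device, group, first, second, first - second) per matching line."""
    results = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            first, second = int(m.group("first")), int(m.group("second"))
            results.append(
                (m.group("dev"), int(m.group("group")), first, second, first - second)
            )
    return results


# Three lines copied from the excerpt above; the middle one is a summary
# line ("initial error at time ...") and is intentionally not matched.
sample = [
    "Nov 24 02:54:03 smb kernel: [842091.081319] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 8198 vs 8197 free clusters",
    "Jan 19 06:30:38 ns0 kernel: [417633.256202] EXT4-fs (vda1): initial error at time 1421580317: ext4_mb_generate_buddy:756",
    "May 13 06:25:13 git kernel: [131633.710249] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 14076 vs 14074 free clusters",
]

results = parse_buddy_errors(sample)
for row in results:
    print(row)
```

In every error line quoted above, the two counts differ by only 1 or 2 clusters.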

May  8 08:39:00 log kernel: [86545.376048] EXT4-fs (vdc): initial error at time 1431002228: ext4_mb_generate_buddy:756
May  8 08:39:00 log kernel: [86545.376051] EXT4-fs (vdc): last error at time 1431002228: ext4_mb_generate_buddy:756
May 13 06:25:13 git kernel: [131633.710249] EXT4-fs error (device vda1): ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor inconsistent: 14076 vs 14074 free clusters
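
For anyone who wants to tally these from their own logs, here is a minimal Python sketch that pulls the device, group, and the two free-cluster counts out of each ext4_mb_generate_buddy error line. The names parse_buddy_errors, BUDDY_RE and the record keys are made up for illustration. Reading the first number as the bitmap count and the second as the descriptor count is an assumption based only on the wording of the message.

```python
import re

# Matches the error line format seen in this report; the line number after
# "ext4_mb_generate_buddy:" varies between kernels (756, 757), so it is not fixed.
BUDDY_RE = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"ext4_mb_generate_buddy:\d+: group (?P<group>\d+), "
    r"block bitmap and bg descriptor inconsistent: "
    r"(?P<first>\d+) vs (?P<second>\d+) free clusters"
)


def parse_buddy_errors(lines):
    """Return one dict per matching error line; other lines are skipped."""
    records = []
    for line in lines:
        match = BUDDY_RE.search(line)
        if match is None:
            continue
        first = int(match.group("first"))    # assumed: count from the bitmap
        second = int(match.group("second"))  # assumed: count from the descriptor
        records.append({
            "device": match.group("device"),
            "group": int(match.group("group")),
            "first": first,
            "second": second,
            "delta": first - second,
        })
    return records


SAMPLE = [
    "May  8 08:39:00 log kernel: [86545.376048] EXT4-fs (vdc): initial error "
    "at time 1431002228: ext4_mb_generate_buddy:756",
    "May 13 06:25:13 git kernel: [131633.710249] EXT4-fs error (device vda1): "
    "ext4_mb_generate_buddy:756: group 5, block bitmap and bg descriptor "
    "inconsistent: 14076 vs 14074 free clusters",
]

for record in parse_buddy_errors(SAMPLE):
    print(record)
```

The "initial error" and "last error" summary lines are skipped on purpose, since they repeat an earlier event rather than report a new one.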

Oliver Brakmann (obrakmann) wrote on 2015-06-13:

#26

No, it isn't gone after all, it just happened again:

Jun 13 15:20:37 oberon kernel: [117379.592167] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 1, block bitmap and bg descriptor inconsistent: 19382 vs 19381 free clusters
Jun 13 15:20:37 oberon kernel: [117379.592903] Aborting journal on device dm-1-8.
Jun 13 15:20:37 oberon kernel: [117379.593659] EXT4-fs (dm-1): Remounting filesystem read-only

$ uname -a
Linux oberon 3.13.0-54-generic #91-Ubuntu SMP Tue May 26 19:15:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Simon Déziel (sdeziel) on 2015-06-15

Changed in linux (Ubuntu Trusty):
status:	Incomplete → Confirmed
summary:	- ext4 turned read-only during logrotate daily run
	+ ext4_mb_generate_buddy:756: group N, block bitmap and bg descriptor inconsistent: X vs Y

Simon Déziel (sdeziel) wrote on 2015-06-15:

#27

And yet again:

root@rproxy:~# dmesg -T | tail
[Thu Jun 11 15:28:56 2015] type=1400 audit(1434050938.364:24): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nc" pid=658 comm="apparmor_parser"
[Thu Jun 11 15:28:59 2015] init: plymouth-upstart-bridge main process ended, respawning
[Thu Jun 11 15:29:02 2015] init: console-font main process (757) terminated with status 71
[Thu Jun 11 15:29:03 2015] init: plymouth-splash main process (815) terminated with status 1
[Thu Jun 11 18:18:40 2015] random: nonblocking pool is initialized
[Mon Jun 15 16:53:35 2015] EXT4-fs (vda1): pa ffff8800061e1000: logic 3293, phys. 469944, len 424
[Mon Jun 15 16:53:35 2015] EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3753: group 14, free 13, pa_free 11
[Mon Jun 15 16:53:35 2015] Aborting journal on device vda1-8.
[Mon Jun 15 16:53:36 2015] EXT4-fs (vda1): Remounting filesystem read-only
[Mon Jun 15 16:53:36 2015] EXT4-fs error (device vda1): ext4_journal_check_start:56: Detected aborted journal

root@rproxy:~# uname -a
Linux rproxy.dmz.sdeziel.info 3.13.0-54-generic #91-Ubuntu SMP Tue May 26 19:15:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

gschoenberger (schoeni-georg) wrote on 2015-06-22:

#28

It seems we are also hitting this error in one of our VMware guests:
[45780.021968] EXT4-fs error (device dm-9): ext4_mb_generate_buddy:756: group 1411, block bitmap and bg descriptor inconsistent: 32757 vs 32768 free clusters
[45780.022029] Aborting journal on device dm-9-8.
[45780.055657] EXT4-fs (dm-9): Remounting filesystem read-only
[45780.055733] EXT4-fs error (device dm-9) in ext4_orphan_add:2609: Journal has aborted

It has happened four times in the last three weeks, which is pretty annoying. We have tried to "harden" ext4 by using the following mount options:
* ext4 rw,relatime,nodelalloc,errors=remount-ro,commit=2
We have also increased hung_task_timeout_secs, following some hints we found in a Red Hat KB article.

Any ideas on how to fix that ext4 error in the guest?

Peter (peter-hirn) wrote on 2015-09-07:

#29

I'm having this exact issue with Debian Jessie. Virtual machines running on LVM storage fail unpredictably after cron.daily.

[680406.357587] EXT4-fs error (device vda1): ext4_mb_generate_buddy:757: group 49, block bitmap and bg descriptor inconsistent: 18356 vs 18355 free clusters
[680406.361696] Aborting journal on device vda1-8.
[680406.453757] EXT4-fs error (device vda1): ext4_journal_check_start:56:
[680406.453999] EXT4-fs (vda1): Remounting filesystem read-only
[680406.457551] Detected aborted journal

[766809.056039] EXT4-fs (vda1): error count since last fsck: 2
[766809.056069] EXT4-fs (vda1): initial error at time 1441383430: ext4_mb_generate_buddy:757
[766809.056082] EXT4-fs (vda1): last error at time 1441383430: ext4_journal_check_start:56
[853316.576033] EXT4-fs (vda1): error count since last fsck: 2
[853316.576052] EXT4-fs (vda1): initial error at time 1441383430: ext4_mb_generate_buddy:757
[853316.576066] EXT4-fs (vda1): last error at time 1441383430: ext4_journal_check_start:56
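
The "initial error at time" and "last error at time" values are Unix epoch seconds, so they can be turned into a readable date to line them up with other logs. A small sketch follows; the helper name error_time_utc is hypothetical, and the output is in UTC rather than the machine's local time.

```python
import re
from datetime import datetime, timezone

# Captures "initial" or "last" plus the epoch value from the summary lines.
ERROR_TIME_RE = re.compile(r"(initial|last) error at time (\d+):")


def error_time_utc(line):
    """Return (kind, UTC timestamp string) for an ext4 error-time line, else None."""
    match = ERROR_TIME_RE.search(line)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return match.group(1), when.strftime("%Y-%m-%d %H:%M:%S UTC")


line = ("[766809.056069] EXT4-fs (vda1): initial error at time 1441383430: "
        "ext4_mb_generate_buddy:757")
print(error_time_utc(line))  # ('initial', '2015-09-04 16:17:10 UTC')
```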

root@collect1 ~ # uname -a
Linux collect1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux

maxnuv (massimo-n) wrote on 2015-09-10:

#30

Same error: a 14.04 virtual machine running on KVM, on a 14.04 server.

It is not a hardware issue: the entire server was replaced after the second stop.

It is not a hard disk issue either: the server uses hardware RAID.

The server runs more than one virtual machine, but only one is affected.

davidak (davidak) wrote on 2015-09-14:

#31

We have the same error on a 14.04 VM running on a XenServer 6.2 (paravirtualized).

EXT4-fs error (device dm-1): ext4_mb_generate_buddy:756: group 36, block bitmap and bg descriptor inconsistent: 30504 vs 30149 free clusters

Linux xxx 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I created a new VM, restored a backup, and ran the daily cron jobs to reproduce it, but it didn't happen again.

run-parts -v --exit-on-error /etc/cron.daily

output of dmesg -T

...
[Mon Sep 14 14:47:00 2015] random: lvm urandom read with 64 bits of entropy available
[Mon Sep 14 14:47:00 2015] bio: create slab <bio-1> at 1
[Mon Sep 14 14:47:00 2015] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
[Mon Sep 14 14:47:00 2015] EXT4-fs (dm-1): write access will be enabled during recovery
[Mon Sep 14 14:47:01 2015] random: nonblocking pool is initialized
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): orphan cleanup on readonly fs
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): 16 orphan inodes deleted
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): recovery complete
[Mon Sep 14 14:47:01 2015] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null)
...

That seems to be OK. I also tried with cron.weekly...

LouieGosselin (0-ubunbu-d) wrote on 2015-09-14:

#32

I'm on Debian, but it's happening to me as well.
KVM with virtual disks backed by LVM volumes on the host.

Both the VM and the host are running
Linux version 3.16.0-4-amd64 (<email address hidden>) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt11-1+deb8u2 (2015-07-17)

First occurred on August 24th. Manual fsck required, lots of files converted to lost-inodes.
[1992712.418275] EXT4-fs error (device vda2): ext4_mb_generate_buddy:757: group 96, block bitmap and bg descriptor inconsistent: 24017 vs 24015 free clusters
[1992712.513438] Aborting journal on device vda2-8.
[1992712.514007] EXT4-fs (vda2): Remounting filesystem read-only
[1992712.514205] EXT4-fs error (device vda2) in ext4_evict_inode:243: Journal has aborted

Happened again today September 14th in the same VM.
[1489393.753098] EXT4-fs error (device vda2): ext4_mb_generate_buddy:757: group 144, block bitmap and bg descriptor inconsistent: 23914 vs 23913 free clusters
[1489393.803865] Aborting journal on device vda2-8.
[1489393.804439] EXT4-fs (vda2): Remounting filesystem read-only

This is the first syslog activity since I rebooted in August, no block IO errors on the guest or host.

Manual fsck required, files were lost, but everything is running again.

It has not happened in other VMs running older kernels. It also has not happened on the host, however there's very little file system activity on the host. The fact that it hasn't happened on other VMs leads me to believe the bug is inside the guest rather than with KVM - perhaps ext4 or the virtual disk driver.

maxnuv (massimo-n) wrote on 2015-09-14:

#33

Same error, on a single VM, virtio driver, ext4:

EXT4-fs (vda1): pa ffff880010c3daf8: logic 512, phys. 6750208, len 512
EXT4-fs error (device vda1): ext4_mb_release_inode_pa:3773: group 206, free 143, pa_free 142
Aborting journal on device vda1-8
Disk quota active

No memory errors (checked).
No disk errors (checked too).

After a system reset, the filesystem was checked and corrected, but the quota data was corrupted and quotacheck took a long time to complete.

The error correlates closely with filesystem activity: when activity rises, the error becomes more frequent.

Ubuntu 14.04, fully updated, kernel 3.16.0-49-generic.

In Linux Kernel Bug Tracker #104571, linux-ext4 (linux-ext4-linux-kernel-bugs) wrote on 2015-09-15:

#71

This bug report is about ext4 metadata corruption on large (>=10TB) ext4 volumes.
This was also reported in 2014 [ http://marc.info/?l=linux-ext4&m=139878494527370&w=2 ]

I'm getting sporadic FS errors like this one:
(I've pasted more of these at https://8n1.org/10745/cc34)

|  EXT4-fs error (device vdb): ext4_mb_generate_buddy:757:
|   group 79842, block bitmap and bg descriptor inconsistent: 10073 vs 10071
|   free clusters
| Aborting journal on device vdb-8.

An e2fsck run then shows:
| Pass 5: Checking group summary information
| Block bitmap differences:  +(2616281446--2616281447)
| Free blocks count wrong (170942497, counted=129906218).
| Free inodes count wrong (670863012, counted=670860975).

I've patched my kernel with WARN_ON(1); inserted in tactical places and caught
one such situation:

| EXT4-fs (vdb): pa ffff880016544888: logic 982168, phys. 2469410748, len 104
| EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3773: group 75360, free 38, pa_free 36
| Aborting journal on device vdb-8.
| EXT4-fs (vdb): Remounting filesystem read-only
| ------------[ cut here ]------------
| WARNING: CPU: 1 PID: 1706 at fs/ext4/mballoc.c:3774 ext4_mb_release_inode_pa.isra.27+0x1cb/0x2c0()
| Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_tcpudp ip6_tables
|     nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables
|     x_tables cirrus ttm drm_kms_helper drm kvm_intel kvm ppdev syscopyarea sysfillrect
|     8250_fintek serio_raw i2c_piix4 sysimgblt pvpanic parport_pc mac_hid nfsd auth_rpcgss nfs_acl
|     lockd grace sunrpc lp parport autofs4 psmouse floppy pata_acpi
| CPU: 1 PID: 1706 Comm: deluged Not tainted 3.19.8-ckt4 #1
| Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
|  ffffffff81ab4fef ffff8800da1bb978 ffffffff817c3760 0000000000000007
|  0000000000000000 ffff8800da1bb9b8 ffffffff8107696a ffff8800da1bb9a8
|  0000000000000026 0000000000003825 0000000000003824 ffff880016544888
| Call Trace:
|  [<ffffffff817c3760>] dump_stack+0x45/0x57
|  [<ffffffff8107696a>] warn_slowpath_common+0x8a/0xc0
|  [<ffffffff81076a5a>] warn_slowpath_null+0x1a/0x20
|  [<ffffffff812b01bb>] ext4_mb_release_inode_pa.isra.27+0x1cb/0x2c0
|  [<ffffffff812739df>] ? ext4_read_block_bitmap_nowait+0x26f/0x5f0
|  [<ffffffff812b3c6a>] ext4_discard_preallocations+0x30a/0x490
|  [<ffffffff8127b578>] ext4_da_update_reserve_space+0x178/0x1b0
|  [<ffffffff812a9129>] ext4_ext_map_blocks+0xcd9/0xe50
|  [<ffffffff8127b6d9>] ext4_map_blocks+0x129/0x570
|  [<ffffffff8127e89d>] ? ext4_writepages+0x35d/0xca0
|  [<ffffffff812ab3a9>] ? __ext4_journal_start_sb+0x69/0xe0
|  [<ffffffff8127eac2>] ext4_writepages+0x582/0xca0
|  [<ffffffff81187a4e>] do_writepages+0x1e/0x30
|  [<ffffffff8117bbe9>] __filemap_fdatawrite_range+0x59/0x60
|  [<ffffffff8117bc4c>] filemap_write_and_wait+0x2c/0x60
|  [<ffffffff8120903d>] do_vfs_ioctl+0x3fd/0x4e0
|  [<ffffffff812091a1>] SyS_ioctl+0x81/0xa0
|  [<ffffffff817ca84d>] system_call_fastpath+0x16/0x1b
| ---[ end trace c7de4d0d78cb95b6 ]---
| EXT4-fs error (device vdb) in ext4_writepages:2412: IO failure
| EXT4-fs (vdb): ext4_writepages: jbd2_start: 9223372036854775751 pages, ino 84149503; err -30

After this, the system started logging a lot of this same message:
| EXT4-fs error (device vdb): ext4_find_extent:900: inode #84149503: comm deluged: 
|    pblk 225181822 bad header/extent: invalid magic - magic 53fd, entries 37907,
|    max 27407(0), depth 50401(0)

Ran e2fsck and got:
| Pass 5: Checking group summary information
| Block bitmap differences:   +(1431556444--1431556445) +(2469410748-2469410749)
| Free blocks count wrong (134030133, counted=57970467).
| Free inodes count wrong (670746893, counted=670746452).

Which is usually the same output for fsck in these situations.
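
As a cross-check, the block numbers that e2fsck reports can be mapped back to block groups. The sketch below assumes 32768 blocks per group (the usual value for 4 KiB blocks) and a first data block of 0. Neither value is stated in this report, so confirm them with dumpe2fs -h on the affected filesystem. Under those assumptions, the blocks flagged by e2fsck fall in the same groups the kernel complained about.

```python
# Assumptions, not taken from the report: check "Blocks per group" and
# "First block" in the dumpe2fs -h output of the real filesystem.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def block_to_group(block, blocks_per_group=BLOCKS_PER_GROUP,
                   first_data_block=FIRST_DATA_BLOCK):
    """Return the index of the block group that contains a physical block."""
    return (block - first_data_block) // blocks_per_group


# First e2fsck run: "+(2616281446--2616281447)", kernel said group 79842.
print(block_to_group(2616281446))  # 79842

# Second e2fsck run: "+(2469410748-2469410749)", kernel said group 75360.
print(block_to_group(2469410748))  # 75360
```

The second block number is also the "phys." value printed in the ext4_mb_release_inode_pa message, so the preallocation, the kernel error, and the e2fsck finding all point at the same place.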

This server is a QEMU KVM virtual machine running on Intel x64 hardware.

Emmanuel Lacour (elacour) wrote on 2015-09-17:

#34

kern.log with backtrace (18.2 KiB, text/plain)

Same problem here on 2 VMs, appeared 4 times in one week.

VMs are running Debian Jessie, on Debian wheezy hypervisors running ceph giant, libvirt 1.2.9-9~bpo70+1, qemu 1:2.1+dfsg-12~bpo70+1.

FS is ext4, using virtio and rbd with writeback.

After the first crash we upgraded the kernel to 4.1.6-1~bpo8+1, but it crashed again the next day :(

It seems to be related to memory usage, at least for the last crash.

The trace is attached.

We added more memory to see if that helps to avoid triggering the bug.

In Linux Kernel Bug Tracker #104571, linux-ext4 (linux-ext4-linux-kernel-bugs) wrote on 2015-09-28:

#72

Hit by this bug again:

| EXT4-fs error (device vdb): ext4_mb_generate_buddy:757: group 76916,
|    block bitmap and bg descriptor inconsistent: 959 vs 957 free
|    clusters
| Aborting journal on device vdb-8.
| EXT4-fs (vdb): Remounting filesystem read-only
| ------------[ cut here ]------------
| WARNING: CPU: 0 PID: 5000 at fs/ext4/mballoc.c:758
|    ext4_mb_generate_buddy+0x1fa/0x340()
| Modules linked in: btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs
|    msdos jfs xfs libcrc32c xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6
|    ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
|    nf_conntrack iptable_filter ip_tables x_tables cirrus ttm drm_kms_helper
|    drm kvm_intel kvm ppdev syscopyarea sysfillrect serio_raw pvpanic
|    sysimgblt 8250_fintek parport_pc i2c_piix4 mac_hid nfsd auth_rpcgss
|    nfs_acl lockd grace sunrpc lp parport autofs4 psmouse floppy pata_acpi
| CPU: 0 PID: 5000 Comm: deluged Not tainted 3.19.8-ckt4 #1
| Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
|  ffffffff81ab4fef ffff8800db637858 ffffffff817c3760 ffff88011fc0fdb8
|  0000000000000000 ffff8800db637898 ffffffff8107696a 0000000000000001
|  00000000000003bf ffff8800d98e4000 0000000000008000 ffff8800084c1000
| Call Trace:
|  [<ffffffff817c3760>] dump_stack+0x45/0x57
|  [<ffffffff8107696a>] warn_slowpath_common+0x8a/0xc0
|  [<ffffffff81076a5a>] warn_slowpath_null+0x1a/0x20
|  [<ffffffff812ae6ea>] ext4_mb_generate_buddy+0x1fa/0x340
|  [<ffffffff812aed00>] ext4_mb_init_cache+0x390/0x700
|  [<ffffffff812af5cd>] ext4_mb_load_buddy+0x1ad/0x360
|  [<ffffffff812b25d7>] ext4_mb_regular_allocator+0x1b7/0x450
|  [<ffffffff812b41f1>] ext4_mb_new_blocks+0x401/0x550
|  [<ffffffff812a4120>] ? ext4_find_extent+0x140/0x330
|  [<ffffffff812a8a87>] ext4_ext_map_blocks+0x637/0xe50
|  [<ffffffff8127b6d9>] ext4_map_blocks+0x129/0x570
|  [<ffffffff8127eac2>] ext4_writepages+0x582/0xca0
|  [<ffffffff810f0700>] ? get_futex_key+0x1c0/0x2b0
|  [<ffffffff810a71c8>] ? __enqueue_entity+0x78/0x80
|  [<ffffffff81187a4e>] do_writepages+0x1e/0x30
|  [<ffffffff8117bbe9>] __filemap_fdatawrite_range+0x59/0x60
|  [<ffffffff8117bc4c>] filemap_write_and_wait+0x2c/0x60
|  [<ffffffff8120903d>] do_vfs_ioctl+0x3fd/0x4e0
|  [<ffffffff812091a1>] SyS_ioctl+0x81/0xa0
|  [<ffffffff817ca84d>] system_call_fastpath+0x16/0x1b
| ---[ end trace 860c733f0437c9b7 ]---

After which the kernel starts logging in rapid succession:
| EXT4-fs error (device vdb): ext4_find_extent:900: inode #84410502:
|    comm deluged: pblk 223281469 bad header/extent: invalid magic - magic
|    80da, entries 13942, max 3979(0), depth 42746(0)
| EXT4-fs error: 40 callbacks suppressed

In Linux Kernel Bug Tracker #104571, linux-ext4 (linux-ext4-linux-kernel-bugs) wrote on 2015-09-28:

#73

e2fsck 1.42.9 (4-Feb-2014)
[ .. ]
Pass 5: Checking group summary information
Block bitmap differences: +(2520387693--2520387694)
Fix? yes

Free blocks count wrong (57970467, counted=36036558).
Fix? yes

Free inodes count wrong (670746452, counted=670746137).
Fix? yes

/dev/vdb: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vdb: 342503/671088640 files (5.9% non-contiguous), 2648318002/2684354560 blocks

David Brown (david-james-brown) wrote on 2015-10-02:

#35

I'm getting this same issue every day

3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u4 x86_64 GNU/Linux

The guest is running under Proxmox 3.4 with virtio.

Emmanuel Lacour (elacour) wrote on 2015-10-08:

#36

Adding memory did not help, but simply removing the fstab option errors=remount-ro from the affected mountpoints seems to have fixed the issue. That option is usually set only on the root fs, and we have no problem on the root fs, but by mistake it was added to other mountpoints and triggered the bug.

No more read-only remounts for a week now, and no ext4 messages in the logs.
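
To check which mountpoints carry that option, here is a rough Python sketch that reads fstab-style text and reports the errors= setting of each ext4 entry. The function name and the sample fstab content are hypothetical. An entry without errors= falls back to the default stored in the filesystem superblock (see tune2fs -e), so None below means "not set in fstab", not "no error policy".

```python
def ext4_error_policies(fstab_text):
    """Map each ext4 mountpoint to its errors= value, or None if unset in fstab."""
    policies = {}
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mountpoint, type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        mountpoint, options = fields[1], fields[3].split(",")
        policy = next(
            (opt.split("=", 1)[1] for opt in options if opt.startswith("errors=")),
            None,
        )
        policies[mountpoint] = policy
    return policies


# Hypothetical fstab content, for illustration only.
FSTAB_SAMPLE = """\
# <device>  <mountpoint>  <type>  <options>                   <dump> <pass>
/dev/vda1   /             ext4    errors=remount-ro           0      1
/dev/vdb1   /srv/data     ext4    defaults,errors=remount-ro  0      2
/dev/vdc1   /var/log      ext4    defaults,noatime            0      2
/dev/vdd1   none          swap    sw                          0      0
"""

policies = ext4_error_policies(FSTAB_SAMPLE)
non_root = [mp for mp, p in policies.items() if p == "remount-ro" and mp != "/"]
print(policies)
print(non_root)
```

On a real system the text would come from reading /etc/fstab.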

Jan Wagner (waja) wrote on 2015-10-13:

#37

Screenshot_service_2015-10-13_17:25:18.png (10.4 KiB, image/png)

I also got hit by this problem on a Debian Jessie VM [virtio] (the KVM host is Jessie as well).

The following bug might be related: https://bugzilla.kernel.org/show_bug.cgi?id=104571

I'm now trying "cache=none" for this VM to see how that turns out.

Simon Déziel (sdeziel) wrote on 2015-10-13: Re: [Bug 1423672] Re: ext4_mb_generate_buddy:756: group N, block bitmap and bg descriptor inconsistent: X vs Y

#38

On 10/13/2015 11:50 AM, Jan Wagner wrote:
> I got also hit with the problem on Debian Jessie VM [virtio] (KVM host
> is Jessie as well).
>
> The following bug might be related:
> https://bugzilla.kernel.org/show_bug.cgi?id=104571
>
> I'm trying now '"cache=none"' for this VM and see how that turns out.

FYI, all my VMs use cache=none and have small HDDs (LVs of 2-4G), yet they
regularly trip on this bug.

Simon

LouieGosselin (0-ubunbu-d) wrote on 2015-10-16:

#39

It happened here again.

8/24 ext4 corruption
9/14 ext4 corruption
9/29 update/reboot
10/16 ext4 corruption

This time the corruption was severe. 1743 files from multiple directories got moved into lost+found.
It took me almost 2 hours this morning to verify & fix everything. Fortunately every time this has happened, all the files were dated prior to the daily backup and "diff -qr ..." shows exactly what was lost.
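
For those who prefer a script over reading "diff -qr" output, a rough Python equivalent for the "what is missing" part could look like the sketch below. The function name and the directory layout in the usage note are hypothetical, and unlike diff -qr it only lists entries that exist in the backup but not in the live tree; it does not report files whose contents differ.

```python
import filecmp
import os


def missing_from_live(backup_dir, live_dir):
    """List paths (relative to backup_dir) that exist in the backup only."""
    missing = []

    def walk(comparison, prefix):
        # left_only: names present in the backup side but absent on the live side
        for name in comparison.left_only:
            missing.append(os.path.join(prefix, name))
        # recurse into directories that exist on both sides
        for name, sub in comparison.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(backup_dir, live_dir), "")
    return sorted(missing)
```

Usage would be something like missing_from_live("/backup/var/mail", "/var/mail"), with both paths adjusted to wherever the daily backup is mounted.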

As far as I can tell this is not a memory issue and the ext4 FS is using 10G out of 100G.

Every time the corruption has been in /var/mail. However the VM is mostly used for mail so it may not be significant. The /var/mail branch itself is 4.3G

3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux

I'm holding back the kernels on other VM's so this is the only VM with the problem.

Is anyone able to reproduce this on demand?

I really need to do something because this is causing downtime during normal business hours. I'll probably try one of the following:
1. Rebuild the FS from scratch and see if ext4 corruption continues.
2. Use ext3 or something else.

Jan Wagner (waja) wrote on 2015-10-17:

#40

On 10/13/2015 06:00 PM, Simon Déziel wrote:
> On 10/13/2015 11:50 AM, Jan Wagner wrote:
>> I'm trying now '"cache=none"' for this VM and see how that turns out.
>
> FYI, all my VMs use cache=none and have small HDDs (lv of 2-4G) yet they
> regularly trip on this bug.

Looks like this didn't fix the problem for me.

LouieGosselin (0-ubunbu-d) wrote on 2015-10-28:

#41

It's happened again. I've spent several hours on this and I've been able to recreate the failure under some synthetic conditions with a sacrificial VM.

The filebench defaults do not cause an ext4 crash for me, but the following do:

load workloads/fileserver
set $dir=/tmp/
set $nfiles=200000
set $meandirwidth=30000
run 120

The ext4 error never happens in filebench's init phase, only 50s or so into the 50-thread run phase. Less extreme settings won't produce a consistent crash.

Reducing the amount of free memory makes the errors much more likely.

This is before running filebench:
             total       used       free     shared    buffers     cached
Mem:          482M        99M       382M       300K        27M        20M
-/+ buffers/cache:        52M       429M
Swap:         1.9G        94M       1.8G

This is while running filebench one second before the crash:
             total       used       free     shared    buffers     cached
Mem:          482M       476M       5.6M       284K        27M        18M
-/+ buffers/cache:       430M        51M
Swap:         1.9G       253M       1.6G
2769.63

The error is reproducible in cloned VMs.

Moving swap to another disk changes nothing.

As far as I can tell, the error never happens with ext4 filesystems other than the root FS where executables are running from.

I've tried bonnie, stress-ng, and simple scripts, but I have not been able to get these to crash ext4.

The sacrificial VM has not crashed after adding an extra 500MB to it.

Although production was never under such heavy loads, I've added 500MB to the production VM to see if it helps anyways.

LouieGosselin (0-ubunbu-d) wrote on 2015-10-29:

#42

I'm posting again to add that I conducted some more tests and ext3 does not encounter corruption under the same conditions. I hope this information is helpful to others, if anyone needs more information let me know and I'll see what I can do. I'll probably switch my own VMs to ext3 so I don't have to worry about these FS crashes.

Chris J Arges (arges) wrote on 2015-10-29:

#43

Louie,

Can you post some additional information that may help in debugging this issue?
1) Can you post the output of 'virsh dumpxml <vm_domain>' of the affected VM?
2) Can you post the /boot/config-`uname -r` of the affected VM's kernel?
3) What type of partitioning layout does your VM have?

It seems that it has been reproducible in 3.13 to 3.16 in this report, and the upstream report shows 3.19.

I tried this in a local VM with 32GB disk, 2 cpu, 512MB memory, swapfile and could not trivially reproduce with the filebench test case you mentioned above.

--chris

Jan Wagner (waja) wrote on 2015-11-03:

#44

On 29.10.15 at 03:29, LouieGosselin wrote:
> I'm posting again to add that I conducted some more tests and ext3 does
> not encounter corruption under the same conditions. I hope this
> information is helpful to others, if anyone needs more information let
> me know and I'll see what I can do. I'll probably switch my own VMs to
> ext3 so I don't have to worry about these FS crashes.

I created a new LVM volume with a new ext3 FS, copied the whole FS from
ext4 to ext3, and am now running the VM on ext3.

Today the ext4_mb_generate_buddy bug hit me again, with the ext3 FS and
with the VM running with the "cache=none" setting.

BlueT - Matthew Lien - 練喆明 (bluet) wrote on 2015-11-04:

#45

I'm running the LoCo mirror for my country, and this has been affecting our file server for over a week already.
All our official file services are offline now.
We also provide free VM/hosting for open source projects; I can't imagine how bad this disaster will become.

Any suggestion?

All VMs are VMware guest.

root@ncnu-ftp:/ubuntu/mirror# LANG=C free -m
             total       used       free     shared    buffers     cached
Mem:          7984       1453       6531          0         66        352
-/+ buffers/cache:       1033       6950
Swap:         4091          0       4091

root@ncnu-ftp:/ubuntu/mirror# LANG=C touch /lala
touch: cannot touch '/lala': Read-only file system

root@ncnu-ftp:/ubuntu/mirror# LANG=C mount -o remount,rw /
mount: cannot remount block device /dev/mapper/ub--ftp-root read-write, is write-protected

root@ncnu-ftp:/ubuntu/mirror# dumpe2fs /dev/mapper/ub--ftp-root /dev/|grep -i Filesystem\ state
dumpe2fs 1.42.9 (4-Feb-2014)
Filesystem state:         clean with errors

root@ncnu-ftp:/ubuntu/mirror# LC_ALL=C dmesg -T |grep error
[Thu Nov  5 00:35:46 2015] EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
[Thu Nov  5 00:35:55 2015] EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 371, block bitmap and bg descriptor inconsistent: 25478 vs 25477 free clusters
[Thu Nov  5 00:35:55 2015] IP: [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov  5 00:35:55 2015] RIP: 0010:[<ffffffff8125e511>]  [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov  5 00:35:55 2015] RIP  [<ffffffff8125e511>] __ext4_error_inode+0x31/0x160
[Thu Nov  5 00:35:55 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov  5 00:40:39 2015] EXT4-fs (dm-0): error count since last fsck: 15
[Thu Nov  5 00:40:39 2015] EXT4-fs (dm-0): initial error at time 1445904785: ext4_reserve_inode_write:4928
[Thu Nov  5 00:40:39 2015] EXT4-fs (dm-0): last error at time 1446654956: ext4_remount:4816
[Thu Nov  5 01:18:38 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov  5 01:18:59 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov  5 01:28:59 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov  5 01:58:08 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user
[Thu Nov  5 02:20:41 2015] EXT4-fs error (device dm-0): ext4_remount:4816: Abort forced by user

root@ncnu-ftp:/ubuntu/mirror# uname -a
Linux ncnu-ftp 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

root@ncnu-ftp:/ubuntu/mirror# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.3 LTS
Release:	14.04
Codename:	trusty
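The "initial error at time" and "last error at time" lines in the dmesg output above carry Unix epoch timestamps. A minimal Python sketch for turning them into readable dates follows; the regular expression and the helper name `decode_error_times` are illustrative assumptions, not part of any ext4 tooling.

```python
from datetime import datetime, timezone
import re

# Sample lines copied from the dmesg output above.
LINES = [
    "EXT4-fs (dm-0): initial error at time 1445904785: ext4_reserve_inode_write:4928",
    "EXT4-fs (dm-0): last error at time 1446654956: ext4_remount:4816",
]

# Hypothetical pattern for the "initial/last error at time <epoch>: <func>:<line>" lines.
ERROR_TIME = re.compile(r"(initial|last) error at time (\d+): (\w+):(\d+)")


def decode_error_times(lines):
    """Return (kind, UTC datetime, function, source line) for each matching line."""
    decoded = []
    for line in lines:
        m = ERROR_TIME.search(line)
        if m:
            when = datetime.fromtimestamp(int(m.group(2)), tz=timezone.utc)
            decoded.append((m.group(1), when, m.group(3), int(m.group(4))))
    return decoded


for kind, when, func, lineno in decode_error_times(LINES):
    print(f"{kind:7s} {when.isoformat()} {func}:{lineno}")
```

Decoded this way, the initial error dates to 2015-10-27 00:13:05 UTC, which fits the "over a week" mentioned in the comment, and the last error to 2015-11-04 16:35:56 UTC.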

Revision history for this message

In Linux Kernel Bug Tracker #104571, linux-ext4 (linux-ext4-linux-kernel-bugs) wrote on 2015-11-04:

#74

Since then, running 4.2.0-16-generic (Ubuntu 'Wily', 15.10), problem still persists. I've re-created the FS, made it 10T bigger (20T total, 11T in use), copied all data.

Logs:

Nov 4 15:45:16 [447918.885531] EXT4-fs (vdb): pa ffff88002162ef08: logic 192512, phys. 2823200768, len 1929
Nov 4 15:45:16 [447918.885776] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3788: group 86157, free 1917, pa_free 1915
Nov 4 15:45:16 [447918.926300] EXT4-fs (vdb): Remounting filesystem read-only
Nov 4 15:45:16 [447918.926652] EXT4-fs error (device vdb) in ext4_writepages:2520: IO failure
Nov 4 15:45:16 [447918.968068] EXT4-fs (vdb): ext4_writepages: jbd2_start: 9223372036854775790 pages, ino 310215523; err -30

Nov 4 15:54:39 [448481.868756] EXT4-fs error (device vdb): ext4_mb_generate_buddy:758: group 86156, block bitmap and bg descriptor inconsistent: 28672 vs 21008 free clusters
Nov 4 15:54:39 [448481.869114] EXT4-fs (vdb): pa ffff880105bb9888: logic 301056, phys. 2823186432, len 2048
Nov 4 15:54:39 [448481.869343] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3788: group 86156, free 2048, pa_free 1926
Nov 4 15:54:39 [448482.020157] EXT4-fs (vdb): pa ffff88010b6298f0: logic 464896, phys. 2823178240, len 2048
Nov 4 15:54:39 [448482.020434] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3788: group 86156, free 2048, pa_free 1683
Nov 4 15:54:40 [448482.570440] EXT4-fs error (device vdb): ext4_put_super:803: Couldn't clean up the journal

Nov 4 18:21:31 [457294.270301] EXT4-fs error (device vdb): ext4_mb_generate_buddy:758: group 86089, block bitmap and bg descriptor inconsistent: 13143 vs 13141 free clusters
Nov 4 18:21:31 [457294.333046] EXT4-fs (vdb): Remounting filesystem read-only
Nov 4 18:21:31 [457294.333211] EXT4-fs error (device vdb) in ext4_writepages:2520: IO failure

Nov 4 21:22:53 [ 7546.252366] EXT4-fs (vdb): pa ffff8800062eabc8: logic 64708, phys. 2822632644, len 828
Nov 4 21:22:53 [ 7546.252475] EXT4-fs error (device vdb): ext4_mb_release_inode_pa:3788: group 86139, free 42, pa_free 40
Nov 4 21:22:53 [ 7546.317200] EXT4-fs (vdb): Remounting filesystem read-only
Nov 4 21:22:53 [ 7546.317529] EXT4-fs error (device vdb) in ext4_writepages:2520: IO failure
Nov 4 21:22:53 [ 7546.374769] EXT4-fs (vdb): ext4_writepages: jbd2_start: 9223372036854774720 pages, ino 310215559; err -30
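A note on the two odd-looking numbers in the ext4_writepages lines above. The sketch below decodes `err -30` with Python's `errno` module and measures how far the huge "pages" value sits below the largest signed 64-bit integer. Reading the page count as a write budget that starts at LONG_MAX is an assumption here, not something the logs themselves state, and `explain_writepages` is a hypothetical helper name.

```python
import errno
import os

LONG_MAX = 2**63 - 1  # largest signed 64-bit value (LONG_MAX on x86_64 Linux)


def explain_writepages(pages_field, err):
    """Decode the two numeric fields of the ext4_writepages message."""
    code = -err
    return {
        "errno_name": errno.errorcode.get(code, "unknown"),
        "errno_text": os.strerror(code),
        # Assumption: the field counts down from LONG_MAX as pages are written.
        "below_long_max": LONG_MAX - pages_field,
    }


for pages in (9223372036854775790, 9223372036854774720):
    print(explain_writepages(pages, -30))
```

On Linux, errno 30 is EROFS ("Read-only file system"), which matches the read-only remount logged just before. Both page counts are only slightly below LONG_MAX (by 17 and 1087), so they look more like a nearly untouched "write everything" budget than a real number of pages.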

After this last incident at 21:22:53 CET, I've reconfigured the VM to use 'default' values for 'Cache Mode' and 'IO Mode' in libvirt/qemu. It was set to 'Cache Mode: None', 'IO Mode: Native'. It still uses the VirtIO disk bus.

The problem must be triggered by large spikes of random IO (non-sequential reads & writes) to sparse files on the FS. I will try to re-create this in a test-VM.
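For anyone who wants to try the workload described here, the following is a minimal Python sketch of random, block-aligned reads and writes against a sparse file. It is not a confirmed reproducer for this bug: the file size, block size, operation count, and the function name `random_sparse_io` are all illustrative assumptions, and it relies on `os.pread`/`os.pwrite`, which are available on Unix only.

```python
import os
import random
import tempfile


def random_sparse_io(path, file_size, ops, block=4096, seed=0):
    """Issue `ops` random block-sized reads and writes inside a sparse file.

    Returns the number of writes performed.
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Extending with ftruncate sets the size without writing data,
        # so the file starts out sparse on filesystems that support holes.
        os.ftruncate(fd, file_size)
        written = 0
        for _ in range(ops):
            offset = rng.randrange(file_size // block) * block
            if rng.random() < 0.5:
                os.pwrite(fd, os.urandom(block), offset)
                written += 1
            else:
                os.pread(fd, block, offset)
        os.fsync(fd)  # push dirty pages out to the block device
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sparse.img")
        count = random_sparse_io(target, 64 * 1024 * 1024, 1000)
        print(f"{count} random writes issued to {target}")
```

To exercise a guest filesystem, point `path` at a file on the ext4 volume inside the VM and raise `file_size` and `ops` well beyond the small defaults used above.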

Revision history for this message

In Linux Kernel Bug Tracker #104571, linux-ext4 (linux-ext4-linux-kernel-bugs) wrote on 2015-11-05:

#75

And again it went RO. It seems the libvirt/kvm Cache Mode / IO Mode setting is not related; it was worth a shot...

Jan Wagner (waja) on 2015-11-12

Changed in linux (Debian):
importance:	Undecided → Unknown
status:	New → Unknown
affects:	linux-kernel-headers → linux

Revision history for this message

Jan Wagner (waja) wrote on 2015-11-12:

#46

Looks like there are some more LKBT issues for this one:

https://bugzilla.kernel.org/show_bug.cgi?id=102731
https://bugzilla.kernel.org/show_bug.cgi?id=89621

Bug Watch Updater (bug-watch-updater) on 2015-11-12

Changed in linux (Debian):
status:	Unknown → Confirmed

Revision history for this message

LouieGosselin (0-ubunbu-d) wrote on 2015-11-13:

#47

Attachment: /boot/config-3.16.0-4-amd64 for the VM in question (154.0 KiB, text/plain)

chris,

Here is what you asked for, sorry for not getting it earlier.

I don't use virsh. This is how I started KVM to trigger the problem interactively (curses interface):

kvm -drive file=/dev/raid/shared,media=disk,if=none,cache=none,aio=native,format=raw,id=hd0 -device virtio-blk-pci,drive=hd0 -smp 2 -m 1000 -netdev tap,ifname=vm_shared,script=no,downscript=no,id=eth0 -device virtio-net-pci,netdev=eth0,mac=52:54:00:12:34:58 -name shared -runas shared -curses

fdisk -l
Disk /dev/vda: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: AE4BDF3E-0B83-4C17-B104-A5139722F263

Device       Start       End   Sectors  Size Type
/dev/vda1     2048   3905535   3903488  1.9G Linux swap
/dev/vda2  3905536 209713151 205807616 98.1G Linux filesystem

It hasn't happened in this particular VM since upping the RAM so the VM doesn't swap.

My intention was to reproduce on non-production hardware, and then try different kernels, rule out LVM, virtio, etc. But I'm in the middle of a new assignment, I probably won't have time to do this myself before December.

Revision history for this message

LouieGosselin (0-ubunbu-d) wrote on 2015-11-13:

#48

Oops...the above kvm command line is correct but it did not crash with -m 1000, that's what production is using now.
It was crashing consistently with -m 512 about a minute into the synthetic FS load.

Revision history for this message

Udo Giacomozzi (udo-launchpad) wrote on 2015-12-01:

#49

FYI, the same errors happened to me on real hardware (a Raspberry Pi 2 B), based on a custom Debian Jessie image.

The image uses an ext2 filesystem created on an x86 Boot2docker host (Kernel "Linux fbd0c1340061 4.1.12-boot2docker #1 SMP Tue Nov 3 06:03:36 UTC 2015 x86_64 GNU/Linux").

While remounting the filesystem as r-w on the Raspberry and writing to it I got:

[53406.370524] EXT4-fs (mmcblk0p3): mounting ext2 file system using the ext4 subsystem
[53406.575883] EXT4-fs (mmcblk0p3): mounted filesystem without journal. Opts: (null)
[53416.394833] EXT4-fs error (device mmcblk0p3): ext4_mb_generate_buddy:757: group 1, block bitmap and bg descriptor inconsistent: 8953 vs 8990 free clusters
[53435.245967] EXT4-fs error (device mmcblk0p3): ext4_lookup:1417: inode #2: comm rsync: deleted inode referenced: 46849
[53481.805006] EXT4-fs error (device mmcblk0p3): ext4_lookup:1417: inode #2: comm ls: deleted inode referenced: 46849

The Raspi isn't running the most current Kernel yet (it's "Linux intermodul 3.19.3-v7 #1 SMP PREEMPT Mon Nov 30 08:37:00 UTC 2015 armv7l GNU/Linux"), but perhaps this helps analyzing this bug as it's not a VM...

Revision history for this message

In Linux Kernel Bug Tracker #104571, mario (mario-linux-kernel-bugs) wrote on 2016-01-17:

#76

Running Fedora with 4.2.8, that crashed my X completely:

[ 2797.949100] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 1, block bitmap and bg descriptor inconsistent: 2431 vs 2396 free clusters
[ 2797.951110] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 5, block bitmap and bg descriptor inconsistent: 2449 vs 2421 free clusters
[ 2797.952697] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 12, block bitmap and bg descriptor inconsistent: 6323 vs 6319 free clusters
[ 2797.952878] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 13, block bitmap and bg descriptor inconsistent: 12510 vs 12506 free clusters
[ 2797.953017] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 14, block bitmap and bg descriptor inconsistent: 7196 vs 7190 free clusters
[ 2797.961216] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 23, block bitmap and bg descriptor inconsistent: 916 vs 897 free clusters
[ 2797.961765] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 25, block bitmap and bg descriptor inconsistent: 722 vs 721 free clusters
[ 2797.963208] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 26, block bitmap and bg descriptor inconsistent: 904 vs 894 free clusters
[ 2797.976310] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 35, block bitmap and bg descriptor inconsistent: 21 vs 65507 free clusters
[ 2797.977395] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 44, block bitmap and bg descriptor inconsistent: 809 vs 64297 free clusters
[ 2798.063978] JBD2: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[ 2798.064804] JBD2: Spotted dirty metadata buffer (dev = dm-4, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

That's on my standard install, not in a VM (which is what was reported by many other users).
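When a log contains many of these lines, it can help to tabulate the mismatch per block group. The sketch below is one way to do that in Python; the regular expression is an assumption derived from the message format seen in this thread, and the field names `bitmap` and `desc` simply follow the order in which the message names the two counts.

```python
import re

# Assumed pattern for the ext4_mb_generate_buddy message as it appears in this thread.
BUDDY = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), block bitmap and bg descriptor inconsistent: "
    r"(?P<bitmap>\d+) vs (?P<desc>\d+) free clusters"
)

# Three of the lines quoted above.
SAMPLE = """\
[ 2797.949100] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 1, block bitmap and bg descriptor inconsistent: 2431 vs 2396 free clusters
[ 2797.961765] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 25, block bitmap and bg descriptor inconsistent: 722 vs 721 free clusters
[ 2797.976310] EXT4-fs error (device dm-4): ext4_mb_generate_buddy:758: group 35, block bitmap and bg descriptor inconsistent: 21 vs 65507 free clusters
"""


def summarize(text):
    """Return (device, group, first count, second count, difference) per match."""
    rows = []
    for m in BUDDY.finditer(text):
        bitmap, desc = int(m["bitmap"]), int(m["desc"])
        rows.append((m["dev"], int(m["group"]), bitmap, desc, bitmap - desc))
    return rows


for dev, group, bitmap, desc, delta in summarize(SAMPLE):
    print(f"{dev} group {group:3d}: {bitmap:6d} vs {desc:6d} (delta {delta:+d})")
```

Feeding it the full output of `dmesg` separates the off-by-one style mismatches reported by the VM users from the much larger discrepancies seen in groups 35 and 44 here.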

Revision history for this message

nunyaz info (project1750) wrote on 2016-02-29:

#50

I am able to reproduce this bug every single time I suspend and then resume one of my laptops, which is running Xubuntu 15.10 with the 4.2.0-16-generic kernel on a Lenovo R400. I suspect a possible hardware problem, as this only happens on 1 out of the 3 R400s that I have. I'll report back if I remember.

Revision history for this message

Václav Ovsík (vaclav-ovsik-gmail) wrote on 2016-03-21:

#51

This bug is probably what I describe in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502

Bug in Debian 3.16 kernel (kernel from current stable Jessie) used as KVM hypervisor on older Intel CPU.

Simon Déziel (sdeziel) on 2016-03-21

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Revision history for this message

Simon Déziel (sdeziel) wrote on 2016-03-21:

#52

Thanks Václav, your conclusion about older Intel CPU seems to match my setup since this only happens on a Xeon E3110 (which is in fact a re-branded Core2 Duo E8400).

Thanks for bisecting this and figuring out that the fix was:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424

Revision history for this message

Emmanuel Lacour (elacour) wrote on 2016-03-21:

#53

Here we have this problem on two servers running "Intel(R) Xeon(R) CPU X5460 @ 3.16GHz" (with the Debian 7.0 backport kernel 3.16 on the host and 4.x in the VM). We don't seem to have this problem on other nodes running:

Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
Intel(R) Xeon(R) CPU X3470 @ 2.93GHz

We are going to try an upgrade to see if it solves the problem. If it does, that's pretty good news!

Revision history for this message

LouieGosselin (0-ubunbu-d) wrote on 2016-03-21:

#54

Good work.

We also have PE2950III systems running "Intel(R) Xeon(R) CPU X5460 @ 3.16GHz".

If this is indeed the fix, I'm confused why it would only affect certain cpus?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424

I'll have to come up with a plan to replace debian's stable/jessie kernel with an unmanaged one on the host. I'm not keen on doing that as the DRAC units on these are not very reliable...

Revision history for this message

Chris J Arges (arges) wrote on 2016-04-20:

#55

Here's a test build for trusty 3.13 with https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 applied:

http://people.canonical.com/~arges/lp1423672/

Can someone verify this does fix the issue so this can be SRU'ed into 3.13?

Revision history for this message

Leonardo Borda (lborda) wrote on 2016-04-20:

#56

Hi Chris,

This is also seen on kernel 3.16.0-51-generic. Could you get us a test kernel build for 3.16 as well?

Thank you
Leo

Revision history for this message

Chris J Arges (arges) wrote on 2016-04-20:

#57

Leo,
Also uploaded a lts-utopic build here:
http://people.canonical.com/~arges/lp1423672/

It will be all the debs with '3.16' in the name.

Changed in linux-lts-utopic (Ubuntu):
status:	New → Invalid

Revision history for this message

Simon Déziel (sdeziel) wrote on 2016-04-20:

#58

Thanks Chris. So far, no regression with the Trusty kernel:

$ uname -a
Linux xeon 3.13.0-86-generic #130~lp1423672v201604200743 SMP Wed Apr 20 12:44:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Since the issue only happens rarely, more people testing it would be welcome.

Chris J Arges (arges) on 2016-04-20

Changed in linux (Ubuntu Trusty):
assignee:	nobody → Chris J Arges (arges)
Changed in linux-lts-utopic (Ubuntu Trusty):
assignee:	nobody → Chris J Arges (arges)

Chris J Arges (arges) on 2016-04-21

description:

updated

Revision history for this message

Kamal Mostafa (kamalmostafa) wrote on 2016-04-21:

#59

Additional positive test result notes:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502#27
https://bugzilla.kernel.org/show_bug.cgi?id=102731#c43

Tim Gardner (timg-tpi) on 2016-04-21

Changed in linux (Ubuntu Trusty):
status:	Confirmed → Fix Committed
Changed in linux-lts-utopic (Ubuntu Trusty):
status:	New → Fix Committed

Revision history for this message

Simon Déziel (sdeziel) wrote on 2016-04-28:

#60

I recently reinstalled the affected host to run Xenial so I can no longer test the proposed fix for the 3.13 kernel.

Revision history for this message

Kamal Mostafa (kamalmostafa) wrote on 2016-05-19:

#61

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-trusty

Revision history for this message

Kamal Mostafa (kamalmostafa) wrote on 2016-05-24:

#62

Anyone affected by this issue: We're looking for a positive verification that the Trusty kernel version currently available in -proposed (3.13.0-87.132) fixes this problem. If you can provide that confirmation, please do!

Revision history for this message

LouieGosselin (0-ubunbu-d) wrote on 2016-05-25:

#63

I have a "PE 2950III Intel(R) Xeon(R) CPU X5460 @ 3.16GHz" server here and I've been trying to test this out. I'm using an "rsync" copy of an original server exhibiting the problem. So far though I've been unable to reproduce the original error at all.

It would seem that using the exact same OS/kernel/binaries, the error doesn't happen on a fresh filesystem, I guess there must have been something about the filesystem image itself that triggered the fault. So my dilemma is that I don't know how to reproduce this fault on a fresh install. So while I can test this update, I'm not sure how valid the test will be on an installation that isn't faulting.

Does anyone have a suggestion or have an idea about how to reproduce the conditions?

Revision history for this message

Simon Déziel (sdeziel) wrote on 2016-05-25:

#64

Louie, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502#22 provides a simple way to reproduce. If you could give it a try that would be appreciated.

FYI, when my hypervisor was running Trusty (3.13), the problem was reproducible on fresh VMs with brand new ext4 FSes, so hopefully that will be easy to reproduce for you.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2016-05-31:

#65

This bug was fixed in the package linux - 3.13.0-87.133

---------------
linux (3.13.0-87.133) trusty; urgency=low

[ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1585315

[ Upstream Kernel Changes ]

  * Revert "usb: hub: do not clear BOS field during reset device"
    - LP: #1582864

linux (3.13.0-87.132) trusty; urgency=low

[ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1582398

[ Kamal Mostafa ]

  * [Config] Drop ozwpan from the ABI

[ Luis Henriques ]

  * [Config] CONFIG_USB_WPAN_HCD=n
    - LP: #1463740
    - CVE-2015-4004

[ Prarit Bhargava ]

  * SAUCE: (no-up) ACPICA: Dispatcher: Update thread ID for recursive
    method calls
    - LP: #1577898

[ Upstream Kernel Changes ]

  * usbnet: cleanup after bind() in probe()
    - LP: #1567191
    - CVE-2016-3951
  * KVM: x86: bit-ops emulation ignores offset on 64-bit
    - LP: #1423672
  * USB: usbip: fix potential out-of-bounds write
    - LP: #1572666
    - CVE-2016-3955
  * x86/mm/32: Enable full randomization on i386 and X86_32
    - LP: #1568523
    - CVE-2016-3672
  * Input: gtco - fix crash on detecting device without endpoints
    - LP: #1575706
    - CVE-2016-2187
  * atl2: Disable unimplemented scatter/gather feature
    - LP: #1561403
    - CVE-2016-2117
  * ALSA: usb-audio: Skip volume controls triggers hangup on Dell USB Dock
    - LP: #1577905
  * fs/pnode.c: treat zero mnt_group_id-s as unequal
    - LP: #1572316
  * propogate_mnt: Handle the first propogated copy being a slave
    - LP: #1572316
  * drm: Balance error path for GEM handle allocation
    - LP: #1579610
  * x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
    - LP: #1538429
    - CVE-2016-2069
  * x86/mm: Improve switch_mm() barrier comments
    - LP: #1538429
    - CVE-2016-2069
  * net: fix infoleak in llc
    - LP: #1578496
    - CVE-2016-4485
  * net: fix infoleak in rtnetlink
    - LP: #1578497
    - CVE-2016-4486

-- Kamal Mostafa <email address hidden> Tue, 24 May 2016 11:04:30 -0700

Changed in linux (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2016-06-09:

#67

This bug was fixed in the package linux-lts-utopic - 3.16.0-73.95~14.04.1

---------------
linux-lts-utopic (3.16.0-73.95~14.04.1) trusty; urgency=low

[ Kamal Mostafa ]

  * CVE-2016-1583 (LP: #1588871)
    - ecryptfs: fix handling of directory opening
    - SAUCE: proc: prevent stacking filesystems on top
    - SAUCE: ecryptfs: forbid opening files without mmap handler

-- Andy Whitcroft <email address hidden> Thu, 09 Jun 2016 08:46:24 +0100

Changed in linux-lts-utopic (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Bug Watch Updater (bug-watch-updater) on 2016-07-08

Changed in linux (Debian):
status:	Confirmed → Fix Released

Joseph Salisbury (jsalisbury) on 2016-09-30

tags:

added: verification-done-trusty
removed: verification-needed-trusty

Revision history for this message

chenyuchai (chenyuchai) wrote on 2016-10-15:

#68

Hi all,
I need your confirmation. Is the fix https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 the one that solves this issue? If not, please show the patch. Thank you very much!

Revision history for this message

Simon Déziel (sdeziel) wrote on 2016-10-17:

#69

Chenyuchai, yes, https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 is the fix. It was integrated in the Trusty kernel 3.13.0-87.133
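A quick way to check whether a machine runs a kernel at or past the releases named in this thread is sketched below in Python. The ABI thresholds come from the Launchpad Janitor messages above (3.13.0-87 for Trusty, 3.16.0-73 for linux-lts-utopic); the `FIXED_ABI` table and the `has_fix` helper are illustrative names, and the parsing assumes the usual Ubuntu `uname -r` shape such as `3.13.0-63-generic`.

```python
import platform
import re

# ABI numbers taken from the "This bug was fixed in the package ..." messages above.
FIXED_ABI = {"3.13.0": 87, "3.16.0": 73}


def has_fix(release):
    """Return True/False for a listed kernel series, or None if the series is unknown."""
    m = re.match(r"(\d+\.\d+\.\d+)-(\d+)", release)
    if not m or m.group(1) not in FIXED_ABI:
        return None
    return int(m.group(2)) >= FIXED_ABI[m.group(1)]


# platform.release() returns the same string as `uname -r`.
print(platform.release(), has_fix(platform.release()))
```

Since the changelog entry is a KVM x86 change, the check is presumably most relevant on the hypervisor rather than inside the guest. A `False` for the 3.16 series may be pessimistic, because the thread only records the version in which the bug was closed, not the first build that carried the patch.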

Revision history for this message

In Linux Kernel Bug Tracker #104571, dpsenner (dpsenner-linux-kernel-bugs) wrote on 2019-04-09:

#77

We were recently hit by this issue too, on a physical x64 machine that hosts one KVM guest running Windows Server 2016 x64. The filesystem is ext4 on top of an mdadm software RAID1 spanning two disks.

$ uname -r
4.15.0-43-generic

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

The filesystem of the physical machine went read only and effectively crashed all processes, including the guest operating system. A hard reset and disk check appears to have solved the symptom, but we have mixed feelings since we don't know about any collateral damage this may have caused. dmesg displays worrying information that is hopefully not a symptom of irreversible root filesystem corruption:

systemd-journald[457]: File /var/log/journal/7346ea28f12b763f29b8995058d63291/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.

The symptom manifested while booting the guest machine and thus may well be related to a burst of random reads and writes of the kvm guest. We observed moderate disk activity in htop.

Revision history for this message

Dominik Psenner (dpsenner) wrote on 2019-04-09:

#70

Unfortunately the symptom still persists on bionic (Ubuntu 18.04.1 LTS) with kernel 4.15.0-43-generic. This issue may be a symptom with more than one underlying cause ("lice and fleas"). I updated https://bugzilla.kernel.org/show_bug.cgi?id=104571 by adding a comment with additional information.

Bug Watch Updater (bug-watch-updater) on 2019-04-10

Changed in linux:
importance:	Unknown → Medium
status:	Unknown → Confirmed

Brad Figg (brad-figg) on 2019-07-24

tags:

added: cscc

Revision history for this message

LouieGosselin (0-ubunbu-d) wrote on 2019-07-30:

#78

I'd like to follow up because the issue seems to have cleared up for us after installing linux 5.0.1 about 40 days ago. It's hard to say whether everyone is experiencing the same bugs, but give 5.x a shot and let us know how it goes!

Just to recap: every week or so we were seeing R/O file systems with the following errors, which required a reboot and fsck.

EXT4-fs error (device vda2): ext4_mb_generate_buddy:757: group 144, block bitmap and bg descriptor inconsistent: 23914 vs 23913 free clusters
Aborting journal on device vda2-8.
EXT4-fs (vda2): Remounting filesystem read-only

We never experienced any corruption on the host itself, only under KVM guests.

Host DELL Poweredge 2950III
Several KVM Guests: linux OS, distro&kernel doesn't make any difference, all randomly vulnerable during periods of high disk activity.

Not sure it matters, but in our case we were using LVM2 volumes on the host and kvm media was configured as follows "media=disk,if=virtio,cache=none,aio=native,format=raw".

We initially thought just one guest was affected, but over time we saw it happen with many distros and kernels. It wasn't until we had an extended period of downtime that we decided to reinstall the host with a 5.x kernel. None of the guests experienced any issues since, fingers crossed.

At this point, it's hard to recommend Ubuntu 19.04 given that it's only a few months away from EOL, however the 5.x kernel seems promising whereas the Ubuntu 18.04LTS runs an older kernel that is still known to exhibit the corruption. For LTS I'd look into running it under a custom setup with a newer kernel.

Remote bug watches

debbugs #772848 [done critical patch fixed-upstream upstream]
debbugs #818502 [done critical upstream fixed-upstream patch]
linux-kernel-bugs #102731
linux-kernel-bugs #104571 [NEW]
linux-kernel-bugs #89621 [NEW]