Bug #1780137 “[Regression] EXT4-fs error (device sda1): ext4_val...” : Bugs : linux package : Ubuntu

Revision history for this message

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2018-07-04: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1780137

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Joseph Salisbury (jsalisbury) on 2018-07-05

Changed in linux (Ubuntu):
importance:	Undecided → Medium
tags:	added: kernel-da-key
Changed in linux (Ubuntu):
status:	Incomplete → Triaged
Changed in linux (Ubuntu Bionic):
status:	New → Triaged
importance:	Undecided → Medium
tags:	added: bionic

Revision history for this message

Colin Ian King (colin-king) wrote on 2018-07-05:

#2

What is the stress-ng command that is being run by /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng - without knowing that it's hard to figure out the initial stressor conditions

Revision history for this message

dann frazier (dannf) wrote on 2018-07-05: Re: [Bug 1780137] Re: [Regression] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap

#3

On Thu, Jul 5, 2018 at 9:35 AM Colin Ian King
<email address hidden> wrote:
>
> What is the stress-ng command that is being run by /usr/lib/plainbox-
> provider-checkbox/bin/disk_stress_ng - without knowing that it's hard to
> figure out the initial stressor conditions

It runs several stress-ng tests against the target. In my
reproduction, I've disabled all but one, and that is 'chdir'. Here's
the invocation:

stress-ng --aggressive --verify --timeout 240 --aio 0 --hdd-opts dsync
--readahead-bytes 16M -k

This is the only stress-ng command that runs after boot/fresh-mkfs.
However, as I mentioned, this stress-ng command alone did not appear
to trigger the issue. I suspect that this is due to sysctls that cert
sets before the running the test, but I have not attempted to prove
that yet.

-dann

Revision history for this message

Colin Ian King (colin-king) wrote on 2018-07-05:

#4

It may be worth grabbing a copy of the /proc/sys on a clean boot and then a copy after the sysctl changes so we can get and idea of any specific tweaks that may have occurred.

Just to note, I've been running the stress-ng command as noted in comment #3 on a 24 CPU ARM64 Synquacer box with the alleged faulty kernel and cannot reproduce the issue.

dann frazier (dannf) on 2018-07-05

description:

updated

Revision history for this message

dann frazier (dannf) wrote on 2018-07-05:

#5

The significant difference in the cert test wrapper actually appears to be that it creates a directory in which to run the test, vs. running it at the root of the mount point. I'm able to reproduce the failure outside of cert after a fresh reboot with the following script. The fs.aio-max-nr sysctl was a red herring.

mkdir -p /tmp/mnt
# My script does a mkfs.ext4 -F, but omitting that here in case someone blindly cut & pastes
mkfs.ext4 /dev/sda1
mount /dev/sda1 /tmp/mnt
# Running directly under /tmp/mnt doesn't trigger the issue, we need
# this subdirectory for some reason
mkdir -p /tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d
stress-ng --aggressive --verify --timeout 240 --temp-path /tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d --chdir 0 --hdd-opts dsync --readahead-bytes 16M -k

@Colin: I updated the description to note that we have seen this on 2 different HiSilicon servers that use different HiSilicon SoCs. We've also tested on a Cavium Sabre (ThunderX2-based) board, and were unable to reproduce it there.

Revision history for this message

dann frazier (dannf) wrote on 2018-07-06:

#6

Upstream thread: https://lkml.org/lkml/2018/7/6/763

Revision history for this message

Colin Ian King (colin-king) wrote on 2018-07-06:

#7

I would be useful to see if a non-SMP boot will cause the same issue; if it only occurs in SMP boots then we know it's a lower level locking/race style of problem.

Revision history for this message

dann frazier (dannf) wrote on 2018-07-10:

#8

Copying over comments from Ike from a different forum:

Ike Panhc (ikepanhc) wrote on 2018-07-09: #42
I build kernel debs based on Ted's response on ext4 mailing list at

http://kernel.ubuntu.com/~ikepanhc/ext4msg61578/

Please also see

https://www.spinics.net/lists/linux-ext4/msg61578.html

Now running test on d05-4

Ike Panhc (ikepanhc) wrote 14 hours ago: #43
kernel debs in #42 generates other ext4 error message. I am going to tell if there is patch missing from upstream ext4 or we can response these to Ted.

# grep EXT4 d05-4-ipmi.log
[ 26.215587] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
[ 29.844105] EXT4-fs (sdb2): re-mounted. Opts: errors=remount-ro
[ 3586.211348] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4705: bad block bitmap checksum
[ 8254.776992] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4193: bad block bitmap checksum
[12921.245105] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 21254: bad block bitmap checksum
[17595.654170] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13197: bad block bitmap checksum
[17627.071833] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13214: bad block bitmap checksum
[17668.508879] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13230: bad block bitmap checksum
[17764.903199] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13199: bad block bitmap checksum
[19799.801233] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 19978: bad block bitmap checksum
[36257.005491] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 14158: bad block bitmap checksum
[40924.017749] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 18668: bad block bitmap checksum
[40927.479721] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 18679: bad block bitmap checksum
[50223.976763] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 2865: bad block bitmap checksum
[54896.662047] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 5029: bad block bitmap checksum
[54896.751793] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 5041: bad block bitmap checksum

Copying over comments from Ike from a different forum:

Ike Panhc (ikepanhc) wrote on 2018-07-09:	#42
I build kernel debs based on Ted's response on ext4 mailing list at

http://kernel.ubuntu.com/~ikepanhc/ext4msg61578/

Please also see

https://www.spinics.net/lists/linux-ext4/msg61578.html

Now running test on d05-4

Ike Panhc (ikepanhc) wrote 14 hours ago:	#43
kernel debs in #42 generates other ext4 error message. I am going to tell if there is patch missing from upstream ext4 or we can response these to Ted.

# grep EXT4 d05-4-ipmi.log
[ 26.215587] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
[ 29.844105] EXT4-fs (sdb2): re-mounted. Opts: errors=remount-ro
[ 3586.211348] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4705: bad block bitmap checksum
[ 8254.776992] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4193: bad block bitmap checksum
[12921.245105] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 21254: bad block bitmap checksum
[17595.654170] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13197: bad block bitmap checksum
[17627.071833] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13214: bad block bitmap checksum
[17668.508879] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13230: bad block bitmap checksum
[17764.903199] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 13199: bad block bitmap checksum
[19799.801233] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 19978: bad block bitmap checksum
[36257.005491] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 14158: bad block bitmap checksum
[40924.017749] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 18668: bad block bitmap checksum
[40927.479721] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 18679: bad block bitmap checksum
[50223.976763] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 2865: bad block bitmap checksum
[54896.662047] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 5029: bad block bitmap checksum
[54896.751793] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 5041: bad block bitmap checksum

Revision history for this message

dann frazier (dannf) wrote on 2018-07-10:

#9

I am no longer able to reproduce after applying Ted's patch. I was able to run my unit test 20 times w/o failing on upstream + patch on d05-6. I then switched over to the Ubuntu kernel + patch, and it has now passed 64 times (and counting).

I then looked to see why Ike is still observing a failure. My theory is that the filesystem Ike was testing was already corrupted by a previous *unpatched* run, so the kernel is finding pre-existing corruption. Evidence follows.

The last record of "sudo mkfs.ext4" running in /var/log/auth.log:

Jul 5 06:43:39 d05-4 sudo: ubuntu : TTY=ttyAMA0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/mkfs.ext4 /dev/sda2

While the kernel w/ the fix wasn't built until Jul 9:
[ 0.000000] Linux version 4.15.0-25-generic (root@recht) (gcc version
7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)) #27+ext4msg61578.1 SMP Mon Jul 9 08:28:49 UTC 2018 (Ubuntu 4.1
5.0-25.27+ext4msg61578.1-generic 4.15.18)

The first time /dev/sda2 was mounted after booting this kernel, it reported known errors:
Jul 9 05:43:33 d05-4 kernel: [ 138.522140] EXT4-fs (sda2): warning: mounting fs with errors, running e2fsck is recommended

Looking at the conserver log (logs all console activity on this system), it looks like the test used did not reformat the disk between iterations:
root@d05-4:~# ^G^G^G^G^G^G^G^G^G^G^G^Gwhile true; do sudo /usr/lib/plainbox-provider-checkbox/bin/disk
_st^Mtress_ng sda --base-time 240 --really-run; done

Finally, I manually ran mkfs.ext4 on /dev/sda2. Afterwards, both my unit test (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137/comments/5) and the full disk_stress_ng cert test Ike was running passed without error.

Revision history for this message

dann frazier (dannf) wrote on 2018-07-10:

#10

@Ike: Let me know if the above sounds plausible. I should also make the following notes, which may provide evidence counter to my theory of pre-existing corruption:

It is possible that mkfs.ext4 was ran outside of sudo (e.g. directly in a root shell), and would therefore not be logged in /var/log/auth.log. It's also possible that testing was running in an ssh shell, so the commands I gleaned from the console may not be the full picture.

While it is true that the first mount of /dev/sda2 after applying Ted's patch reported "mounting fs with errors", subsequent mounts do not have that message, and yet there are "bad block bitmap checksum" messages that follow:

Jul 9 08:19:42 d05-4 kernel: [ 139.572607] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
Jul 9 09:17:09 d05-4 kernel: [ 3586.211348] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4705: bad block bitmap checksum
Jul 9 09:37:34 d05-4 kernel: [ 4810.952360] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
Jul 9 10:34:58 d05-4 kernel: [ 8254.776992] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383: comm stress-ng: bg 4193: bad block bitmap checksum

Does that mean the filesystem was somehow no longer corrupted at that time (e.g. fsck'd, or re-mkfs'd), or is it possible that the fs-has-errors flag was just cleared while corruption persisted?

dann frazier (dannf) on 2018-07-10

description:

updated

Kleber Sacilotto de Souza (kleber-souza) on 2018-07-11

Changed in linux (Ubuntu Bionic):
status:	Triaged → Fix Committed

Revision history for this message

Ike Panhc (ikepanhc) wrote on 2018-07-11:

#11

consolelog.txt Edit (7.1 KiB, text/plain)

@dannf, my command is

`while true; do sudo /usr/lib/plainbox-provider-checkbox/bin/disk_stress_ng sda --base-time 240 --really-run; mkfs.ext4 -F /dev/sda2; done`

But I also find out that the e2fsprogs I am using is upstream one.

mke2fs 1.44.3-rc2 (3-July-2018)

I am going to reset my environment and re-test.

Seth Forshee (sforshee) on 2018-07-11

Changed in linux (Ubuntu):
status:	Triaged → Fix Committed

Revision history for this message

Brad Figg (brad-figg) wrote on 2018-07-12:

#12

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: verification-needed-bionic

Revision history for this message

dann frazier (dannf) wrote on 2018-07-12:

#13

Verification results:

ubuntu@d05-6:~$ sudo ./stress
+ umount /tmp/mnt
umount: /tmp/mnt: not mounted.
+ /bin/true
+ mkdir -p /tmp/mnt
+ mkfs.ext4 -F /dev/sda1
mke2fs 1.44.1 (24-Mar-2018)
/dev/sda1 contains a ext4 file system
last mounted on Wed Jul 11 23:26:23 2018
/dev/sda1 alignment is offset by 244736 bytes.
This may result in very poor performance, (re)-partitioning suggested.
Creating filesystem with 1953124855 4k blocks and 244142080 inodes
Filesystem UUID: 7aa77e9b-8518-4b64-9507-b65d09aa6382
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

+ mount /dev/sda1 /tmp/mnt
+ dir=/tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d
+ mkdir -p /tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d
+ stress-ng --aggressive --verify --timeout 240 --temp-path /tmp/mnt/tmp/disk_stress_ng_f70f0f26-b332-4c48-9e07-67c529770e3d --chdir 0 --hdd-opts dsync --readahead-bytes 16M -k
stress-ng: info: [2670] dispatching hogs: 64 chdir
stress-ng: info: [2670] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [2670] successful run completed in 244.71s (4 mins, 4.71 secs)
ubuntu@d05-6:~$ cat /proc/version
Linux version 4.15.0-27-generic (buildd@bos02-arm64-027) (gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)) #29-Ubuntu SMP Wed Jul 11 08:20:51 UTC 2018

tags:

added: verification-done-bionic
removed: verification-needed-bionic

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-07-20:

#14

Download full text (4.1 KiB)

This bug was fixed in the package linux - 4.15.0-29.31

---------------
linux (4.15.0-29.31) bionic; urgency=medium

* linux: 4.15.0-29.31 -proposed tracker (LP: #1782173)

  * [SRU Bionic][Cosmic] kernel panic in ipmi_ssif at msg_done_handler
    (LP: #1777716)
    - ipmi_ssif: Fix kernel panic at msg_done_handler

  * Update to ocxl driver for 18.04.1 (LP: #1775786)
    - misc: ocxl: use put_device() instead of device_unregister()
    - powerpc: Add TIDR CPU feature for POWER9
    - powerpc: Use TIDR CPU feature to control TIDR allocation
    - powerpc: use task_pid_nr() for TID allocation
    - ocxl: Rename pnv_ocxl_spa_remove_pe to clarify it's action
    - ocxl: Expose the thread_id needed for wait on POWER9
    - ocxl: Add an IOCTL so userspace knows what OCXL features are available
    - ocxl: Document new OCXL IOCTLs
    - ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()

  * Critical upstream bugfix missing in Ubuntu 18.04 - frequent Xorg crash after
    suspend (LP: #1776887)
    - ocxl: Document the OCXL_IOCTL_GET_METADATA IOCTL

  * Hard LOCKUP observed on stressing Ubuntu 18 04 (LP: #1777194)
    - powerpc: use NMI IPI for smp_send_stop
    - powerpc: Fix smp_send_stop NMI IPI handling

  * IPL: ppc64_cpu --frequency hang with INFO: rcu_sched detected stalls on
    CPUs/tasks on w34 and wsbmc016 with 920.1714.20170330n (LP: #1773964)
    - rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops

  * [Regression] EXT4-fs error (device sda2): ext4_validate_block_bitmap:383:
    comm stress-ng: bg 4705: bad block bitmap checksum (LP: #1781709)
    - SAUCE: Revert "UBUNTU: SAUCE: ext4: fix ext4_validate_inode_bitmap: comm
      stress-ng: Corrupt inode bitmap"
    - SAUCE: ext4: check for allocation block validity with block group locked

linux (4.15.0-28.30) bionic; urgency=medium

* linux: 4.15.0-28.30 -proposed tracker (LP: #1781433)

  * Cannot set MTU higher than 1500 in Xen instance (LP: #1781413)
    - xen-netfront: Fix mismatched rtnl_unlock
    - xen-netfront: Update features after registering netdev

linux (4.15.0-27.29) bionic; urgency=medium

* linux: 4.15.0-27.29 -proposed tracker (LP: #1781062)

  * [Regression] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99:
    comm stress-ng: Corrupt inode bitmap (LP: #1780137)
    - SAUCE: ext4: fix ext4_validate_inode_bitmap: comm stress-ng: Corrupt inode
      bitmap

linux (4.15.0-26.28) bionic; urgency=medium

* linux: 4.15.0-26.28 -proposed tracker (LP: #1780112)

  * failure to boot with linux-image-4.15.0-24-generic (LP: #1779827) // Cloud-
    init causes potentially huge boot delays with 4.15 kernels (LP: #1780062)
    - random: Make getrandom() ready earlier

linux (4.15.0-25.27) bionic; urgency=medium

* linux: 4.15.0-25.27 -proposed tracker (LP: #1779354)

* hisi_sas_v3_hw: internal task abort: timeout and not done. (LP: #1777736)
- scsi: hisi_sas: Update a couple of register settings for v3 hw

* hisi_sas: Add missing PHY spinlock init (LP: #1777734)
- scsi: hisi_sas: Add missing PHY spinlock init

  * hisi_sas: improve read performance by pre-allocating slot DMA buffers
    (LP: #1777727)
    - scsi: hisi_sas: use dma_zalloc_cohe...

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Fix Released	Medium	Unassigned
	Bionic	Fix Released	Medium	Unassigned

Ubuntu
linux package

[Regression] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Changed in linux (Ubuntu Bionic):
status:	Fix Committed → Fix Released

Ubuntulinux package

[Regression] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:99: comm stress-ng: Corrupt inode bitmap

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package