intermittent resize2fs failures: kernel BUG at fs/ext4/resize.c:409!

Bug #1280709 reported by Dan Prince
This bug affects 4 people
Affects: diskimage-builder
Status: Fix Released
Importance: Critical
Assigned to: Dan Prince

Bug Description

I've been seeing intermittent resize2fs failures with DIB images (deployed with Nova bare metal) for months now. The failure occurs at first boot of a Nova bare metal instance: cloud-init makes a call to resize the file system, which fails with the following kernel BUG message:

[ 112.136896] EXT4-fs (sda1): resizing filesystem from 1080688 to 5243214 blocs
[ 112.164072] ------------[ cut here ]------------
[ 112.164179] kernel BUG at fs/ext4/resize.c:409!
[ 112.164285] invalid opcode: 0000 [#1] SMP
[ 112.164488] Modules linked in: openvswitch vxlan ip_tunnel gre libcrc32c noui
[ 112.165042] CPU: 0 PID: 968 Comm: resize2fs Tainted: G I 3.12.9-301
[ 112.165042] Hardware name: Dell Inc. OptiPlex 760 /0M858N, B9
[ 112.165042] task: ffff8800b7969080 ti: ffff8800b74f4000 task.ti: ffff8800b740
[ 112.165042] RIP: 0010:[<ffffffff81254fa1>] [<ffffffff81254fa1>] set_flexbg_0
[ 112.165042] RSP: 0018:ffff8800b74f5c28 EFLAGS: 00010216
[ 112.165042] RAX: ffff8800b743bf00 RBX: ffff88007fae9000 RCX: 0000000000001000
[ 112.165042] RDX: ffff88007f4e2c00 RSI: 0000000000000001 RDI: 0000000000000010
[ 112.165042] RBP: ffff8800b74f5c70 R08: ffff8800afd27750 R09: ffff8800b7daee00
[ 112.165042] R10: 0000000000000000 R11: ffff8800afd27750 R12: 0000000000188000
[ 112.165042] R13: 0000000000000010 R14: 0000000000188000 R15: ffff8800b743b800
[ 112.165042] FS: 00007f858a7f4780(0000) GS:ffff8800be800000(0000) knlGS:00000
[ 112.165042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 112.165042] CR2: 00007f8589c1eea6 CR3: 00000000b7979000 CR4: 00000000000407f0
[ 112.165042] Stack:
[ 112.165042] ffff8800afd27750 ffff8800b96cb060 ffff880000188000 000000108125b
[ 112.165042] 0000000000000010 ffff88007f4e2ed0 00000000000007ff ffff8800b7430
[ 112.165042] ffff8800b743b800 ffff8800b74f5d68 ffffffff81256768 0000000000180
[ 112.165042] Call Trace:
[ 112.165042] [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
[ 112.165042] [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
[ 112.165042] [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
[ 112.165042] [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
[ 112.165042] [<ffffffff811b9df2>] ? final_putname+0x22/0x50
[ 112.165042] [<ffffffff811c1371>] SyS_ioctl+0x81/0xa0
[ 112.165042] [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
[ 112.165042] Code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
[ 112.165042] RIP [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180
[ 112.165042] RSP <ffff8800b74f5c28>
[ 112.175633] ---[ end trace f179f994a575df06 ]---
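
For anyone triaging console logs for this failure, here is a minimal Python sketch that looks for the signature shown above. The function name `find_resize_bug` and the layout of the returned dictionary are illustrative assumptions, not part of any tool mentioned in this report; the regular expressions are based only on the log lines quoted here.

```python
import re

# Patterns modelled on the log lines quoted in this report.
RESIZE_RE = re.compile(
    r"EXT4-fs \((?P<dev>[^)]+)\): resizing filesystem "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S*fs/ext4/resize\.c):(?P<line>\d+)!")


def find_resize_bug(log_text):
    """Return details of the first ext4 resize BUG in log_text, or None."""
    resize = None
    for line in log_text.splitlines():
        match = RESIZE_RE.search(line)
        if match:
            # Remember the most recent resize request seen before the BUG.
            resize = match
            continue
        match = BUG_RE.search(line)
        if match:
            return {
                "device": resize.group("dev") if resize else None,
                "old_blocks": int(resize.group("old")) if resize else None,
                "new_blocks": int(resize.group("new")) if resize else None,
                "source_line": int(match.group("line")),
            }
    return None
```

The resize line is optional in the match because a captured console log may only contain the BUG line itself.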

Dan Prince (dan-prince)
Changed in diskimage-builder:
assignee: nobody → Dan Prince (dan-prince)
status: New → In Progress
Dan Prince (dan-prince) wrote :

As I mentioned above, this issue seems to occur intermittently if you build and deploy new images all the time. It is, however, reproducible once you hit it: if you use the same DIB image again and again, an image that fails once seems to keep failing... Aha!

Jon recently reported the issue on the ext4 list, and Ted's reply shed some light on the root cause:

  http://marc.info/?l=linux-ext4&m=139232631720458&w=2

We were able to trace it back to this DIB commit:

https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b

The short story is that there *is* an ext4 resize bug that needs fixing. However, we can almost certainly avoid the issue entirely by using the ext4 defaults here, which should be reasonable for most cases. The default allows for a root partition up to 4TB (I think).

Anyway, the root cause of all this is really a design problem in DIB/TripleO at the moment: we shouldn't have to worry about the max size of the root file system when creating our images. Ideally we'd just mkfs on the root file system itself, which is much more efficient and avoids this problem altogether...

I think the best thing to do today to avoid this is to make max-online-resize an option in DIB. This will allow us to stick to the (well tested) ext4 defaults for most cases, and if someone needs a large root filesystem they can easily bump the setting. This may be temporary until we either fix the design... or the ext4 fix is released.
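
To make the setting concrete: as I understand mke2fs(8), the `resize=` extended option takes the maximum filesystem size as a number of blocks, so a byte size has to be converted first. The Python sketch below does that arithmetic. The helper names are hypothetical, and the 4 KiB block size is an assumption (this report does not state the block size of the images).

```python
def max_online_resize_blocks(max_fs_bytes, block_size=4096):
    """Convert a maximum filesystem size in bytes to a block count.

    block_size defaults to 4096, which is an assumption, not a value
    taken from this bug report.
    """
    if max_fs_bytes <= 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    # Round up so the requested size always fits.
    return -(-max_fs_bytes // block_size)


def mkfs_extended_option(max_fs_bytes, block_size=4096):
    """Build the 'resize=<blocks>' string for mke2fs -E (hypothetical helper)."""
    return "resize=%d" % max_online_resize_blocks(max_fs_bytes, block_size)


if __name__ == "__main__":
    one_pib = 2 ** 50
    print(max_online_resize_blocks(one_pib))
    print(mkfs_extended_option(one_pib))
```

Under these assumptions, a 1PB (2^50 byte) limit works out to 274877906944 blocks, which shows how far above a typical image size the earlier setting was.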

OpenStack Infra (hudson-openstack) wrote : Fix proposed to diskimage-builder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/73854

Dan Prince (dan-prince)
Changed in diskimage-builder:
importance: Undecided → Critical
Dan Prince (dan-prince)
Changed in nova:
assignee: nobody → Dan Prince (dan-prince)
importance: Undecided → High
status: New → In Progress
Changed in diskimage-builder:
assignee: Dan Prince (dan-prince) → Robert Collins (lifeless)
Changed in diskimage-builder:
assignee: Robert Collins (lifeless) → Dan Prince (dan-prince)
Dan Prince (dan-prince)
no longer affects: nova
OpenStack Infra (hudson-openstack) wrote : Fix merged to diskimage-builder (master)

Reviewed: https://review.openstack.org/73854
Committed: https://git.openstack.org/cgit/openstack/diskimage-builder/commit/?id=ea555476ccc41b9d11fda321dd322fbb7b4b3cce
Submitter: Jenkins
Branch: master

commit ea555476ccc41b9d11fda321dd322fbb7b4b3cce
Author: Dan Prince <email address hidden>
Date: Sat Feb 15 23:06:11 2014 -0500

    Make max-online-resize an option.

    In fb246a02eb2ed330d3cc37f5795b3ed026aabe07 we introduced an
    ext4 option to allow root filesystems to be resized up to 1PB.
    This appears to cause an ext4 resize2fs bug in some images.

    When the issue occurs an image will hit a kernel bug when
    cloud-init runs the resize2fs command during first boot:

      kernel BUG at fs/ext4/resize.c:409!

    In this commit we add a new option for max-online-resize
    which can be used if a really large root partition is
    desirable.

    The root cause of all this is really a design problem in
    DIB/TripleO at the moment in that we shouldn't have to worry about the
    max size of the root file system when creating our images. Ideally we'd
    just mkfs on the root file system itself. Much more efficient,
    avoids this problem altogether...

    I think the best thing to do today to avoid this is make setting
    max-online-resize an option in DIB. This will allow us to stick
    to the (well tested) ext4 defaults for most cases, and if someone has
    need for a large root filesystem they can easily bump the setting. This
    may be temporary until we either fix the design... or the ext4 fix is
    released.

    Change-Id: I371f62555d2753cec48790c8fd811c4342af925c
    Closes-bug: #1280709

Changed in diskimage-builder:
status: In Progress → Fix Committed
Changed in diskimage-builder:
status: Fix Committed → Fix Released
Daniele Venzano (venza) wrote :

I'm still observing this issue with images created with diskimage-builder plus the Savanna elements. I don't think the Savanna elements are directly responsible, since all they do is install some more software in the image; they perform no filesystem-related operations.

It's very random and I cannot find a pattern. For example, if I create an image installing version X of some software, it works, but if I change the corresponding element and install version X+1, the resize2fs bug comes up and the VM does not boot.
I tried adding a 100MB 0-filled file in the image (doing a dd inside a post-install element script), just to see if it would help, but its presence seems irrelevant, resize2fs still fails with some images.

I'm also at a loss to explain why it started happening only now (since last week) and we never saw this problem before.
I made sure I'm using the diskimage-builder version with the fix presented above in it.

The images I am working with are under 1GB.

Any idea?

Daniele Venzano (venza) wrote :

This is the crash I am observing, consistently, with this image:
file format: qcow2
virtual size: 2.3G (2442395648 bytes)
disk size: 642M
cluster_size: 65536

With this image it never happens:
file format: qcow2
virtual size: 2.3G (2476474368 bytes)
disk size: 505M
cluster_size: 65536

The only difference is the software installed in them: the first image has Spark 0.9.0 and the second has Spark 0.8.1. Unfortunately I need 0.9, and I cannot manage to create a working image. Both images were created with the latest diskimage-builder.
I understand that the bug lies in the kernel, but I'm looking for a workaround to create images that boot.

[ 14.518794] ------------[ cut here ]------------
[ 14.519886] kernel BUG at /build/buildd/linux-3.11.0/fs/ext4/resize.c:409!
[ 14.521275] invalid opcode: 0000 [#1] SMP
[ 14.522404] Modules linked in: dm_crypt kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev i2c_piix4 psmouse serio_raw virtio_balloon pvpanic mac_hid hid_generic usbhid hid syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm floppy
[ 14.522758] CPU: 0 PID: 795 Comm: resize2fs Not tainted 3.11.0-17-generic #31-Ubuntu
[ 14.522758] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[ 14.522758] task: ffff88007869ddc0 ti: ffff880036850000 task.ti: ffff880036850000
[ 14.522758] RIP: 0010:[<ffffffff812643a1>] [<ffffffff812643a1>] set_flexbg_block_bitmap+0x171/0x180
[ 14.522758] RSP: 0018:ffff880036851c30 EFLAGS: 00010216
[ 14.522758] RAX: ffff880079c748a0 RBX: ffff880036790800 RCX: 0000000000001000
[ 14.522758] RDX: ffff88007c2fd800 RSI: 0000000000000001 RDI: 0000000000000020
[ 14.522758] RBP: ffff880036851c78 R08: ffff88007ba1a618 R09: ffff88007bc05200
[ 14.522758] R10: 0000000000000000 R11: ffff88007ba1a618 R12: 0000000000188000
[ 14.522758] R13: 0000000000000020 R14: 0000000000188000 R15: ffff880079c74380
[ 14.522758] FS: 00007f5634d1a740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 14.522758] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.522758] CR2: 00007f563413a050 CR3: 00000000366a7000 CR4: 00000000000406f0
[ 14.522758] Stack:
[ 14.522758] ffff88007ba1a618 ffff88007bea5000 ffff880000188000 000000208126e7fb
[ 14.522758] 0000000000000010 ffff88007c2fdad0 0000000000000800 ffff880079c74380
[ 14.522758] ffff880079c74380 ffff880036851d70 ffffffff81265df2 0000000000188000
[ 14.522758] Call Trace:
[ 14.522758] [<ffffffff81265df2>] ext4_flex_group_add+0x1262/0x1400
[ 14.522758] [<ffffffff812671ba>] ext4_resize_fs+0x74a/0xdf0
[ 14.522758] [<ffffffff8124a6a6>] ext4_ioctl+0xc16/0xf10
[ 14.522758] [<ffffffff811b7eca>] ? do_filp_open+0x3a/0x90
[ 14.522758] [<ffffffff811b9d75>] do_vfs_ioctl+0x2e5/0x4d0
[ 14.522758] [<ffffffff811b70e2>] ? final_putname+0x22/0x50
[ 14.522758] [<ffffffff811b72e9>] ? putname+0x29/0x40
[ 14.522758] [<ffffffff811a70cb>] ? do_sys_open+0x1bb/0x270
[ 14.522758] [<ffffffff811b9fe1>] SyS_ioctl+0x81/0xa0
[ 14.522758] [<ffffffff816f74dd>] system_call_fastpath+0x1a/0x1f
[ 14.522758] Code: c8 4c 89 df e8 41 40 f7 f...
