I've been seeing intermittent resize2fs failures with DIB (diskimage-builder) images deployed with Nova bare metal for months now. The failure occurs at first boot of a Nova bare metal instance: cloud-init makes a call to resize the file system, which fails with the following kernel BUG message:
[ 112.136896] EXT4-fs (sda1): resizing filesystem from 1080688 to 5243214 blocs
[ 112.164072] ------------[ cut here ]------------
[ 112.164179] kernel BUG at fs/ext4/resize.c:409!
[ 112.164285] invalid opcode: 0000 [#1] SMP
[ 112.164488] Modules linked in: openvswitch vxlan ip_tunnel gre libcrc32c noui
[ 112.165042] CPU: 0 PID: 968 Comm: resize2fs Tainted: G I 3.12.9-301
[ 112.165042] Hardware name: Dell Inc. OptiPlex 760 /0M858N, B9
[ 112.165042] task: ffff8800b7969080 ti: ffff8800b74f4000 task.ti: ffff8800b740
[ 112.165042] RIP: 0010:[<ffffffff81254fa1>] [<ffffffff81254fa1>] set_flexbg_0
[ 112.165042] RSP: 0018:ffff8800b74f5c28 EFLAGS: 00010216
[ 112.165042] RAX: ffff8800b743bf00 RBX: ffff88007fae9000 RCX: 0000000000001000
[ 112.165042] RDX: ffff88007f4e2c00 RSI: 0000000000000001 RDI: 0000000000000010
[ 112.165042] RBP: ffff8800b74f5c70 R08: ffff8800afd27750 R09: ffff8800b7daee00
[ 112.165042] R10: 0000000000000000 R11: ffff8800afd27750 R12: 0000000000188000
[ 112.165042] R13: 0000000000000010 R14: 0000000000188000 R15: ffff8800b743b800
[ 112.165042] FS: 00007f858a7f4780(0000) GS:ffff8800be800000(0000) knlGS:00000
[ 112.165042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 112.165042] CR2: 00007f8589c1eea6 CR3: 00000000b7979000 CR4: 00000000000407f0
[ 112.165042] Stack:
[ 112.165042] ffff8800afd27750 ffff8800b96cb060 ffff880000188000 000000108125b
[ 112.165042] 0000000000000010 ffff88007f4e2ed0 00000000000007ff ffff8800b7430
[ 112.165042] ffff8800b743b800 ffff8800b74f5d68 ffffffff81256768 0000000000180
[ 112.165042] Call Trace:
[ 112.165042] [<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
[ 112.165042] [<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
[ 112.165042] [<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
[ 112.165042] [<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
[ 112.165042] [<ffffffff811b9df2>] ? final_putname+0x22/0x50
[ 112.165042] [<ffffffff811c1371>] SyS_ioctl+0x81/0xa0
[ 112.165042] [<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
[ 112.165042] Code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
[ 112.165042] RIP [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180
[ 112.165042] RSP <ffff8800b74f5c28>
[ 112.175633] ---[ end trace f179f994a575df06 ]---
As I mentioned above, this issue seems to occur intermittently if you build and deploy new images all the time. It is, however, reproducible once you hit it: if you use the same DIB image again and again, an image that fails once seems to keep failing... Aha!
Jon recently reported the issue on the ext4 list, and Ted's reply shed some light on the root cause:
http://marc.info/?l=linux-ext4&m=139232631720458&w=2
We were able to trace it back to this DIB commit:
https://github.com/openstack/diskimage-builder/commit/fb246a02eb2ed330d3cc37f5795b
The short story is there *is* an ext4 resize bug that needs fixing. However, we can almost certainly avoid the issue entirely by using the ext4 defaults here, which should be reasonable for most cases. The default allows for a root partition up to 4TB (I think).
Anyway, the root cause of all this is really a design problem in DIB/TripleO at the moment: we shouldn't have to worry about the max size of the root file system when creating our images. Ideally we'd just run mkfs on the root file system itself. Much more efficient, and it avoids this problem altogether...
I think the best thing to do today to avoid this is to make setting max-online-resize an option in DIB. This will allow us to stick to the (well tested) ext4 defaults for most cases, and if someone needs a large root filesystem they can easily bump the setting. This may be temporary until we either fix the design or the ext4 fix is released.
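
To make the proposal concrete, here is a minimal sketch of what "optional max-online-resize" could look like. It is illustration only, not DIB's actual code: DIB itself is written as shell scripts, and the function name build_mkfs_command and its parameters are hypothetical. The only thing it relies on is the `-E resize=max-online-resize` extended option documented in mke2fs(8). When no value is given the option is left out, so mkfs falls back to the ext4 defaults.

```python
def build_mkfs_command(device, max_online_resize=None):
    """Build an argv list for mkfs.ext4 (hypothetical helper, not DIB code).

    max_online_resize is a block count handed to mke2fs as
    "-E resize=<blocks>". None means: do not pass the option at all and
    let mke2fs use its own defaults.
    """
    cmd = ["mkfs.ext4"]
    if max_online_resize is not None:
        if max_online_resize <= 0:
            raise ValueError("max_online_resize must be a positive block count")
        cmd += ["-E", "resize=%d" % max_online_resize]
    cmd.append(device)
    return cmd


if __name__ == "__main__":
    # Default case: stick to the ext4 defaults.
    print(" ".join(build_mkfs_command("/dev/loop0p1")))
    # Opt-in case: someone who needs a large root filesystem bumps the value.
    # The block count here is an arbitrary example value.
    print(" ".join(build_mkfs_command("/dev/loop0p1", max_online_resize=268435456)))
```

The returned list can be handed to subprocess.check_call() as-is. Keeping the default as "option omitted" rather than "some large number" is the point of the proposal: the common path uses exactly the settings that get the most testing.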