Activity log for bug #1933172

Date Who What changed Old value New value Message
2021-06-22 03:35:11 Matthew Ruffell bug added bug
2021-06-22 03:35:25 Matthew Ruffell linux (Ubuntu): status New Fix Released
2021-06-22 03:35:38 Matthew Ruffell nominated for series Ubuntu Bionic
2021-06-22 03:35:38 Matthew Ruffell bug task added linux (Ubuntu Bionic)
2021-06-22 03:35:46 Matthew Ruffell linux (Ubuntu Bionic): status New In Progress
2021-06-22 03:35:50 Matthew Ruffell linux (Ubuntu Bionic): importance Undecided Medium
2021-06-22 03:35:53 Matthew Ruffell linux (Ubuntu Bionic): assignee Matthew Ruffell (mruffell)
2021-06-22 03:38:04 Matthew Ruffell description

Old value:

[Impact]

If you attempt to balance a btrfs filesystem that is nearly full, and this filesystem has had a lot of small, medium and large files created and deleted, such that the b-tree needs to be rotated, then when the balance fails due to not having enough free space, the kernel oopses and the btrfs filesystem hangs.

It doesn't appear to cause any filesystem corruption, and is reproducible every time on affected filesystems.

The following oops is generated:

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 18440 Comm: btrfs Not tainted 4.15.0-136-generic #140-Ubuntu
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:btrfs_set_root_node+0x5/0x60 [btrfs]
RSP: 0018:ffffb3db890a79e0 EFLAGS: 00010282
RAX: ffff8d7f73861ad0 RBX: ffff8d7f78455708 RCX: ffff8d7f6d9a5390
RDX: ffff8d7f73861ad0 RSI: a023775cfc0348a3 RDI: ffff8d7f6d9a5028
RBP: ffffb3db890a7a78 R08: 0000000000000044 R09: 0000000000000228
R10: ffff8d7f6d9a5000 R11: 0000000000000010 R12: ffffb3db890a7a08
R13: ffff8d7f6d9a5000 R14: ffff8d7f6d9a5028 R15: ffff8d7f74560000
FS: 00007f48d84498c0(0000) GS:ffff8d7f7fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe4fbc1f000 CR3: 00000001799fc001 CR4: 0000000000160ef0
Call Trace:
 ? commit_fs_roots+0x130/0x1b0 [btrfs]
 ? btrfs_run_delayed_refs.part.70+0x80/0x190 [btrfs]
 btrfs_commit_transaction+0x42c/0x910 [btrfs]
 ? start_transaction+0x191/0x430 [btrfs]
 relocate_block_group+0x1e7/0x640 [btrfs]
 btrfs_relocate_block_group+0x18f/0x280 [btrfs]
 btrfs_relocate_chunk+0x38/0xd0 [btrfs]
 __btrfs_balance+0x972/0xcd0 [btrfs]
 ? insert_balance_item.isra.35+0x391/0x3c0 [btrfs]
 btrfs_balance+0x32c/0x5a0 [btrfs]
 btrfs_ioctl_balance+0x320/0x390 [btrfs]
 btrfs_ioctl+0x5a6/0x2490 [btrfs]
 ? lru_cache_add_active_or_unevictable+0x36/0xb0
 ? __handle_mm_fault+0x9fd/0x1290
 do_vfs_ioctl+0xa8/0x630
 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 ? do_vfs_ioctl+0xa8/0x630
 ? __do_page_fault+0x2a1/0x4b0
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x41/0xa6
RIP: 0033:0x7f48d7228317
RSP: 002b:00007ffd76d03e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f48d7228317
RDX: 00007ffd76d03ec8 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffd76d03ec8 R08: 0000000000000078 R09: 0000000000000000
R10: 0000562086e7f010 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ffd76d057cb R14: 0000000000000002 R15: 0000000000000000
Code: 4d 85 e4 0f 84 56 fe ff ff 4d 89 04 24 41 c6 44 24 08 84 4d 89 4c 24 09 e9 42 fe ff ff 0f 0b e8 02 24 5e e0 66 90 0f 1f 44 00 00 <48> 8b 06 48 8b 0d c9 d4 99 e1 48 8b 15 d2 d4 99 e1 55 48 89 87
RIP: btrfs_set_root_node+0x5/0x60 [btrfs] RSP: ffffb3db890a79e0

I don't see this behaviour on any upstream kernel, and the first kernel to show this behaviour is 4.15.0-109-generic. The current 4.15.0-145-generic is still affected.

I believe that this is a regression introduced in the fixing of CVE-2019-19036.

[Testcase]

I haven't reliably been able to create a script which places a btrfs filesystem into the state necessary to reproduce this issue, so I have just provided my qcow2 image with my btrfs filesystem which reproduces the issue 100% of the time.

Download the image from here (warning: size is 8.0 GB):

https://people.canonical.com/~mruffell/sf311164/ubuntu18.04-server-2.qcow2

Make an Ubuntu 18.04 VM. Attach the ubuntu18.04-server-2.qcow2 image to a new virtio disk.

Note, ubuntu18.04-server-2.qcow2 does not have an operating system, it is just a data-only volume.

Mount the volume:

$ sudo mount /dev/vdb /mnt

Attempt to balance:

$ sudo btrfs filesystem balance start --full-balance /mnt
Segmentation fault (core dumped)

Check dmesg for kernel oops:

https://paste.ubuntu.com/p/wjJNqKBCfh/

If you install the test kernel from the following PPA:

You should see this instead:

$ sudo btrfs filesystem balance start --full-balance /mnt
ERROR: error during balancing '/mnt': No space left on device
There may be more info in syslog - try dmesg | tail

Checking dmesg shows no kernel oops, and just info about the volume being too full to balance:

https://paste.ubuntu.com/p/4J8Gq2dtz4/

[Fix]

I found the problem to be introduced in 4.15.0-109-generic, and 4.15.0-108-generic and earlier worked fine, which means we introduced a regression somewhere.

I bisected the problem down to the following commit:

ubuntu-bionic 6f536ce7a978531d38a21d092394616cefb54436
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://paste.ubuntu.com/p/4qfWCM8ykh/

Unfortunately, I believe this is a bad backport. If you examine the original upstream commit:

commit 51415b6c1b117e223bc083e30af675cb5c5498f3
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://github.com/torvalds/linux/commit/51415b6c1b117e223bc083e30af675cb5c5498f3

You will see the 4.15 backport has calls to free_extent_buffer() and btrfs_put_fs_root(). Now, btrfs_put_fs_root() was renamed to btrfs_put_root() in the newer patches, and contains logic to free relocated roots, so I think we might not need the calls to free_extent_buffer() to free the extents first, since it might be handled later.

The core issue is that we hit a general protection fault when attempting to access a root node, which means we have freed a root node we shouldn't have.

If we look at the backport in 5.4.y, aka, the one in Focal:

ubuntu-focal ecaee3a76ea998bc2fe20f056eb27f9bc837d116
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://paste.ubuntu.com/p/PZrMqVt8Yk/

It seems upstream -stable omitted the calls to btrfs_put_root() entirely, and we don't need the calls to free_extent_buffer() because of it.

If I revert 6f536ce7a978531d38a21d092394616cefb54436 from ubuntu-bionic, and cherry-pick ecaee3a76ea998bc2fe20f056eb27f9bc837d116 from ubuntu-focal, and build, the problem no longer reproduces.

[Where problems could occur]

If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation.

Since the issue is triggered during the cancellation of a balance operation, problems might occur for users with nearly full filesystems or filesystems that have existing corruption.

We are replacing a patch that was backported during the fixing of CVE-2019-19036 with a backport provided by upstream developers, which cherry-picks from 5.4.y to Bionic. The patch in 5.4.y is well tested by the community and is currently in the Focal kernel.

With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, back up their data, and attempt a repair if necessary.

[Other info]

A community member has hit this issue before I did, and has reported it upstream to linux-btrfs here, although no one knew what was happening:

https://www.spinics.net/lists/linux-btrfs/msg103367.html

New value:

BugLink: https://bugs.launchpad.net/bugs/1933172

[Impact]

If you attempt to balance a btrfs filesystem that is nearly full, and this filesystem has had a lot of small, medium and large files created and deleted, such that the b-tree needs to be rotated, then when the balance fails due to not having enough free space, the kernel oopses and the btrfs filesystem hangs.

It doesn't appear to cause any filesystem corruption, and is reproducible every time on affected filesystems.

The following oops is generated:

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 18440 Comm: btrfs Not tainted 4.15.0-136-generic #140-Ubuntu
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:btrfs_set_root_node+0x5/0x60 [btrfs]
RSP: 0018:ffffb3db890a79e0 EFLAGS: 00010282
RAX: ffff8d7f73861ad0 RBX: ffff8d7f78455708 RCX: ffff8d7f6d9a5390
RDX: ffff8d7f73861ad0 RSI: a023775cfc0348a3 RDI: ffff8d7f6d9a5028
RBP: ffffb3db890a7a78 R08: 0000000000000044 R09: 0000000000000228
R10: ffff8d7f6d9a5000 R11: 0000000000000010 R12: ffffb3db890a7a08
R13: ffff8d7f6d9a5000 R14: ffff8d7f6d9a5028 R15: ffff8d7f74560000
FS: 00007f48d84498c0(0000) GS:ffff8d7f7fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe4fbc1f000 CR3: 00000001799fc001 CR4: 0000000000160ef0
Call Trace:
 ? commit_fs_roots+0x130/0x1b0 [btrfs]
 ? btrfs_run_delayed_refs.part.70+0x80/0x190 [btrfs]
 btrfs_commit_transaction+0x42c/0x910 [btrfs]
 ? start_transaction+0x191/0x430 [btrfs]
 relocate_block_group+0x1e7/0x640 [btrfs]
 btrfs_relocate_block_group+0x18f/0x280 [btrfs]
 btrfs_relocate_chunk+0x38/0xd0 [btrfs]
 __btrfs_balance+0x972/0xcd0 [btrfs]
 ? insert_balance_item.isra.35+0x391/0x3c0 [btrfs]
 btrfs_balance+0x32c/0x5a0 [btrfs]
 btrfs_ioctl_balance+0x320/0x390 [btrfs]
 btrfs_ioctl+0x5a6/0x2490 [btrfs]
 ? lru_cache_add_active_or_unevictable+0x36/0xb0
 ? __handle_mm_fault+0x9fd/0x1290
 do_vfs_ioctl+0xa8/0x630
 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 ? do_vfs_ioctl+0xa8/0x630
 ? __do_page_fault+0x2a1/0x4b0
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x41/0xa6
RIP: 0033:0x7f48d7228317
RSP: 002b:00007ffd76d03e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f48d7228317
RDX: 00007ffd76d03ec8 RSI: 00000000c4009420 RDI: 0000000000000003
RBP: 00007ffd76d03ec8 R08: 0000000000000078 R09: 0000000000000000
R10: 0000562086e7f010 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ffd76d057cb R14: 0000000000000002 R15: 0000000000000000
Code: 4d 85 e4 0f 84 56 fe ff ff 4d 89 04 24 41 c6 44 24 08 84 4d 89 4c 24 09 e9 42 fe ff ff 0f 0b e8 02 24 5e e0 66 90 0f 1f 44 00 00 <48> 8b 06 48 8b 0d c9 d4 99 e1 48 8b 15 d2 d4 99 e1 55 48 89 87
RIP: btrfs_set_root_node+0x5/0x60 [btrfs] RSP: ffffb3db890a79e0

I don't see this behaviour on any upstream kernel, and the first kernel to show this behaviour is 4.15.0-109-generic. The current 4.15.0-145-generic is still affected.

I believe that this is a regression introduced in the fixing of CVE-2019-19036.

[Testcase]

I haven't reliably been able to create a script which places a btrfs filesystem into the state necessary to reproduce this issue, so I have just provided my qcow2 image with my btrfs filesystem which reproduces the issue 100% of the time.

Download the image from here (warning: size is 8.0 GB):

https://people.canonical.com/~mruffell/sf311164/ubuntu18.04-server-2.qcow2

Make an Ubuntu 18.04 VM. Attach the ubuntu18.04-server-2.qcow2 image to a new virtio disk.

Note, ubuntu18.04-server-2.qcow2 does not have an operating system, it is just a data-only volume.

Mount the volume:

$ sudo mount /dev/vdb /mnt

Attempt to balance:

$ sudo btrfs filesystem balance start --full-balance /mnt
Segmentation fault (core dumped)

Check dmesg for kernel oops:

https://paste.ubuntu.com/p/wjJNqKBCfh/

If you install the test kernel from the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test

You should see this instead:

$ sudo btrfs filesystem balance start --full-balance /mnt
ERROR: error during balancing '/mnt': No space left on device
There may be more info in syslog - try dmesg | tail

Checking dmesg shows no kernel oops, and just info about the volume being too full to balance:

https://paste.ubuntu.com/p/4J8Gq2dtz4/

[Fix]

I found the problem to be introduced in 4.15.0-109-generic, and 4.15.0-108-generic and earlier worked fine, which means we introduced a regression somewhere.

I bisected the problem down to the following commit:

ubuntu-bionic 6f536ce7a978531d38a21d092394616cefb54436
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://paste.ubuntu.com/p/4qfWCM8ykh/

Unfortunately, I believe this is a bad backport. If you examine the original upstream commit:

commit 51415b6c1b117e223bc083e30af675cb5c5498f3
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://github.com/torvalds/linux/commit/51415b6c1b117e223bc083e30af675cb5c5498f3

You will see the 4.15 backport has calls to free_extent_buffer() and btrfs_put_fs_root(). Now, btrfs_put_fs_root() was renamed to btrfs_put_root() in the newer patches, and contains logic to free relocated roots, so I think we might not need the calls to free_extent_buffer() to free the extents first, since it might be handled later.

The core issue is that we hit a general protection fault when attempting to access a root node, which means we have freed a root node we shouldn't have.

If we look at the backport in 5.4.y, aka, the one in Focal:

ubuntu-focal ecaee3a76ea998bc2fe20f056eb27f9bc837d116
Author: Qu Wenruo <wqu@suse.com>
Date: Tue May 19 10:13:20 2020 +0800
Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference
Link: https://paste.ubuntu.com/p/PZrMqVt8Yk/

It seems upstream -stable omitted the calls to btrfs_put_root() entirely, and we don't need the calls to free_extent_buffer() because of it.

If I revert 6f536ce7a978531d38a21d092394616cefb54436 from ubuntu-bionic, and cherry-pick ecaee3a76ea998bc2fe20f056eb27f9bc837d116 from ubuntu-focal, and build, the problem no longer reproduces.

[Where problems could occur]

If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation.

Since the issue is triggered during the cancellation of a balance operation, problems might occur for users with nearly full filesystems or filesystems that have existing corruption.

We are replacing a patch that was backported during the fixing of CVE-2019-19036 with a backport provided by upstream developers, which cherry-picks from 5.4.y to Bionic. The patch in 5.4.y is well tested by the community and is currently in the Focal kernel.

With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, back up their data, and attempt a repair if necessary.

[Other info]

A community member has hit this issue before I did, and has reported it upstream to linux-btrfs here, although no one knew what was happening:

https://www.spinics.net/lists/linux-btrfs/msg103367.html
2021-06-22 03:38:19 Matthew Ruffell tags bionic sts
2021-06-22 09:30:52 Dominique Poulain bug added subscriber Dominique Poulain
2021-06-29 00:20:28 Kelsey Steele linux (Ubuntu Bionic): status In Progress Fix Committed
2021-07-21 15:03:32 Ubuntu Kernel Bot tags bionic sts bionic sts verification-needed-bionic
2021-07-26 03:24:18 Matthew Ruffell tags bionic sts verification-needed-bionic bionic sts verification-done-bionic
2021-08-16 19:46:29 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2021-08-16 19:46:29 Launchpad Janitor cve linked 2019-19036
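
The description in this log gives a concrete kernel range for the regression: 4.15.0-109-generic is the first kernel reported to oops, and 4.15.0-145-generic was still affected at the time of writing. As a minimal sketch (not part of the original report), the Python snippet below checks whether a kernel release string falls inside that reported range. The function name `is_reported_affected` and the restriction to the `-generic` flavour are assumptions made for illustration; the log does not state which kernel version carries the fix, so a result of `False` only means "outside the range named in the report", not "known to be fixed".

```python
import platform
import re

# Bounds taken from the bug description: 4.15.0-109-generic is the first
# kernel reported to oops, and 4.15.0-145-generic was still affected.
FIRST_AFFECTED_ABI = 109
LAST_CONFIRMED_AFFECTED_ABI = 145

# Assumption: only the "-generic" flavour is matched, because that is the
# only flavour the report names. Other flavours use their own ABI numbers.
_RELEASE_RE = re.compile(r"^4\.15\.0-(\d+)-generic$")


def is_reported_affected(release: str) -> bool:
    """Return True if `release` lies in the range the report names as affected.

    False means the release is outside that range (or not a 4.15.0
    generic kernel); it does not prove the kernel contains the fix.
    """
    match = _RELEASE_RE.match(release)
    if match is None:
        return False
    abi = int(match.group(1))
    return FIRST_AFFECTED_ABI <= abi <= LAST_CONFIRMED_AFFECTED_ABI


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    running = platform.release()
    verdict = "inside" if is_reported_affected(running) else "outside"
    print(f"{running} is {verdict} the range reported in bug #1933172")
```

Running the script on a Bionic machine before attempting the balance from the test case gives a quick indication of whether the installed kernel is one the report covers.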