Activity log for bug #1934709

Date Who What changed Old value New value Message
2021-07-06 03:21:39 Matthew Ruffell bug added bug
2021-07-06 03:21:48 Matthew Ruffell linux (Ubuntu): status New Fix Released
2021-07-06 03:21:54 Matthew Ruffell nominated for series Ubuntu Bionic
2021-07-06 03:21:54 Matthew Ruffell bug task added linux (Ubuntu Bionic)
2021-07-06 03:21:59 Matthew Ruffell linux (Ubuntu Bionic): status New In Progress
2021-07-06 03:22:02 Matthew Ruffell linux (Ubuntu Bionic): importance Undecided Medium
2021-07-06 03:22:06 Matthew Ruffell linux (Ubuntu Bionic): assignee Matthew Ruffell (mruffell)
2021-07-06 03:24:32 Matthew Ruffell description [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs]) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007 000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b 000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000 000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4 000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088 #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15) 000003ff801fdd96: b9040057 lgr %r5,%r7 000003ff801fdd9a: a7490b5c lghi %r4,2908 000003ff801fdd9e: b904002a lgr %r2,%r10 000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]) [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs] [<00000000003ce6ce>] do_fsync+0x5e/0x90 [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48 [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address: [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions. 440 static int verify_level_key(struct btrfs_fs_info *fs_info, 441 struct extent_buffer *eb, int level, 442 struct btrfs_key *first_key) 443 { 448 found_level = btrfs_header_level(eb); 449 if (found_level != level) { ... 456 return -EIO; 457 } 458 459 if (!first_key) 460 return 0; 461 462 /* We have @first_key, so this @eb must have at least one item */ 463 if (btrfs_header_nritems(eb) == 0) { 464 btrfs_err(fs_info, 465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0", 466 eb->start); 467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)); 468 return -EUCLEAN; 469 } 470 471 if (found_level) 472 btrfs_node_key_to_cpu(eb, &found_key, 0); 473 else 474 btrfs_item_key_to_cpu(eb, &found_key, 0); 475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ... 487 return ret; 488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commit as it is a fixup commit for "btrfs: Validate child tree block's level and first key" to correct usage of the wrong "key" variable, which also leads to similar symptoms of -EUCLEAN being returned from verify_level_key(). commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug. BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commit as it is a fixup commit for "btrfs: Validate child tree block's level and first key" to correct usage of the wrong "key" variable, which also leads to similar symptoms of -EUCLEAN being returned from verify_level_key(). commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug.
2021-07-06 04:35:50 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commit as it is a fixup commit for "btrfs: Validate child tree block's level and first key" to correct usage of the wrong "key" variable, which also leads to similar symptoms of -EUCLEAN being returned from verify_level_key(). commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug. BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commits as they are fixup commits for "btrfs: Validate child tree block's level and first key". commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6 Author: Qu Wenruo <wqu@suse.com> Date: Fri Apr 13 06:32:47 2018 +0800 Subject: btrfs: Only check first key for committed tree blocks https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6 commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug.
2021-07-06 04:36:22 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commits as they are fixup commits for "btrfs: Validate child tree block's level and first key". commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6 Author: Qu Wenruo <wqu@suse.com> Date: Fri Apr 13 06:32:47 2018 +0800 Subject: btrfs: Only check first key for committed tree blocks https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6 commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug. BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commits as they are fixup commits for "btrfs: Validate child tree block's level and first key". commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6 Author: Qu Wenruo <wqu@suse.com> Date: Fri Apr 13 06:32:47 2018 +0800 Subject: btrfs: Only check first key for committed tree blocks https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6 commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug.
2021-07-07 04:25:18 Matthew Ruffell description BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commits as they are fixup commits for "btrfs: Validate child tree block's level and first key". commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6 Author: Qu Wenruo <wqu@suse.com> Date: Fri Apr 13 06:32:47 2018 +0800 Subject: btrfs: Only check first key for committed tree blocks https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6 commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug. BugLink: https://bugs.launchpad.net/bugs/1934709 [Impact] During an automatic balance, users may encounter an error when writing the transaction log to disk, when the log tree is being parsed, which forces the filesystem to be remounted read-only and outputs the following kernel oops: BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown BTRFS info (device dm-14): forced readonly BTRFS: Transaction aborted (error -117) WARNING: CPU: 7 PID: 10818 at /build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 btrfs_sync_log+0xa28/0xbc0 [btrfs] CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G OE 4.15.0-136-generic #140-Ubuntu Hardware name: IBM 3907 LR1 A00 (LPAR) Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 [btrfs])            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0 Krnl Code: 000003ff801fdd80: c0200002a032 larl %r2,000003ff80251de4            000003ff801fdd86: c0e5fffb2181 brasl %r14,000003ff80162088           #000003ff801fdd8c: a7f40001 brc 15,000003ff801fdd8e           >000003ff801fdd90: e3a0f0a80004 lg %r10,168(%r15)            000003ff801fdd96: b9040057 lgr %r5,%r7            000003ff801fdd9a: a7490b5c lghi %r4,2908            000003ff801fdd9e: b904002a lgr %r2,%r10            000003ff801fdda2: c0300002604f larl %r3,000003ff80249e40 Call Trace: ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]  [<00000000003ce6ce>] do_fsync+0x5e/0x90  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48  [<00000000008ffe5c>] system_call+0xd8/0x2c8 Last Breaking-Event-Address:  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs] BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown BTRFS error (device dm-14): pending csums is 269639680 This bug appears to be linked to bug 1933172, but is different and has a different root cause. Again, I believe that this is a regression introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic. [Fix] Analysis of the kernel oops is as follows: The first thing we see is that BTRFS entered ERROR state with the reason: in balance_level:1962: errno=-117 unknown Now errno -117 is: 100 #define EUCLEAN 117 /* Structure needs cleaning */ btrfs treats -EUCLEAN as if corruption has happened. Let's see where this is returned from. If we start at fs/btrfs/ctree.c in balance_level(), line 1962: 1917 static noinline int balance_level(struct btrfs_trans_handle *trans, 1918 struct btrfs_root *root, 1919 struct btrfs_path *path, int level) 1920 { ... 1958 /* promote the child to a root */ 1959 child = read_node_slot(fs_info, mid, 0); 1960 if (IS_ERR(child)) { 1961 ret = PTR_ERR(child); 1962 btrfs_handle_fs_error(fs_info, ret, NULL); 1963 goto enospc; 1964 } ... 2136 } We are in the middle of a balancing operation, and if you happen to be familiar with how b-tree data structures work, we are promoting a child node to a topmost root node. The error most likely happens in read_node_slot(), with the lines after it printing that an error has happened. 1887 static noinline struct extent_buffer * 1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent, 1889 int slot) 1890 { ... 1900 btrfs_node_key_to_cpu(parent, &first_key, slot); 1901 eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot), 1902 btrfs_node_ptr_generation(parent, slot), 1903 level - 1, &first_key); ... 1910 } There are two calls here which are relevant. btrfs_node_key_to_cpu() and read_tree_block(). Let's look at read_tree_block() in fs/btrfs/disk-io.c: 1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr, 1148 u64 parent_transid, int level, 1149 struct btrfs_key *first_key) 1150 { 1151 struct extent_buffer *buf = NULL; 1152 int ret; 1153 1154 buf = btrfs_find_create_tree_block(fs_info, bytenr); 1155 if (IS_ERR(buf)) 1156 return buf; 1157 1158 ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid, 1159 level, first_key); 1160 if (ret) { 1161 free_extent_buffer_stale(buf); 1162 return ERR_PTR(ret); 1163 } 1164 return buf; 1165 1166 } The interesting one here is btree_read_extent_buffer_pages(): 498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info, 499 struct extent_buffer *eb, 500 u64 parent_transid, int level, 501 struct btrfs_key *first_key) 502 { ... 511 while (1) { 512 clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); 513 ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE, 514 btree_get_extent, mirror_num); 515 if (!ret) { 516 if (verify_parent_transid(io_tree, eb, 517 parent_transid, 0)) 518 ret = -EIO; 519 else if (verify_level_key(fs_info, eb, level, 520 first_key)) 521 ret = -EUCLEAN; 522 else 523 break; 524 } 525 526 /* 527 * This buffer's crc is fine, but its contents are corrupted, so 528 * there is no reason to read the other copies, they won't be 529 * any less wrong. 530 */ 531 if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) || 532 ret == -EUCLEAN) 533 break; ... If read_extent_buffer_pages() returns zero, we call verify_level_key(), and if this returns non-zero, we return -EUCLEAN. verify_level_key() can fail on three conditions.  440 static int verify_level_key(struct btrfs_fs_info *fs_info,  441 struct extent_buffer *eb, int level,  442 struct btrfs_key *first_key)  443 {  448 found_level = btrfs_header_level(eb);  449 if (found_level != level) { ...  456 return -EIO;  457 }  458  459 if (!first_key)  460 return 0;  461  462 /* We have @first_key, so this @eb must have at least one item */  463 if (btrfs_header_nritems(eb) == 0) {  464 btrfs_err(fs_info,  465 "invalid tree nritems, bytenr=%llu nritems=0 expect >0",  466 eb->start);  467 WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));  468 return -EUCLEAN;  469 }  470  471 if (found_level)  472 btrfs_node_key_to_cpu(eb, &found_key, 0);  473 else  474 btrfs_item_key_to_cpu(eb, &found_key, 0);  475 ret = btrfs_comp_cpu_keys(first_key, &found_key); ...  487 return ret;  488 } 1) If the eb level doesn't match the provided level. 2) If the eb has 0 items. 3) If the found_key doesn't match the first_key. With the information we currently have, we don't know what one caused the problem. I looked to see when verify_level_key() was first introduced. It seems it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with the commit: ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://paste.ubuntu.com/p/DQjTkfNRDt/ If you look at this particular hunk: https://paste.ubuntu.com/p/z4qXw2Jp9K/ We see lines 18 - 26 of the above pastebin are introduced here. Looking at the original upstream commit: commit 581c1760415c48cca9349b198bba52dd38750765 Author: Qu Wenruo <wqu@suse.com> Date: Thu Mar 29 09:08:11 2018 +0800 Subject: btrfs: Validate child tree block's level and first key Link: https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765 Particularly the same hunk: https://paste.ubuntu.com/p/vW3Y9BvK4C/ There is a subtle difference, the second if statement is extended with the ret == -EUCLEAN check, and not implemented entirely. Why is this? I looked up when the if statement was first introduced, and it was a very old commit from v2.6.39-rc1: https://paste.ubuntu.com/p/vPT36xXq66/ Particularly this hunk: https://paste.ubuntu.com/p/2jmQRSmVfc/ Interesting. Why is a commit from 4.15-109-generic re-implement something that should have been there since 2.6.39? I checked upstream, and found the if statement to be removed entirely. That is when I came across: commit f8397d69daef06d358430d3054662fb597e37c00 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00 Which talks about balance operations failing out of the blue after a raid 1 disk was added back to the array. The commit removes the if statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT inside the while loop. I checked the Bionic 4.15 kernel, and I found that this commit was applied in 4.15.0-56-generic: ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5 Author: Nikolay Borisov <nborisov@suse.com> Date: Tue Nov 6 16:40:20 2018 +0200 Subject: btrfs: Always try all copies when reading extent buffers Link: https://paste.ubuntu.com/p/TS2c5Mptr2/ It appears that the if statement was removed in 4.15.0-56-generic intentionally, and was brought back mistakenly in a backport of "btrfs: Validate child tree block's level and first key" 4.15.0-109-generic. The root cause is likely some interaction between bug 1933172 and this bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect first_key being set for the root node, which means we end up returning -EUCLEAN and aborting the transaction. Unfortunately, this is the second bad backport of CVE-2019-19036. The fix is to revert "btrfs: Validate child tree block's level and first key" and its dependency commit "btrfs: Detect unbalanced tree with empty leaf before crashing btree operations", and re-apply correct backports of these commits with that if statement removed, to keep the spirit of the already applied "btrfs: Always try all copies when reading extent buffers". We will also add the below commits as they are fixup commits for "btrfs: Validate child tree block's level and first key". commit 5d41be6f702f19f72db816c17175caf9dbdcdfa6 Author: Qu Wenruo <wqu@suse.com> Date: Fri Apr 13 06:32:47 2018 +0800 Subject: btrfs: Only check first key for committed tree blocks https://github.com/torvalds/linux/commit/5d41be6f702f19f72db816c17175caf9dbdcdfa6 commit 17515f1b764df36271f3166c714f5a78301fbaa7 Author: Qu Wenruo <wqu@suse.com> Date: Mon Apr 23 17:32:04 2018 +0800 Subject: btrfs: Fix wrong first_key parameter in replace_path Link: https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7 [Testcase] Unfortunately, the customer did not image the affected filesystem and has since restored a backup ontop of it. I have been attempting to reproduce this issue for some time, but I have not experienced the same call trace. I ran into bug 1933172 while trying to reproduce this bug. I have been trying to balance nearly full btrfs filesystems, and I have looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always try all copies when reading extent buffers" but I haven't experienced a crash yet. I have a test kernel in the following ppa: https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2 If you install it, balances still complete as expected. I will keep attempting to reproduce the issue, and will update this section if I manage to create a testcase. For regression testing, I have run xfstests btrfs/* on both 4.15.0-136-generic and the test kernel, and they both share the same results: 4.15.0-136-generic from -updates: https://paste.ubuntu.com/p/4MpZ5YVMnv/ 4.15.0-136-generic test kernel from above ppa: https://paste.ubuntu.com/p/CsxzbCkJmJ/ [Where problems could occur] If a regression were to occur, it would affect users of btrfs filesystems, and would likely show during a routine balance operation. I believe affected users would have nearly full filesystems, and would also experience filesystem corruption from bug 1933172, which would then cause the issues from this bug when the transaction log is written to disk. With all modifications to btrfs, there is a risk of data corruption and filesystem corruption for all btrfs users, since balances happen automatically and on a regular basis. If a regression does happen, users should remount their filesystems with the "nobalance" flag, backup their data, and attempt a repair if necessary. [Other info] A community member has hit this issue before, and has reported it upstream to linux-btrfs here, although they never received a reply. https://www.spinics.net/lists/linux-btrfs/msg112261.html I have written to Richard and asked if he has any additional information that might help reproduce, but I have yet to receive a reply. If you read Richard's mailing list link, it mentions filesystem corruption with missing extents. This suggests this crash might be linked to bug 1933172, which I came across while trying to reproduce the issue in this bug.
2021-07-16 15:36:46 Kleber Sacilotto de Souza linux (Ubuntu Bionic): status In Progress Fix Committed
2021-07-21 15:03:19 Ubuntu Kernel Bot tags verification-needed-bionic
2021-07-26 05:53:35 Matthew Ruffell tags verification-needed-bionic bionic sts verification-done-bionic
2021-08-16 19:46:29 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2021-08-16 19:46:29 Launchpad Janitor cve linked 2019-19036