#
2e2bc42c |
| 12-Mar-2024 |
Ingo Molnar <mingo@kernel.org> |
Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because in the following merge conflict resolution in <asm/coco.h>:
38b334fc767e Merge t
Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because in the following merge conflict resolution in <asm/coco.h>:
38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Linus has resolved the conflicting placement of 'cc_mask' better than the original commit:
1c811d403afd x86/sev: Fix position dependent variable references in startup code
... which was also done by an internal merge resolution:
2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree
But Linus is right in 38b334fc767e, the 'cc_mask' declaration is sufficient within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.
So instead of forcing Linus to do the same resolution again, merge in Linus's tree and follow his conflict resolution.
Conflicts: arch/x86/include/asm/coco.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
show more ...
|
#
e3afe5dd |
| 07-Mar-2024 |
Jakub Kicinski <kuba@kernel.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
net/core/page_pool_user.c 0b11b1c5c320 ("netdev:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
1cab1375 |
| 28-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
During fiemap we may have to visit multiple leaves of the subvolume's inode tree, and each time we are freeing and allocating
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
During fiemap we may have to visit multiple leaves of the subvolume's inode tree, and each time we are freeing and allocating an extent buffer to use as a clone of each visited leaf. Optimize this by reusing cloned extent buffers, to avoid the freeing and re-allocation both of the extent buffer structure itself and more importantly of the pages attached to the extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
978b63f7 |
| 28-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in
btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for file regions that are currently holes, so the caller of fiemap will not be aware that there's data for some file regions. This can be quite serious for some use cases - for example in coreutils versions before 9.0, the cp program used fiemap to detect holes and data in the source file, copying only regions with data (extents or delalloc) from the source file to the destination file in order to preserve holes (see the documentation for its --sparse command line option). This means that if cp was used with a source file that had delalloc in a hole, the destination file could end up without that data, which is effectively a data loss issue, if it happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the inode's subvolume tree searching for file extent items, without having the whole fiemap target range locked in the inode's io tree - the change introduced recently by commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking"). It only locks ranges in the io tree when it finds a hole or prealloc extent since that commit;
3) Note that fiemap clones each leaf before using it, and this is to avoid deadlocks when locking a file range in the inode's io tree and the fiemap buffer is memory mapped to some file, because writing to the page with btrfs_page_mkwrite() will wait on any ordered extent for the page's range and the ordered extent needs to lock the range and may need to modify the same leaf, therefore leading to a deadlock on the leaf;
4) While iterating the file extent items in the cloned leaf before finding the hole in the range [64M, 65M[, the delalloc in that range is flushed and its ordered extent completes - meaning the corresponding file extent item is in the inode's subvolume tree, but not present in the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in the cloned leaf (or a file extent item with disk_bytenr == 0 in case the NO_HOLES feature is not enabled), it will lock that file range in the inode's io tree and then search for delalloc by checking for the EXTENT_DELALLOC bit in the io tree for that range and ordered extents (with btrfs_find_delalloc_in_range()). But it finds nothing since the delalloc in that range was already flushed and the ordered extent completed and is gone - as a result fiemap will not report that there's delalloc or an extent for the range [64M, 65M[, so user space will be mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094 from fstests, which reports a missing extent/delalloc range like this:
generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad) --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000 @@ -1,3 +1,9 @@ QA output created by 094 fiemap run with sync fiemap run without sync +ERROR: couldn't find extent at 7 +map is 'HHDDHPPDPHPH' +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2 +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1 ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where the fiemap buffer is memory mapped to the same file, change fiemap to work like the following:
1) Always lock the whole range in the inode's io tree before starting to iterate the inode's subvolume tree searching for file extent items, just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent to report, write instead to a temporary buffer (1 page), and when that buffer becomes full, stop iterating the file extent items, unlock the range in the io tree, release the search path, submit all the entries kept in that buffer to the fiemap buffer, and then resume the search for file extent items after locking again the remainder of the range in the io tree.
The buffer having a size of a page, allows for 146 entries in a system with 4K pages. This is a large enough value to have a good performance by avoiding too many restarts of the search for file extent items. In other words this preserves the huge performance gains made in the last two years to fiemap, while avoiding the deadlocks in case the fiemap buffer is memory mapped to the same file (useless in practice, but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ef5a05c5 |
| 24-Feb-2024 |
Chengming Zhou <zhouchengming@bytedance.com> |
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: re
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
970ea374 |
| 06-Feb-2024 |
David Sterba <dsterba@suse.com> |
btrfs: pass a valid extent map cache pointer to __get_extent_map()
We can pass a valid em cache pointer down to __get_extent_map() and drop the validity check. This avoids the special case, the call
btrfs: pass a valid extent map cache pointer to __get_extent_map()
We can pass a valid em cache pointer down to __get_extent_map() and drop the validity check. This avoids the special case, the call stacks are simple:
btrfs_read_folio btrfs_do_readpage __get_extent_map
extent_readahead contiguous_readpages btrfs_do_readpage __get_extent_map
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e84bfffc |
| 24-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read
The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration.
Reviewed-by: Joha
btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read
The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
41044b41 |
| 14-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: add helper to get fs_info from struct inode pointer
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases
btrfs: add helper to get fs_info from struct inode pointer
Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b33d2e53 |
| 14-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: add helpers to get fs_info from page/folio pointers
Add convenience helpers to get a fs_info from a page or folio pointer instead of open coding the chain or using btrfs_sb() that in some cas
btrfs: add helpers to get fs_info from page/folio pointers
Add convenience helpers to get a fs_info from a page or folio pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page, folio, btrfs_root and btrfs_fs_info. The latter can't be static inlines as this would create loop between ctree.h <-> fs.h, or the headers would have to be restructured.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c8293894 |
| 13-Sep-2023 |
David Sterba <dsterba@suse.com> |
btrfs: add helpers to get inode from page/folio pointers
Add convenience helpers to get a struct btrfs_inode from a page or folio pointer instead of open coding the chain or intermediate BTRFS_I. Th
btrfs: add helpers to get inode from page/folio pointers
Add convenience helpers to get a struct btrfs_inode from a page or folio pointer instead of open coding the chain or intermediate BTRFS_I. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page or address_space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2b712e3b |
| 25-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (w
btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4e00422e |
| 16-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: replace sb::s_blocksize by fs_info::sectorsize
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use
btrfs: replace sb::s_blocksize by fs_info::sectorsize
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize.
Replace all uses, in most cases it's fewer pointer indirections.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
dfba9f47 |
| 14-Dec-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
btrfs: add set_folio_extent_mapped() helper
Turn set_page_extent_mapped() into a wrapper around this version. Saves a call to compound_head() for callers who already have a folio and removes a coupl
btrfs: add set_folio_extent_mapped() helper
Turn set_page_extent_mapped() into a wrapper around this version. Saves a call to compound_head() for callers who already have a folio and removes a couple of users of page->mapping.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
8fd2b12e |
| 02-Jan-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: WARN_ON_ONCE() in our leak detection code
fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak detection code (enabled only in debug builds) so that fstests will fail if thes
btrfs: WARN_ON_ONCE() in our leak detection code
fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak detection code (enabled only in debug builds) so that fstests will fail if these things trip at all. This will allow us to easily catch problems with our reference counting that may otherwise go unnoticed.
Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
84cda1a6 |
| 05-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: cache folio size and shift in extent_buffer
After the conversion to folio interfaces (but without the patch to enable larger folio allocation), there is an LTP report about observable perform
btrfs: cache folio size and shift in extent_buffer
After the conversion to folio interfaces (but without the patch to enable larger folio allocation), there is an LTP report about observable performance drop on metadata heavy operations.
https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/
This drop is caused by the extra code of calculating the folio_size()/folio_shift(), instead of the old hard coded PAGE_SIZE/PAGE_SHIFT.
To slightly reduce the overhead, just cache both folio_size and folio_shift in extent_buffer.
The two new members (u32 folio_size and u8 folio_shift) are stored inside the holes of extent_buffer. folio_size is shared with len, which is reduced to u32. The size of eb does not change.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4d02b543 |
| 08-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove unused variable bio_offset from end_bbio_data_read()
The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs: pass bio_offset to check_data_csum() directly"), when we ar
btrfs: remove unused variable bio_offset from end_bbio_data_read()
The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs: pass bio_offset to check_data_csum() directly"), when we are still using the same endio function for both data and metadata.
Later we had several changes to data and metadata endio functions:
- Data verification is handled by btrfs bio layer
- Split data and metadata endio paths
Now for data path we no longer do any verification in end_bbio_data_read(), as the verification is handled by btrfs bio layer already.
Thus there is no need for such bio_offset variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
8bab0a30 |
| 08-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: remove the pg_offset parameter from btrfs_get_extent()
The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to
btrfs: remove the pg_offset parameter from btrfs_get_extent()
The parameter @pg_offset of btrfs_get_extent() is only utilized for inlined extent, and we already have an ASSERT() and tree-checker, to make sure we can only get inline extent at file offset 0.
Any invalid inline extent with non-zero file offset would be rejected by tree-checker in the first place.
Thus the @pg_offset parameter is not really necessary, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3c94ba52 |
| 04-Mar-2024 |
Ingo Molnar <mingo@kernel.org> |
Merge tag 'v6.8-rc7' into x86/cleanups, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
29cd8555 |
| 26-Feb-2024 |
Ingo Molnar <mingo@kernel.org> |
Merge tag 'v6.8-rc6' into x86/boot, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
fecc5155 |
| 23-Feb-2024 |
Jakub Kicinski <kuba@kernel.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
Conflicts:
net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" fla
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
Conflicts:
net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)")
Adjacent changes:
net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
03c11eb3 |
| 14-Feb-2024 |
Ingo Molnar <mingo@kernel.org> |
Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch
Conflicts: arch/x86/include/asm/percpu.h arch/x86/include/asm/text-patching.h
Signed-off-by: Ingo Molnar <mingo@k
Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch
Conflicts: arch/x86/include/asm/percpu.h arch/x86/include/asm/text-patching.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
show more ...
|
#
42ac0be1 |
| 26-Jan-2024 |
Ingo Molnar <mingo@kernel.org> |
Merge branch 'linus' into x86/mm, to refresh the branch and pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
7505aa14 |
| 01-Mar-2024 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix freeing allocated id for anon dev when snapshot creation fails
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix freeing allocated id for anon dev when snapshot creation fails
- fiemap fixes: - followup for a recent deadlock fix, ranges that fiemap can access can still race with ordered extent completion - make sure fiemap with SYNC flag does not race with writes
* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double free of anonymous device after snapshot creation failure btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given btrfs: fix race between ordered extent completion and fiemap
show more ...
|
#
418b0902 |
| 22-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that are no concurrent writes and we get a stable view o
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that are no concurrent writes and we get a stable view of the inode's extent layout.
When the flag is given we flush all IO (and wait for ordered extents to complete) and then lock the inode in shared mode, however that leaves open the possibility that a write might happen right after the flushing and before locking the inode. So fix this by flushing again after locking the inode - we leave the initial flushing before locking the inode to avoid holding the lock and blocking other RO operations while waiting for IO and ordered extents to complete. The second flushing while holding the inode's lock will most of the time do nothing or very little since the time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
a1a4a9ca |
| 22-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between ordered extent completion and fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in
btrfs: fix race between ordered extent completion and fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).
However by not locking the target extent range for the whole duration of the fiemap call we can race with an ordered extent. This happens like this:
1) The fiemap task finishes processing a file extent item that covers the file range [512K, 1M[, and that file extent item is the last item in the leaf currently being processed;
2) And ordered extent for the file range [768K, 2M[, in COW mode, completes (btrfs_finish_one_ordered()) and the file extent item covering the range [512K, 1M[ is trimmed to cover the range [512K, 768K[ and then a new file extent item for the range [768K, 2M[ is inserted in the inode's subvolume tree;
3) The fiemap task calls fiemap_next_leaf_item(), which then calls btrfs_next_leaf() to find the next leaf / item. This finds that the the next key following the one we previously processed (its type is BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding to the new file extent item inserted by the ordered extent, which has a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
4) Later the fiemap code ends up at emit_fiemap_extent() and triggers the warning:
if (cache->offset + cache->len > offset) { WARN_ON(1); return -EINVAL; }
Since we get 1M > 768K, because the previously emitted entry for the old extent covering the file range [512K, 1M[ ends at an offset that is greater than the new extent's start offset (768K). This makes fiemap fail with -EINVAL besides triggering the warning that produces a stack trace like the following:
[1621.677651] ------------[ cut here ]------------ [1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs] [1621.677899] Modules linked in: btrfs blake2b_generic (...) [1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1 [1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678033] Code: 2b 4c 89 63 (...) [1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206 [1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000 [1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90 [1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000 [1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000 [1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850 [1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000 [1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0 [1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1621.678056] Call Trace: [1621.678074] <TASK> [1621.678076] ? __warn+0x80/0x130 [1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678159] ? report_bug+0x1f4/0x200 [1621.678164] ? handle_bug+0x42/0x70 [1621.678167] ? exc_invalid_op+0x14/0x70 [1621.678170] ? asm_exc_invalid_op+0x16/0x20 [1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs] [1621.678253] extent_fiemap+0x766/0xa30 [btrfs] [1621.678339] btrfs_fiemap+0x45/0x80 [btrfs] [1621.678420] do_vfs_ioctl+0x1e4/0x870 [1621.678431] __x64_sys_ioctl+0x6a/0xc0 [1621.678434] do_syscall_64+0x52/0x120 [1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76
There's also another case where before calling btrfs_next_leaf() we are processing a hole or a prealloc extent and we had several delalloc ranges within that hole or prealloc extent. In that case if the ordered extents complete before we find the next key, we may end up finding an extent item with an offset smaller than (or equals to) the offset in cache->offset.
So fix this by changing emit_fiemap_extent() to address these three scenarios like this:
1) For the first case, steps listed above, adjust the length of the previously cached extent so that it does not overlap with the current extent, emit the previous one and cache the current file extent item;
2) For the second case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range, and the current file extent item has an offset that matches the offset in the fiemap cache, just discard what we have in the fiemap cache and assign the current file extent item to the cache, since it's more up to date;
3) For the third case where he had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range and the offset of the file extent item we just found is smaller than what we have in the cache, just skip the current file extent item if its range end at or behind the cached extent's end, because we may have emitted (to the fiemap user space buffer) delalloc ranges that overlap with the current file extent item's range. If the file extent item's range goes beyond the end offset of the cached extent, just emit the cached extent and cache a subrange of the file extent item, that goes from the end offset of the cached extent to the end offset of the file extent item.
Dealing with those cases in those ways makes everything consistent by reflecting the current state of file extent items in the btree and without emitting extents that have overlapping ranges (which would be confusing and violating expectations).
This issue could be triggered often with test case generic/561, and was also hit and reported by Wang Yugui.
Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/ Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|