History log of /linux/fs/btrfs/extent_io.c (Results 226 – 250 of 4091)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 2e2bc42c 12-Mar-2024 Ingo Molnar <mingo@kernel.org>

Merge branch 'linus' into x86/boot, to resolve conflict

There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

38b334fc767e Merge t

Merge branch 'linus' into x86/boot, to resolve conflict

There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

38b334fc767e Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:

1c811d403afd x86/sev: Fix position dependent variable references in startup code

... which was also done by an internal merge resolution:

2e5fc4786b7a Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree

But Linus is right in 38b334fc767e, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.

So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.

Conflicts:
arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>

show more ...


# e3afe5dd 07-Mar-2024 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/page_pool_user.c
0b11b1c5c320 ("netdev:

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

net/core/page_pool_user.c
0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors")
429679dcf7d9 ("page_pool: fix netlink dump stop/resume")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 1cab1375 28-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations

During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating

btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations

During fiemap we may have to visit multiple leaves of the subvolume's
inode tree, and each time we are freeing and allocating an extent buffer
to use as a clone of each visited leaf. Optimize this by reusing cloned
extent buffers, to avoid the freeing and re-allocation both of the extent
buffer structure itself and more importantly of the pages attached to the
extent buffer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 978b63f7 28-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race when detecting delalloc ranges during fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in

btrfs: fix race when detecting delalloc ranges during fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
inode's subvolume tree searching for file extent items, without having
the whole fiemap target range locked in the inode's io tree - the
change introduced recently by commit b0ad381fa769 ("btrfs: fix
deadlock with fiemap and extent locking"). It only locks ranges in
the io tree when it finds a hole or prealloc extent since that
commit;

3) Note that fiemap clones each leaf before using it, and this is to
avoid deadlocks when locking a file range in the inode's io tree and
the fiemap buffer is memory mapped to some file, because writing
to the page with btrfs_page_mkwrite() will wait on any ordered extent
for the page's range and the ordered extent needs to lock the range
and may need to modify the same leaf, therefore leading to a deadlock
on the leaf;

4) While iterating the file extent items in the cloned leaf before
finding the hole in the range [64M, 65M[, the delalloc in that range
is flushed and its ordered extent completes - meaning the corresponding
file extent item is in the inode's subvolume tree, but not present in
the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
the cloned leaf (or a file extent item with disk_bytenr == 0 in case
the NO_HOLES feature is not enabled), it will lock that file range in
the inode's io tree and then search for delalloc by checking for the
EXTENT_DELALLOC bit in the io tree for that range and ordered extents
(with btrfs_find_delalloc_in_range()). But it finds nothing since the
delalloc in that range was already flushed and the ordered extent
completed and is gone - as a result fiemap will not report that there's
delalloc or an extent for the range [64M, 65M[, so user space will be
mislead into thinking that there's a hole in that range.

This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
--- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
@@ -1,3 +1,9 @@
QA output created by 094
fiemap run with sync
fiemap run without sync
+ERROR: couldn't find extent at 7
+map is 'HHDDHPPDPHPH'
+logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
+logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
iterate the inode's subvolume tree searching for file extent items,
just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
to report, write instead to a temporary buffer (1 page), and when that
buffer becomes full, stop iterating the file extent items, unlock the
range in the io tree, release the search path, submit all the entries
kept in that buffer to the fiemap buffer, and then resume the search
for file extent items after locking again the remainder of the range in
the io tree.

The buffer having a size of a page, allows for 146 entries in a system
with 4K pages. This is a large enough value to have a good performance
by avoiding too many restarts of the search for file extent items.
In other words this preserves the huge performance gains made in the
last two years to fiemap, while avoiding the deadlocks in case the
fiemap buffer is memory mapped to the same file (useless in practice,
but possible and exercised by fuzz testing and syzbot).

Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# ef5a05c5 24-Feb-2024 Chengming Zhou <zhouchengming@bytedance.com>

btrfs: remove SLAB_MEM_SPREAD flag use

The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: re

btrfs: remove SLAB_MEM_SPREAD flag use

The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.

[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 970ea374 06-Feb-2024 David Sterba <dsterba@suse.com>

btrfs: pass a valid extent map cache pointer to __get_extent_map()

We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call

btrfs: pass a valid extent map cache pointer to __get_extent_map()

We can pass a valid em cache pointer down to __get_extent_map() and
drop the validity check. This avoids the special case, the call stacks
are simple:

btrfs_read_folio
btrfs_do_readpage
__get_extent_map

extent_readahead
contiguous_readpages
btrfs_do_readpage
__get_extent_map

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# e84bfffc 24-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read

The fs_info and sectorsize remain the same during the loops, no need to
set them on each iteration.

Reviewed-by: Joha

btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_read

The fs_info and sectorsize remain the same during the loops, no need to
set them on each iteration.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 41044b41 14-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helper to get fs_info from struct inode pointer

Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases

btrfs: add helper to get fs_info from struct inode pointer

Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# b33d2e53 14-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helpers to get fs_info from page/folio pointers

Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cas

btrfs: add helpers to get fs_info from page/folio pointers

Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create loop between ctree.h <-> fs.h, or the headers would have to
be restructured.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# c8293894 13-Sep-2023 David Sterba <dsterba@suse.com>

btrfs: add helpers to get inode from page/folio pointers

Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. Th

btrfs: add helpers to get inode from page/folio pointers

Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 2b712e3b 25-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: remove unused included headers

With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (w

btrfs: remove unused included headers

With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 4e00422e 16-Jan-2024 David Sterba <dsterba@suse.com>

btrfs: replace sb::s_blocksize by fs_info::sectorsize

The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use

btrfs: replace sb::s_blocksize by fs_info::sectorsize

The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.

Replace all uses, in most cases it's fewer pointer indirections.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# dfba9f47 14-Dec-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

btrfs: add set_folio_extent_mapped() helper

Turn set_page_extent_mapped() into a wrapper around this version.
Saves a call to compound_head() for callers who already have a folio
and removes a coupl

btrfs: add set_folio_extent_mapped() helper

Turn set_page_extent_mapped() into a wrapper around this version.
Saves a call to compound_head() for callers who already have a folio
and removes a couple of users of page->mapping.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 8fd2b12e 02-Jan-2024 Josef Bacik <josef@toxicpanda.com>

btrfs: WARN_ON_ONCE() in our leak detection code

fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak
detection code (enabled only in debug builds) so that fstests will fail
if thes

btrfs: WARN_ON_ONCE() in our leak detection code

fstests looks for WARN_ON's in dmesg. Add WARN_ON_ONCE() to our leak
detection code (enabled only in debug builds) so that fstests will fail
if these things trip at all. This will allow us to easily catch
problems with our reference counting that may otherwise go unnoticed.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 84cda1a6 05-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: cache folio size and shift in extent_buffer

After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
perform

btrfs: cache folio size and shift in extent_buffer

After the conversion to folio interfaces (but without the patch to
enable larger folio allocation), there is an LTP report about observable
performance drop on metadata heavy operations.

https://lore.kernel.org/linux-btrfs/202312221750.571925bd-oliver.sang@intel.com/

This drop is caused by the extra code of calculating the
folio_size()/folio_shift(), instead of the old hard coded
PAGE_SIZE/PAGE_SHIFT.

To slightly reduce the overhead, just cache both folio_size and
folio_shift in extent_buffer.

The two new members (u32 folio_size and u8 folio_shift) are stored
inside the holes of extent_buffer. folio_size is shared with len, which
is reduced to u32. The size of eb does not change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 4d02b543 08-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: remove unused variable bio_offset from end_bbio_data_read()

The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs:
pass bio_offset to check_data_csum() directly"), when we ar

btrfs: remove unused variable bio_offset from end_bbio_data_read()

The variable @bio_offset was introduced in commit 7ffd27e378d2 ("btrfs:
pass bio_offset to check_data_csum() directly"), when we are still using
the same endio function for both data and metadata.

Later we had several changes to data and metadata endio functions:

- Data verification is handled by btrfs bio layer

- Split data and metadata endio paths

Now for data path we no longer do any verification in
end_bbio_data_read(), as the verification is handled by btrfs bio layer
already.

Thus there is no need for such bio_offset variable.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 8bab0a30 08-Jan-2024 Qu Wenruo <wqu@suse.com>

btrfs: remove the pg_offset parameter from btrfs_get_extent()

The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to

btrfs: remove the pg_offset parameter from btrfs_get_extent()

The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.

Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.

Thus the @pg_offset parameter is not really necessary, just remove it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 3c94ba52 04-Mar-2024 Ingo Molnar <mingo@kernel.org>

Merge tag 'v6.8-rc7' into x86/cleanups, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 29cd8555 26-Feb-2024 Ingo Molnar <mingo@kernel.org>

Merge tag 'v6.8-rc6' into x86/boot, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>


# fecc5155 23-Feb-2024 Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/udp.c
f796feabb9f5 ("udp: add local "peek offset enabled" fla

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/udp.c
f796feabb9f5 ("udp: add local "peek offset enabled" flag")
56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)")

Adjacent changes:

net/unix/garbage.c
aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.")
11498715f266 ("af_unix: Remove io_uring code for GC.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 03c11eb3 14-Feb-2024 Ingo Molnar <mingo@kernel.org>

Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch

Conflicts:
arch/x86/include/asm/percpu.h
arch/x86/include/asm/text-patching.h

Signed-off-by: Ingo Molnar <mingo@k

Merge tag 'v6.8-rc4' into x86/percpu, to resolve conflicts and refresh the branch

Conflicts:
arch/x86/include/asm/percpu.h
arch/x86/include/asm/text-patching.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>

show more ...


# 42ac0be1 26-Jan-2024 Ingo Molnar <mingo@kernel.org>

Merge branch 'linus' into x86/mm, to refresh the branch and pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 7505aa14 01-Mar-2024 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix freeing allocated id for anon dev when snapshot creation fails

Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- fix freeing allocated id for anon dev when snapshot creation fails

- fiemap fixes:
- followup for a recent deadlock fix, ranges that fiemap can access
can still race with ordered extent completion
- make sure fiemap with SYNC flag does not race with writes

* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double free of anonymous device after snapshot creation failure
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
btrfs: fix race between ordered extent completion and fiemap

show more ...


# 418b0902 22-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view o

btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given

When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.

When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# a1a4a9ca 22-Feb-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between ordered extent completion and fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in

btrfs: fix race between ordered extent completion and fiemap

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice but
it may be triggered by fuzz testing (syzbot, etc).

However by not locking the target extent range for the whole duration of
the fiemap call we can race with an ordered extent. This happens like
this:

1) The fiemap task finishes processing a file extent item that covers
the file range [512K, 1M[, and that file extent item is the last item
in the leaf currently being processed;

2) And ordered extent for the file range [768K, 2M[, in COW mode,
completes (btrfs_finish_one_ordered()) and the file extent item
covering the range [512K, 1M[ is trimmed to cover the range
[512K, 768K[ and then a new file extent item for the range [768K, 2M[
is inserted in the inode's subvolume tree;

3) The fiemap task calls fiemap_next_leaf_item(), which then calls
btrfs_next_leaf() to find the next leaf / item. This finds that the
the next key following the one we previously processed (its type is
BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding
to the new file extent item inserted by the ordered extent, which has
a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;

4) Later the fiemap code ends up at emit_fiemap_extent() and triggers
the warning:

if (cache->offset + cache->len > offset) {
WARN_ON(1);
return -EINVAL;
}

Since we get 1M > 768K, because the previously emitted entry for the
old extent covering the file range [512K, 1M[ ends at an offset that
is greater than the new extent's start offset (768K). This makes fiemap
fail with -EINVAL besides triggering the warning that produces a stack
trace like the following:

[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76

There's also another case where before calling btrfs_next_leaf() we are
processing a hole or a prealloc extent and we had several delalloc ranges
within that hole or prealloc extent. In that case if the ordered extents
complete before we find the next key, we may end up finding an extent item
with an offset smaller than (or equals to) the offset in cache->offset.

So fix this by changing emit_fiemap_extent() to address these three
scenarios like this:

1) For the first case, steps listed above, adjust the length of the
previously cached extent so that it does not overlap with the current
extent, emit the previous one and cache the current file extent item;

2) For the second case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range,
and the current file extent item has an offset that matches the offset
in the fiemap cache, just discard what we have in the fiemap cache and
assign the current file extent item to the cache, since it's more up
to date;

3) For the third case where he had a hole or prealloc extent with
multiple delalloc ranges inside the hole or prealloc extent's range
and the offset of the file extent item we just found is smaller than
what we have in the cache, just skip the current file extent item
if its range end at or behind the cached extent's end, because we may
have emitted (to the fiemap user space buffer) delalloc ranges that
overlap with the current file extent item's range. If the file extent
item's range goes beyond the end offset of the cached extent, just
emit the cached extent and cache a subrange of the file extent item,
that goes from the end offset of the cached extent to the end offset
of the file extent item.

Dealing with those cases in those ways makes everything consistent by
reflecting the current state of file extent items in the btree and
without emitting extents that have overlapping ranges (which would be
confusing and violating expectations).

This issue could be triggered often with test case generic/561, and was
also hit and reported by Wang Yugui.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


12345678910>>...164