| #
7e0e7bd6 |
| 15-Jun-2026 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner: "Features:
- Reduce pipe->mutex contention by pre-allocating pages outside the lock in anon_pipe_write().
anon_pipe_write() called alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex. Now up to 8 pages are pre-allocated before the mutex is taken, leftovers are recycled into the per-pipe tmp_page[] cache before unlock, and any remainder is released after unlock, keeping the allocator out of the critical section on both sides. On a writers x readers sweep with 64KB writes against a 1 MB pipe, throughput improves 6-28% and average write latency drops 5-22%; under memory pressure - when the cost of holding the mutex across reclaim is highest - throughput improves 21-48% and latency drops 17-33%. The microbenchmark is added to selftests.
- uaccess/sockptr: fix the ignored_trailing logic in copy_struct_to_user() to behave as documented and the usize check in copy_struct_from_sockptr() for user pointers, and add copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr() helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).
- bpf: add a sleepable bpf_real_inode() kfunc that resolves the real inode backing a dentry via d_real_inode(). On overlayfs the inode attached to the dentry doesn't carry the underlying device information; this is used by the filesystem restriction BPF program that was merged into systemd.
- docs: add guidelines for submitting new filesystems, motivated by the maintenance burden abandoned and untestable filesystems impose on VFS developers, blocking infrastructure work like folio conversions and iomap migration.
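As an illustration of the extensible-struct copy mentioned in the uaccess/sockptr item, here is a minimal userspace sketch of the behaviour the message calls "as documented". It rests on assumptions: the function name model_copy_struct_to() is hypothetical, plain memory stands in for user memory, and the rules modelled are copy the common prefix, zero any extra destination bytes, and report non-zero source bytes that the smaller destination cannot see. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model only. "dst" stands in for the consumer's buffer of
 * usize bytes, "src" for the producer's struct of ksize bytes.
 * The function name is hypothetical.
 */
static int model_copy_struct_to(void *dst, size_t usize,
				const void *src, size_t ksize,
				bool *ignored_trailing)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *s = src;
	bool trailing = false;

	assert(dst && src);

	/* Consumer struct is larger: zero the fields the producer lacks. */
	if (usize > ksize)
		memset((unsigned char *)dst + size, 0, usize - size);

	/* Consumer struct is smaller: note any non-zero byte that is cut off. */
	for (size_t i = size; i < ksize; i++)
		if (s[i])
			trailing = true;

	if (ignored_trailing)
		*ignored_trailing = trailing;

	/* Copy only the part both sides understand. */
	memcpy(dst, src, size);
	return 0;
}
```

A caller that must not silently drop data can check the flag and return an error when it is set; a caller that only offers the extra fields as optional information can ignore it.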
Fixes:
- libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() and drop the now-redundant assignments in callers. This began as a one-line dma-buf fix for a path_noexec() warning; a pseudo filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo() callers were audited: the only visible effect is on dma-buf where SB_I_NOEXEC silences the warning.
- Handle set_blocksize() failures in legacy filesystems (bfs, hpfs, qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a device with a sector size > PAGE_SIZE crashed roughly half of them; the rest had the same missing error handling pattern. Plus a follow-up releasing the superblock buffer_head when setting the minix v3 block size fails.
- mount: honour SB_NOUSER in the new mount API.
- fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by switching the process-group paths of send_sigio() and send_sigurg() from read_lock(&tasklist_lock) to RCU, matching the single-PID path.
- vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing delegated NFS mounts (fsopen() in a container with the mount performed by a privileged daemon) that broke when non-init s_user_ns was tied to FS_USERNS_MOUNT.
- selftests/namespaces: fix a hang in nsid_test where an unreaped grandchild kept the TAP pipe write-end open, a waitpid(-1) race in listns_efault_test, and a false FAIL on kernels without listns() where the tests should SKIP.
- filelock: fix the break_lease() stub signature for CONFIG_FILE_LOCKING=n.
- init/initramfs_test: wait for the async initramfs unpacking before running; the test and do_populate_rootfs() share the parser state.
- fs/coredump: reduce redundant log noise in validate_coredump_safety().
- iomap: pass the correct length to fserror_report_io() in __iomap_write_begin().
- backing-file: fix the backing_file_open() kerneldoc.
Cleanups:
- initramfs: refactor the cpio hex header parsing to use hex2bin() instead of the hand-rolled simple_strntoul() which is reverted, and extend the initramfs KUnit tests to cover header fields with 0x prefixes.
- Replace __get_free_pages() and friends with kmalloc()/kzalloc() across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2, isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the do_mounts init code - part of the larger work of replacing page allocator calls with kmalloc().
- Use clear_and_wake_up_bit() in unlock_buffer() and journal_end_buffer_io_sync() instead of open-coding the sequence.
- Drop unused VFS exports: unexport drop_super_exclusive(), remove start_removing_user_path_at(), and fold __start_removing_path() into start_removing_path().
- fs/read_write: narrow the __kernel_write() export with EXPORT_SYMBOL_FOR_MODULES().
- vfs: uapi: retire octal and hex constants in favor of (1 << n) for the O_ flags. Finding a free bit for a new flag across the architectures was needlessly hard with the mixed bases.
- dcache: add extra sanity checks of dead dentries in dentry_free() via a new DENTRY_WARN_ONCE() that also prints d_flags.
- iov_iter: use kmemdup_array() in dup_iter() to harden the allocation against multiplication overflow.
- fs/pipe: write to ->poll_usage only once.
- vfs: remove an always-taken if-branch in find_next_fd().
- dcache: use kmalloc_flex() for struct external_name in __d_alloc().
- namei: use QSTR() instead of QSTR_INIT() in path_pts().
- sync_file_range: delete dead S_ISLNK code.
- Comment fixes: retire a stale comment in fget_task_next() and fix assorted spelling mistakes"
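The iov_iter item above hardens an array duplication against multiplication overflow. The idea can be sketched in userspace as follows. This is an analogy, not the kernel's kmemdup_array(): the helper name dup_array_checked() is hypothetical, malloc() stands in for the kernel allocator, and __builtin_mul_overflow() is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate an array of "count" elements of "elem_size" bytes each.
 * Returns NULL if count * elem_size does not fit in size_t, instead of
 * allocating a too-small buffer and copying past its end.
 */
static void *dup_array_checked(const void *src, size_t count, size_t elem_size)
{
	size_t bytes;
	void *dst;

	assert(elem_size > 0);

	if (__builtin_mul_overflow(count, elem_size, &bytes))
		return NULL;

	/* malloc(0) may return NULL, so always request at least one byte. */
	dst = malloc(bytes ? bytes : 1);
	if (dst && bytes)
		memcpy(dst, src, bytes);
	return dst;
}
```

The open-coded form, malloc(count * elem_size), wraps around silently for a large count; performing the multiplication in one checked helper closes that gap at every call site.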
* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...
|
| #
99f41427 |
| 28-May-2026 |
Christian Brauner <brauner@kernel.org> |
Merge patch series "fs/pipe: reduce pipe->mutex contention by pre-allocating outside the lock"
Breno Leitao <leitao@debian.org> says:
While profiling Meta's caching code[1], I found pipe->mutex contention on the hot path. anon_pipe_write() currently calls alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex.
This series pre-allocates pages outside pipe->mutex in anon_pipe_write(): for writes that span more than one full page, up to PIPE_PREALLOC_MAX (8) pages are allocated via a per-page alloc_page() loop before the mutex is taken. anon_pipe_get_page() then drains the prealloc array first, falls back to the per-pipe tmp_page[] cache, and only enters the allocator under the mutex for the leftover pages (writes larger than PIPE_PREALLOC_MAX, single-page writes that skip prealloc, or shortfalls when the prealloc loop fails). Leftover prealloc pages are recycled into tmp_page[] before unlock and any remainder is put_page()'d after unlock, keeping the allocator out of the critical section on both sides.
alloc_pages_bulk_mempolicy() looked tempting but the bulk allocator refuses __GFP_ACCOUNT under memcg -- it returns at most one page when memcg_kmem_online() && (gfp & __GFP_ACCOUNT), see commit 8dcb3060d81d ("memcg: page_alloc: skip bulk allocator for __GFP_ACCOUNT"). A per-page loop keeps memcg accounting and the task NUMA mempolicy honoured uniformly without open-coding the charge.
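The locking pattern described above can be sketched in userspace. This is a toy model under stated assumptions, not the fs/pipe.c code: struct toy_pipe, toy_get_page(), toy_write(), the ring capacity, and the two-slot cache size are all hypothetical, malloc()/free() stand in for alloc_page()/put_page(), and a pthread mutex stands in for pipe->mutex. Only the limit of 8 pre-allocated pages and the ordering (prealloc array, then cache, then allocator under the lock) come from the text.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PREALLOC_MAX 8	/* mirrors PIPE_PREALLOC_MAX (8) from the text */
#define TOY_TMP_SLOTS    2	/* assumed cache size, for illustration only */
#define TOY_RING_SLOTS   16	/* hypothetical ring capacity */
#define TOY_PAGE_SIZE    4096

struct toy_pipe {
	pthread_mutex_t mutex;
	void *tmp_page[TOY_TMP_SLOTS];	/* per-pipe page cache */
	void *ring[TOY_RING_SLOTS];	/* queued pages */
	size_t used;
};

/* Called with the mutex held: prealloc array, then cache, then allocator. */
static void *toy_get_page(struct toy_pipe *p, void **pre, size_t *npre)
{
	if (*npre > 0)
		return pre[--*npre];
	for (size_t i = 0; i < TOY_TMP_SLOTS; i++) {
		if (p->tmp_page[i]) {
			void *pg = p->tmp_page[i];

			p->tmp_page[i] = NULL;
			return pg;
		}
	}
	return malloc(TOY_PAGE_SIZE);	/* slow path: allocator under the lock */
}

/* Queue up to npages pages; returns how many were queued. */
static size_t toy_write(struct toy_pipe *p, size_t npages)
{
	void *pre[TOY_PREALLOC_MAX];
	size_t npre = 0, written = 0;

	assert(p != NULL);

	/* 1. Multi-page writes allocate up front, before taking the mutex. */
	if (npages > 1) {
		size_t want = npages < TOY_PREALLOC_MAX ? npages : TOY_PREALLOC_MAX;

		while (npre < want) {
			void *pg = malloc(TOY_PAGE_SIZE);

			if (!pg)
				break;	/* shortfall is made up under the lock */
			pre[npre++] = pg;
		}
	}

	/* 2. The critical section mostly just hands out ready pages. */
	pthread_mutex_lock(&p->mutex);
	while (written < npages && p->used < TOY_RING_SLOTS) {
		void *pg = toy_get_page(p, pre, &npre);

		if (!pg)
			break;
		p->ring[p->used++] = pg;
		written++;
	}
	/* 3. Recycle leftovers into the cache before unlocking. */
	for (size_t i = 0; i < TOY_TMP_SLOTS && npre > 0; i++)
		if (!p->tmp_page[i])
			p->tmp_page[i] = pre[--npre];
	pthread_mutex_unlock(&p->mutex);

	/* 4. Free whatever did not fit in the cache, outside the lock. */
	while (npre > 0)
		free(pre[--npre]);
	return written;
}
```

In this model a short write (the ring fills up before all pages are queued) is what produces leftovers; steps 3 and 4 keep both the allocation and the release off the locked path.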
I also vibe-coded a microbenchmark to validate the change. It sweeps writers x readers over {1,2,5} x {1,5,10} with 64KB writes against a 1 MB pipe and prints throughput + latency percentiles per config.
Measured on arm64 and also on x86 using virtme-ng (16 vCPUs, 64KB writes, 1 MB pipe). The numbers below were collected on v1 (alloc_pages_bulk()); v2's per-page loop preserves the dominant "allocation outside the mutex" win and is expected to land in the same range.
== No memory pressure (10s per config) ==
Throughput in MB/s (baseline -> patched, delta):

  writers   readers=1             readers=5             readers=10
  1         1119 -> 1354 (+21%)   1132 -> 1195 (+6%)    1060 -> 1240 (+17%)
  2         1162 -> 1487 (+28%)   1034 -> 1285 (+24%)   1069 -> 1213 (+14%)
  5         1152 -> 1357 (+18%)   1021 -> 1164 (+14%)    997 -> 1239 (+24%)

Avg write latency in ns (baseline -> patched, delta):

  writers   readers=1                 readers=5                 readers=10
  1          55786 ->  46103 (-17%)    55164 ->  52260 (-5%)     58906 ->  50370 (-14%)
  2         107546 ->  84011 (-22%)   120837 ->  97206 (-20%)   116860 -> 103036 (-12%)
  5         271293 -> 230170 (-15%)   306089 -> 268429 (-12%)   313300 -> 252232 (-19%)
Throughput improves +6% to +28% and average write latency drops 5% to 22% across every configuration.
== Under memory pressure (--memory-pressure, 6s per config) ==
stress-ng --vm 2 --vm-bytes 50% --vm-keep is forked alongside the sweep so the alloc_page() calls inside anon_pipe_write() routinely hit direct reclaim -- exactly the regime the patch targets.
Throughput in MB/s (baseline -> patched, delta):

  writers   readers=1             readers=5             readers=10
  1         1088 -> 1438 (+32%)    996 -> 1477 (+48%)    989 -> 1194 (+21%)
  2         1076 -> 1378 (+28%)   1007 -> 1269 (+26%)   1018 -> 1234 (+21%)
  5         1052 -> 1311 (+25%)    986 -> 1225 (+24%)    972 -> 1249 (+29%)

Avg write latency in ns (baseline -> patched, delta):

  writers   readers=1                 readers=5                 readers=10
  1          57397 ->  43406 (-24%)    62690 ->  42272 (-33%)    63136 ->  52272 (-17%)
  2         116121 ->  90700 (-22%)   124098 ->  98481 (-21%)   122754 -> 101217 (-18%)
  5         297122 -> 238322 (-20%)   316836 -> 255095 (-19%)   321496 -> 250189 (-22%)
Throughput improves +21% to +48% and average write latency drops 17% to 33% -- a noticeably bigger win than the no-pressure run.
That tracks: when alloc_page() has to dip into reclaim, the cost of holding pipe->mutex across it is highest, and pulling the allocation out of the critical section pays the most.
* patches from https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org:
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
Link: https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf [1]
Link: https://patch.msgid.link/20260524-fix_pipe-v3-0-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
| #
d29bd8ef |
| 24-May-2026 |
Breno Leitao <leitao@debian.org> |
selftests/pipe: add pipe_bench microbenchmark
Add a small selftest that stresses pipe->mutex contention by spawning N writer threads that hammer a single pipe with multi-page writes, plus M reader threads that drain. Each writer records its own write() latency samples into a log2-bucketed histogram; main aggregates and prints total writes, throughput, average and percentile (p50/p99) latencies, and the maximum observed latency.
Pass --memory-pressure to fork stress-ng (--vm 4 --vm-bytes 80% --vm-method all) for the duration of the run, so alloc_page() in anon_pipe_write() routinely hits direct reclaim. The flag fails fast if stress-ng is not on $PATH.
The program prints something like the following for different writers, readers, message sizes, and memory pressure settings:
  config: writers=X readers=Y msgsize=Z duration=3 pipe_size=1048576 memory_pressure=[no|yes]
  writes: total=54451 rate=18150/s
  throughput_MBps: 1134.40
  lat_avg_ns: 275355
  lat_p50_ns_upper: 262143
  lat_p99_ns_upper: 1048575
  lat_max_ns: 2145633
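The log2-bucketed histogram behind the percentile lines can be sketched as below. This is an assumption-laden illustration, not the pipe_bench source: the names lat_hist, lat_bucket(), lat_record(), and lat_percentile_upper() are hypothetical, and the bucket layout is inferred from the "_upper" suffix and from sample values such as 262143 and 1048575, which have the form 2^k - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAT_BUCKETS 64

struct lat_hist {
	uint64_t count[LAT_BUCKETS];
	uint64_t total;
};

/* Bucket b covers [2^b, 2^(b+1) - 1] ns; 0 ns and 1 ns both land in bucket 0. */
static unsigned int lat_bucket(uint64_t ns)
{
	unsigned int b = 0;

	while (ns >>= 1)
		b++;
	return b;
}

static void lat_record(struct lat_hist *h, uint64_t ns)
{
	h->count[lat_bucket(ns)]++;
	h->total++;
}

/* Largest value that bucket b can hold. */
static uint64_t lat_bucket_upper(unsigned int b)
{
	return b >= 63 ? UINT64_MAX : (UINT64_C(1) << (b + 1)) - 1;
}

/*
 * Upper bound of the bucket that contains the pct-th percentile sample.
 * The exact percentile is not recoverable from bucket counts, hence
 * an upper bound rather than a value.
 */
static uint64_t lat_percentile_upper(const struct lat_hist *h, unsigned int pct)
{
	uint64_t target, seen = 0;

	assert(pct >= 1 && pct <= 100);
	if (h->total == 0)
		return 0;

	target = (h->total * pct + 99) / 100;	/* ceil(total * pct / 100) */
	for (unsigned int b = 0; b < LAT_BUCKETS; b++) {
		seen += h->count[b];
		if (seen >= target)
			return lat_bucket_upper(b);
	}
	return UINT64_MAX;
}
```

Recording costs one array increment per sample, and per-thread histograms can be merged by adding bucket counts, which fits the description of each writer keeping its own samples and main aggregating them.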
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260524-fix_pipe-v3-2-bb4a75d23a90@debian.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|