.. SPDX-License-Identifier: GPL-2.0

================
FUSE Passthrough
================

Introduction
============

FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
performance of FUSE filesystems for I/O operations. Typically, FUSE operations
involve communication between the kernel and a userspace FUSE daemon, which can
incur overhead. Passthrough allows certain operations on a FUSE file to bypass
the userspace daemon and be executed directly by the kernel on an underlying
"backing file".

This is achieved by the FUSE daemon registering a file descriptor (pointing to
the backing file on a lower filesystem) with the FUSE kernel module. The kernel
then returns an identifier (``backing_id``) for this registered backing file.
When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
the ``OPEN`` request, include this ``backing_id`` and set the
``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
operations.

Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
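
The registration flow described above can be sketched from the daemon's side as
follows. This is a minimal sketch, not a reference implementation: it assumes
the UAPI definitions (``struct fuse_backing_map``, ``struct fuse_open_out``)
from a ``<linux/fuse.h>`` that already carries passthrough support, the helper
names are hypothetical, and connection setup and error handling are omitted.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fuse.h>   /* assumed recent enough to define passthrough */

    /*
     * Hypothetical helper: register backing_fd on the FUSE connection.
     * Returns the backing_id on success, or -1 with errno set.
     */
    static int register_backing(int fuse_dev_fd, int backing_fd)
    {
        struct fuse_backing_map map;

        memset(&map, 0, sizeof(map));   /* flags and padding stay zero */
        map.fd = backing_fd;

        /* The ioctl's return value is the backing_id. */
        return ioctl(fuse_dev_fd, FUSE_DEV_IOC_BACKING_OPEN, &map);
    }

    /* Hypothetical helper: fill an OPEN reply that requests passthrough. */
    static void fill_passthrough_reply(struct fuse_open_out *out, uint64_t fh,
                                       int backing_id)
    {
        memset(out, 0, sizeof(*out));
        out->fh = fh;
        out->open_flags = FOPEN_PASSTHROUGH;
        out->backing_id = backing_id;
    }

    /* Hypothetical helper: drop the registration when it is no longer used. */
    static int unregister_backing(int fuse_dev_fd, int backing_id)
    {
        uint32_t id = (uint32_t)backing_id;

        return ioctl(fuse_dev_fd, FUSE_DEV_IOC_BACKING_CLOSE, &id);
    }

A daemon would typically call ``register_backing()`` before replying to an
``OPEN`` request, pass the result to ``fill_passthrough_reply()``, and call
``unregister_backing()`` once no further opens will reuse the ``backing_id``.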

Enabling Passthrough
====================

To use FUSE passthrough:

  1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
     enabled.
  2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
     ``FUSE_PASSTHROUGH`` capability and specify its desired
     ``max_stack_depth``.
  3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
     on its connection file descriptor (e.g., ``/dev/fuse``) to register a
     backing file descriptor and obtain a ``backing_id``.
  4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
     replies with the ``FOPEN_PASSTHROUGH`` flag set in
     ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
     in ``fuse_open_out::backing_id``.
  5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
     the ``backing_id`` to release the kernel's reference to the backing file
     when it's no longer needed for passthrough setups.

Privilege Requirements
======================

Setting up passthrough functionality currently requires the FUSE daemon to
possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
security and resource management considerations that are actively being
discussed and worked on.
The primary reasons for this restriction are detailed below.

Resource Accounting and Visibility
----------------------------------

The core mechanism for passthrough involves the FUSE daemon opening a file
descriptor to a backing file and registering it with the FUSE kernel module via
the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
associated with a kernel-internal ``struct fuse_backing`` object, which holds a
reference to the backing ``struct file``.

A significant concern arises because the FUSE daemon can close its own file
descriptor to the backing file after registration. The kernel, however, will
still hold a reference to the ``struct file`` via the ``struct fuse_backing``
object as long as it's associated with a ``backing_id`` (or subsequently, with
an open FUSE file in passthrough mode).

This behavior leads to two main issues for unprivileged FUSE daemons:

  1. **Invisibility to lsof and other inspection tools**: Once the FUSE
     daemon closes its file descriptor, the open backing file held by the kernel
     becomes "hidden." Standard tools like ``lsof``, which typically inspect
     process file descriptor tables, would not be able to identify that this
     file is still open by the system on behalf of the FUSE filesystem.
     This makes it difficult for system administrators to track resource usage
     or debug issues related to open files (e.g., preventing unmounts).

  2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
     resource limits, including the maximum number of open file descriptors
     (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
     and then close its own FDs, it could potentially cause the kernel to hold
     an unlimited number of open ``struct file`` references without these being
     accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
     denial-of-service (DoS) by exhausting system-wide file resources.

The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
restricting this powerful capability to trusted processes.

**NOTE**: ``io_uring`` addresses a similar issue by exposing its "fixed files",
which are visible via ``fdinfo`` and accounted under the registering user's
``RLIMIT_NOFILE``.

Filesystem Stacking and Shutdown Loops
--------------------------------------

Another concern relates to the potential for creating complex and problematic
filesystem stacking scenarios if unprivileged users could set up passthrough.
A FUSE passthrough filesystem might use a backing file that resides:

  * On the *same* FUSE filesystem.
  * On another filesystem (like OverlayFS) which itself might have an upper or
    lower layer that is a FUSE filesystem.

These configurations could create dependency loops, particularly during
filesystem shutdown or unmount sequences, leading to deadlocks or system
instability. This is conceptually similar to the risks associated with the
``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.

To mitigate this, FUSE passthrough already incorporates checks based on
filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
the ``max_stack_depth`` it supports. When a backing file is registered via
``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
filesystem stack depth is within the allowed limit.

The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
ensuring that only privileged users can create these potentially complex
stacking arrangements.
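
The depth check can be illustrated with a small userspace model. This is a
simplified sketch rather than the kernel's code: the function name is
hypothetical, and both the strictness of the comparison and the ``-ELOOP``
return value are assumptions made for illustration.

.. code-block:: c

    #include <errno.h>

    /*
     * Simplified userspace model of the depth check; not kernel code.
     * backing_depth stands in for the backing superblock's s_stack_depth,
     * max_stack_depth for the value negotiated during FUSE_INIT.
     */
    static int check_backing_depth(unsigned int backing_depth,
                                   unsigned int max_stack_depth)
    {
        /* Assumption: the depth must stay strictly below the maximum. */
        if (backing_depth >= max_stack_depth)
            return -ELOOP;
        return 0;
    }

Under this model, with a negotiated ``max_stack_depth`` of 1, a backing file on
a filesystem with depth 0 is accepted, while one on a filesystem with depth 1
is rejected.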

General Security Posture
------------------------

As a general principle for new kernel features that allow userspace to instruct
the kernel to perform direct operations on its behalf based on user-provided
file descriptors, starting with a higher privilege requirement (like
``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
the feature to be used and tested while further security implications are
evaluated and addressed.
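
Because passthrough setup currently depends on ``CAP_SYS_ADMIN``, a daemon may
want to probe for the capability before attempting registration and fall back
to regular FUSE I/O otherwise. The following is a minimal sketch with
hypothetical helper names; it only reads the effective capability mask from
``/proc/self/status`` and should be treated as a hint, since the result of the
registration ioctl itself remains authoritative.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>
    #include <linux/capability.h>   /* CAP_SYS_ADMIN */

    /* Return 1 if bit 'cap' is set in a capability mask, else 0. */
    static int cap_mask_has(uint64_t mask, int cap)
    {
        return (int)((mask >> cap) & 1);
    }

    /*
     * Read the effective capability mask from /proc/self/status.
     * Returns 0 on success, -1 if the CapEff line could not be read.
     */
    static int read_effective_caps(uint64_t *mask)
    {
        char line[256];
        unsigned long long value;
        int found = -1;
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "CapEff: %llx", &value) == 1) {
                *mask = value;
                found = 0;
                break;
            }
        }
        fclose(f);
        return found;
    }

A daemon could call ``read_effective_caps()`` at startup and skip passthrough
setup when ``cap_mask_has(mask, CAP_SYS_ADMIN)`` returns 0.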