.. SPDX-License-Identifier: GPL-2.0

=============
FUSE Overview
=============

Definitions
===========

Userspace filesystem:
  A filesystem in which data and metadata are provided by an ordinary
  userspace process. The filesystem can be accessed normally through
  the kernel interface.

Filesystem daemon:
  The process(es) providing the data and metadata of the filesystem.

Non-privileged mount (or user mount):
  A userspace filesystem mounted by a non-privileged (non-root) user.
  The filesystem daemon is running with the privileges of the mounting
  user. NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

Filesystem connection:
  A connection between the filesystem daemon and the kernel. The
  connection exists until either the daemon dies, or the filesystem is
  umounted. Note that detaching (or lazy umounting) the filesystem
  does *not* break the connection, in this case it will exist until
  the last reference to the filesystem is released.

Mount owner:
  The user who does the mounting.

User:
  The user who is performing filesystem operations.

What is FUSE?
=============

FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).

One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.

The userspace library and utilities are available from the
`FUSE homepage: <https://github.com/libfuse/>`_

Filesystem type
===============

The filesystem type given to mount(2) can be one of the following:

    fuse
      This is the usual way to mount a FUSE filesystem. The first
      argument of the mount system call may contain an arbitrary string,
      which is not interpreted by the kernel.

    fuseblk
      The filesystem is block device based. The first argument of the
      mount system call is interpreted as the name of the device.

Mount options
=============

fd=N
  The file descriptor to use for communication between the userspace
  filesystem and the kernel.
  The file descriptor must have been
  obtained by opening the FUSE device ('/dev/fuse').

rootmode=M
  The file mode of the filesystem's root in octal representation.

user_id=N
  The numeric user id of the mount owner.

group_id=N
  The numeric group id of the mount owner.

default_permissions
  By default FUSE doesn't check file access permissions; the
  filesystem is free to implement its access policy or leave it to
  the underlying file access mechanism (e.g. in case of network
  filesystems). This option enables permission checking, restricting
  access based on file mode. It is usually useful together with the
  'allow_other' mount option.

allow_other
  This option overrides the security measure restricting file access
  to the user mounting the filesystem. This option is by default only
  allowed to root, but this restriction can be removed with a
  (userspace) configuration option.

max_read=N
  With this option the maximum size of read operations can be set.
  The default is infinite. Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).

blksize=N
  Set the block size for the filesystem. The default is 512.
  This option is only valid for 'fuseblk' type mounts.

Control filesystem
==================

There's a control filesystem for FUSE, which can be mounted by::

  mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.

Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

    waiting
      The number of requests which are waiting to be transferred to
      userspace or being processed by the filesystem daemon. If there is
      no filesystem activity and 'waiting' is non-zero, then the
      filesystem is hung or deadlocked.

    abort
      Writing anything into this file will abort the filesystem
      connection. This means that all waiting requests will be aborted and
      an error returned for all aborted and new requests.

    max_background
      The maximum number of background requests that can be outstanding
      at a time.
      When the number of background requests reaches this limit,
      further requests will be blocked until some are completed, potentially
      causing I/O operations to stall.

    congestion_threshold
      The threshold of background requests at which the kernel considers
      the filesystem to be congested. When the number of background requests
      exceeds this value, the kernel will skip asynchronous readahead
      operations, reducing read-ahead optimizations but preserving essential
      I/O, as well as suspending non-synchronous writeback operations
      (WB_SYNC_NONE), delaying page cache flushing to the filesystem.

Only the owner of the mount may read or write these files.

Interrupting filesystem operations
##################################

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  - If the request is not yet sent to userspace AND the signal is
    fatal (SIGKILL or unhandled fatal signal), then the request is
    dequeued and returns immediately.

  - If the request is not yet sent to userspace AND the signal is not
    fatal, then an interrupted flag is set for the request. When
    the request has been successfully transferred to userspace and
    this flag is set, an INTERRUPT request is queued.

  - If the request is already sent to userspace, then an INTERRUPT
    request is queued.

INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the *original* request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:

  1. The INTERRUPT request is processed before the original request is
     processed

  2. The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In case
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
reply will be ignored.

Aborting a filesystem connection
================================

It is possible to get into certain situations where the filesystem is
not responding.
Reasons for this may be:

  a) Broken userspace filesystem implementation

  b) Network connection down

  c) Accidental deadlock

  d) Malicious deadlock

(For more on c) and d) see later sections)

In any of these cases it may be useful to abort the connection to
the filesystem. There are several ways to do this:

  - Kill the filesystem daemon. Works in case of a) and b)

  - Kill the filesystem daemon and all users of the filesystem. Works
    in all cases except some malicious deadlocks

  - Use forced umount (umount -f). Works in all cases but only if
    the filesystem is still attached (it hasn't been lazy unmounted)

  - Abort the filesystem through the FUSE control filesystem. Most
    powerful method, always works.

How do non-privileged mounts work?
==================================

Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system.
Obvious requirements arising from this are:

 A) mount owner should not be able to get elevated privileges with the
    help of the mounted filesystem

 B) mount owner should not get illegitimate access to information from
    other users' and the super user's processes

 C) mount owner should not be able to induce undesired behavior in
    other users' or the super user's processes

How are requirements fulfilled?
===============================

 A) The mount owner could gain elevated privileges by either:

    1. creating a filesystem containing a device file, then opening
       this device

    2. creating a filesystem containing a suid or sgid application,
       then executing this application

    The solution is to disallow opening device files and to ignore
    setuid and setgid bits when executing programs. To ensure this,
    fusermount always adds "nosuid" and "nodev" to the mount options
    for non-privileged mounts.

 B) If another user is accessing files or directories in the
    filesystem, the filesystem daemon serving requests can record the
    exact sequence and timing of operations performed. This
    information is otherwise inaccessible to the mount owner, so this
    counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

 C) There are several ways in which the mount owner can induce
    undesired behavior in other users' processes, such as:

    1) mounting a filesystem over a file or directory which the mount
       owner would otherwise not be able to modify (or could only
       make limited modifications).

       This is solved in fusermount, by checking the access
       permissions on the mountpoint and only allowing the mount if
       the mount owner can do unlimited modification (has write
       access to the mountpoint, and mountpoint is not a "sticky"
       directory)

    2) Even if 1) is solved the mount owner can change the behavior
       of other users' processes.

       i) It can slow down or indefinitely delay the execution of a
          filesystem operation, creating a DoS against the user or the
          whole system. For example a suid application locking a
          system file, and then accessing a file on the mount owner's
          filesystem could be stopped, thus causing the system
          file to be locked forever.

       ii) It can present files or directories of unlimited length, or
           directory structures of unlimited depth, possibly causing a
           system process to eat up diskspace, memory or other
           resources, again causing *DoS*.

       The solution to this as well as B) is to deny filesystem access
       to processes which the mount owner could not otherwise monitor
       or manipulate. If the mount owner can ptrace a process, it can
       already do all of the above without using a FUSE mount, so the
       same criteria as used in ptrace can be used to check if a
       process is allowed to access the filesystem or not.

       Note that the *ptrace* check is not strictly necessary to
       prevent C/2/i; it is enough to check if the mount owner has enough
       privilege to send a signal to the process accessing the
       filesystem, since *SIGSTOP* can be used to get a similar effect.

I think these limitations are unacceptable?
===========================================

If a sysadmin trusts the users enough, or can ensure through other
measures that system processes will never enter non-privileged
mounts, they can relax the last limitation in several ways:

  - With the 'user_allow_other' config option.
    If this config option is
    set, the mounting user can add the 'allow_other' mount option which
    disables the check for other users' processes.

    User namespaces have an unintuitive interaction with 'allow_other':
    an unprivileged user - normally restricted from mounting with
    'allow_other' - could do so in a user namespace where they're
    privileged. If any process could access such an 'allow_other' mount
    this would give the mounting user the ability to manipulate
    processes in user namespaces where they're unprivileged. For this
    reason 'allow_other' restricts access to users in the same userns
    or a descendant.

  - With the 'allow_sys_admin_access' module option. If this option is
    set, super user's processes have unrestricted access to mounts
    irrespective of allow_other setting or user namespace of the
    mounting user.

Note that both of these relaxations expose the system to potential
information leak or *DoS* as described in points B and C/2/i-ii in the
preceding section.

Kernel - userspace interface
============================

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.
::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |
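
The request flow in the diagram can be mimicked in ordinary userspace
code. The following is a minimal sketch in Python, not kernel code: the
``Connection`` class, its method names and the dictionary used as a
request are hypothetical stand-ins chosen to mirror the labels in the
diagram (pending list, processing list, per-request wait queue).

.. code-block:: python

    import errno
    import itertools
    import queue
    import threading


    class Connection:
        """Toy model of one connection (hypothetical, for illustration only)."""

        def __init__(self):
            self.pending = queue.Queue()   # requests not yet read by the daemon
            self.processing = {}           # unique id -> request awaiting a reply
            self.lock = threading.Lock()
            self.ids = itertools.count(1)

        def request_send(self, opcode, name):
            """Caller side: queue a request, then sleep until it is answered."""
            req = {"unique": next(self.ids), "opcode": opcode, "name": name,
                   "error": None, "done": threading.Event()}
            self.pending.put(req)          # wakes a daemon blocked in dev_read()
            req["done"].wait()             # plays the role of sleeping on req->waitq
            return req["error"]

        def dev_read(self):
            """Daemon side: take the next request and move it to 'processing'."""
            req = self.pending.get()       # blocks while nothing is pending
            with self.lock:
                self.processing[req["unique"]] = req
            return req["unique"], req["opcode"], req["name"]

        def dev_write(self, unique, error):
            """Daemon side: match the reply to its request and wake the caller."""
            with self.lock:
                req = self.processing.pop(unique)
            req["error"] = error
            req["done"].set()


    def daemon(conn, files):
        """Serve one request; a real daemon would loop."""
        unique, opcode, name = conn.dev_read()
        if opcode == "UNLINK" and name in files:
            files.remove(name)             # [perform unlink]
            error = 0
        else:
            error = -errno.ENOENT          # this toy treats anything else as missing
        conn.dev_write(unique, error)


    conn = Connection()
    files = {"file"}
    worker = threading.Thread(target=daemon, args=(conn, files))
    worker.start()
    result = conn.request_send("UNLINK", "file")
    worker.join()
    print(result, sorted(files))

The caller blocks in ``request_send()`` until the daemon thread has
written its reply, which is the same hand-off the two columns of the
diagram describe.
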

.. note:: Everything in the description above is greatly simplified

There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.

**Scenario 1 - Simple deadlock**::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.

**Scenario 2 - Tricky deadlock**

This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.
::

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.

This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
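
The idea of delaying abort while a copy is in progress can be
illustrated with a small userspace analogy. The sketch below is written
in Python and is an assumption-laden model, not the kernel
implementation: the ``Request`` class, ``copy_reply()`` and ``abort()``
are hypothetical names, and a condition variable stands in for whatever
mechanism the kernel actually uses. Only the ``locked`` flag is taken
from the text above.

.. code-block:: python

    import threading


    class Request:
        """Toy request; 'locked' plays the role the text gives to req->locked."""

        def __init__(self):
            self.cond = threading.Condition()
            self.locked = False
            self.aborted = False
            self.abort_requested = threading.Event()
            self.data = bytearray()
            self.log = []


    def copy_reply(req, chunks):
        """Copy reply chunks into the request; returns False if already aborted."""
        with req.cond:
            if req.aborted:
                return False
            req.locked = True              # copy in progress
        for chunk in chunks:               # may stall, as a page fault would
            req.data += chunk
        with req.cond:
            req.locked = False
            req.log.append("copy finished")
            req.cond.notify_all()          # let a waiting abort proceed
        return True


    def abort(req):
        """Mark the request aborted, but finish only once no copy is running."""
        with req.cond:
            req.aborted = True
            req.abort_requested.set()
            while req.locked:
                req.cond.wait()
            req.log.append("abort finished")


    req = Request()
    copy_started = threading.Event()
    resume_copy = threading.Event()


    def slow_chunks():
        yield b"head"
        copy_started.set()                 # the copy is now half done
        resume_copy.wait()                 # stall until the main thread resumes it
        yield b"body"


    copier = threading.Thread(target=copy_reply, args=(req, slow_chunks()))
    aborter = threading.Thread(target=abort, args=(req,))
    copier.start()
    copy_started.wait()
    aborter.start()                        # abort arrives while the copy is stalled
    req.abort_requested.wait()
    resume_copy.set()
    copier.join()
    aborter.join()
    print(req.log, bytes(req.data))

In this model the abort is requested in the middle of the copy, yet the
log always records the copy finishing first and the copied data is
complete. A copy attempted after the abort is refused.
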