
	JFFS2 LOCKING DOCUMENTATION
	---------------------------

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem mutex, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocations (jffs2_reserve_space()) and freed
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector will obtain this right at the beginning of
jffs2_garbage_collect_pass() and release it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong. This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards.
Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


	File Mutex f->sem
	-----------------

This is the JFFS2-internal equivalent of the inode mutex i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space. The
allocation may trigger garbage-collection, which may need to move a
node belonging to the inode which was locked in the first place by the
VFS.
If the garbage collection code were to attempt to lock the i_sem
of the inode from which it's garbage-collecting a physical node, this
would lead to deadlock, unless we played games with unlocking the i_sem
before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
mutex, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file mutexes in one thread.
	   No ordering rules have been made for doing so.
	3. Never lock a page cache page with f->sem held.


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file mutex f->sem. But the erase code
may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object). So, the inocache_lock
has to be locked while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after 4 milliard inode numbers are used.
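
As a rough userspace illustration of allocating an inode number under
a lock, consider the sketch below. It is not the JFFS2 code: struct
sb_model and alloc_ino() are hypothetical names invented for this
example, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the relevant part of jffs2_sb_info. */
struct sb_model {
	pthread_mutex_t inocache_lock;	/* models the inocache_lock spinlock */
	uint32_t highest_ino;		/* highest inode number handed out so far */
};

/* Hand out the next inode number while holding inocache_lock. */
uint32_t alloc_ino(struct sb_model *c)
{
	uint32_t ino;

	assert(c != NULL);

	pthread_mutex_lock(&c->inocache_lock);
	ino = ++c->highest_ino;		/* no handling of 32-bit wrap-around */
	pthread_mutex_unlock(&c->inocache_lock);

	return ino;
}
```

Because the increment and the read of the new value happen inside one
critical section, two concurrent callers in this model cannot be given
the same number.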

Note, the f->sem guarantees that the corresponding jffs2_inode_cache
will not be removed. So, it is allowed to access it without locking
the inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	c->erase_completion_lock has to be acquired first.


	erase_free_sem
	--------------

This mutex is only used by the erase code which frees obsolete node
references and the jffs2_garbage_collect_deletion_dirent() function.
The latter function on NAND flash must read _obsolete_ nodes to
determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked. Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.


	c->xattr_sem
	------------

This read/write semaphore protects against concurrent access to the
xattr related objects, which include stuff in the superblock and ic->xref.
On read-only paths, taking the semaphore for writing is more exclusion
than needed; taking it for reading is enough. But you must hold it for
writing when updating, creating or deleting any xattr related object.

Once xattr_sem is released, there is no guarantee that those objects
still exist. Thus, a sequence of operations often has to be retried
when it turns out, while holding the read semaphore, that such an object
needs to be updated. For example, do_jffs2_getxattr() first holds the
read semaphore to scan xref and xdatum. But if it needs to load the
name/value pair from the medium, it releases the read semaphore and
retries the process holding the write semaphore.

Ordering constraints:
	Lock xattr_sem last, after the alloc_sem.