	$Id: README.Locking,v 1.9 2004/11/20 10:35:40 dwmw2 Exp $

	JFFS2 LOCKING DOCUMENTATION
	---------------------------

At least theoretically, JFFS2 does not require the Big Kernel Lock
(BKL), which was always helpfully obtained for it by Linux 2.4 VFS
code. It has its own locking, as described below.

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem semaphore, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocations (jffs2_reserve_space()) and freed
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector will obtain this right at the beginning of
jffs2_garbage_collect_pass() and release it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong. This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards. Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


	File Semaphore f->sem
	---------------------

This is the JFFS2-internal equivalent of the inode semaphore i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space. The
allocation may trigger garbage-collection, which may need to move a
node belonging to the inode which was locked in the first place by the
VFS. If the garbage collection code were to attempt to lock the i_sem
of the inode from which it's garbage-collecting a physical node, this
would lead to deadlock, unless we played games with unlocking the
i_sem before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
semaphore, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file semaphores in one thread.
	   No ordering rules have been made for doing so.


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file semaphore f->sem. But the erase
code may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object). So, the inocache_lock
has to be locked while walking the c->inocache_list hash buckets.

Note that f->sem guarantees that the corresponding jffs2_inode_cache
will not be removed. So, it is allowed to access it without locking
the inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	erase_completion_lock has to be acquired first.


	erase_free_sem
	--------------

This semaphore is only used by the erase code which frees obsolete
node references and the jffs2_garbage_collect_deletion_dirent()
function. The latter function on NAND flash must read _obsolete_ nodes
to determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked. Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.
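
As a rough userspace illustration of the erase_free_sem rule above
(this is not the kernel code), the sketch below uses a pthread mutex
as a stand-in for the semaphore. Everything prefixed demo_ is
hypothetical and exists only for this example: the "GC" side holds the
lock across a read that is allowed to sleep, and the "erase" side
takes the same lock before freeing the reference.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a jffs2_raw_node_ref; not the kernel struct. */
struct demo_ref {
	int obsolete;
	char payload[16];
};

/* Userspace stand-in for erase_free_sem: a lock that may be held
 * while sleeping, unlike the erase_completion_lock spinlock. */
static pthread_mutex_t erase_free_sem = PTHREAD_MUTEX_INITIALIZER;

/* The obsolete reference under consideration; NULL once freed. */
static struct demo_ref *obsolete_ref;

/* Stand-in for a flash read, which in the kernel may sleep. */
static void demo_flash_read(const struct demo_ref *ref, char *buf, size_t len)
{
	assert(len > 0);
	strncpy(buf, ref->payload, len - 1);
	buf[len - 1] = '\0';
}

/* GC side: inspect the obsolete node. Returns 1 if it was read,
 * 0 if the erase side had already freed it. */
static int demo_gc_inspect(char *buf, size_t len)
{
	int seen = 0;

	pthread_mutex_lock(&erase_free_sem);
	if (obsolete_ref) {
		demo_flash_read(obsolete_ref, buf, len);
		seen = 1;
	}
	pthread_mutex_unlock(&erase_free_sem);
	return seen;
}

/* Erase side: free the reference only with erase_free_sem held, so it
 * cannot disappear while demo_gc_inspect() is looking at it. */
static void demo_erase_free(void)
{
	pthread_mutex_lock(&erase_free_sem);
	free(obsolete_ref);
	obsolete_ref = NULL;
	pthread_mutex_unlock(&erase_free_sem);
}

/* Create the obsolete reference used by the example. */
static void demo_setup(const char *payload)
{
	struct demo_ref *r = calloc(1, sizeof(*r));

	if (!r)
		abort();
	r->obsolete = 1;
	strncpy(r->payload, payload, sizeof(r->payload) - 1);
	obsolete_ref = r;
}
```

Calling demo_setup(), then demo_gc_inspect(), then demo_erase_free()
exercises both sides; a later demo_gc_inspect() finds nothing left to
read.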


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.
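
Putting the ordering constraints from this document together, a write
path would take alloc_sem first, then f->sem, then wbuf_sem, and
release them in reverse. The following is only a userspace sketch of
that order using pthread mutexes: the demo_ functions are hypothetical
stand-ins (not jffs2_reserve_space() and friends), a single f_sem
stands for one inode's f->sem, and wbuf_sem is simplified to a plain
mutex although the real one is a read/write semaphore.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Userspace stand-ins for the three sleeping locks. */
static pthread_mutex_t alloc_sem = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t f_sem = PTHREAD_MUTEX_INITIALIZER;    /* one inode */
static pthread_mutex_t wbuf_sem = PTHREAD_MUTEX_INITIALIZER; /* simplified */

/* Toy write-behind buffer protected by wbuf_sem. */
static char wbuf[64];
static unsigned int wbuf_len;

/* Hypothetical stand-in for space reservation: obtains alloc_sem. */
static void demo_reserve_space(void)
{
	pthread_mutex_lock(&alloc_sem);
}

/* Hypothetical stand-in for completing the reservation: drops alloc_sem. */
static void demo_complete_reservation(void)
{
	pthread_mutex_unlock(&alloc_sem);
}

/* Append a "node" to the wbuf. Returns the number of bytes buffered,
 * or -1 if the data does not fit. */
static int demo_write_node(const char *data, unsigned int len)
{
	int ret = -1;

	assert(data != NULL);

	demo_reserve_space();            /* 1. alloc_sem, with no f->sem held */
	pthread_mutex_lock(&f_sem);      /* 2. f->sem, after space allocation */
	pthread_mutex_lock(&wbuf_sem);   /* 3. wbuf_sem last */

	if (len <= sizeof(wbuf) - wbuf_len) {
		memcpy(wbuf + wbuf_len, data, len);
		wbuf_len += len;
		ret = (int)len;
	}

	pthread_mutex_unlock(&wbuf_sem);
	/* The new node would be linked into the inode here, under f->sem
	 * and with alloc_sem still held. */
	pthread_mutex_unlock(&f_sem);
	demo_complete_reservation();     /* alloc_sem released last */
	return ret;
}
```

Because every path acquires the locks in the same order, two threads
running demo_write_node() concurrently cannot deadlock against each
other on these three locks.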