.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued.

There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

========== ====== =====================================================
Glock mode DLM    lock mode
========== ====== =====================================================
    UN     IV/NL  Unlocked (no DLM lock associated with glock) or NL
    SH     PR     (Protected read)
    DF     CW     (Concurrent write)
    EX     EX     (Exclusive)
========== ====== =====================================================

Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:

========== ============== ========== ========== ==============
Glock mode Cache Metadata Cache data Dirty Data Dirty Metadata
========== ============== ========== ========== ==============
    UN     No             No         No         No
    DF     Yes            No         No         No
    SH     Yes            Yes        No         No
    EX     Yes            Yes        Yes        Yes
========== ============== ========== ========== ==============

These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. Only inode glocks use the DF mode for example.
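
As an illustration only, the mode and cache tables above can be written down
as plain lookup tables. The following Python sketch is not kernel code: the
names ``GLOCK_TO_DLM``, ``CACHE_RULES`` and ``may_cache()`` are invented for
this example and do not exist in the GFS2 sources.

```python
# Lookup tables transcribed from the two tables above.
# All names in this block are hypothetical and exist only for illustration.

GLOCK_TO_DLM = {
    "UN": ("IV", "NL"),  # unlocked: no DLM lock at all, or a NL lock
    "SH": ("PR",),       # protected read
    "DF": ("CW",),       # concurrent write
    "EX": ("EX",),       # exclusive
}

CACHE_RULES = {
    "UN": {"metadata": False, "data": False,
           "dirty_data": False, "dirty_metadata": False},
    "DF": {"metadata": True, "data": False,
           "dirty_data": False, "dirty_metadata": False},
    "SH": {"metadata": True, "data": True,
           "dirty_data": False, "dirty_metadata": False},
    "EX": {"metadata": True, "data": True,
           "dirty_data": True, "dirty_metadata": True},
}


def may_cache(mode, what):
    """Return True if a glock held in `mode` may cache `what`."""
    return CACHE_RULES[mode][what]
```

For example, ``may_cache("DF", "data")`` is false, which matches the use of DF
for direct I/O: the metadata may stay cached while the data may not.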

Table of glock operations and per-type constants:

============== =============================================================
Field          Purpose
============== =============================================================
go_sync        Called before remote state change (e.g. to sync dirty data)
go_xmote_bh    Called after remote state change (e.g. to refill cache)
go_inval       Called if remote state change requires invalidating the cache
go_instantiate Called when a glock has been acquired
go_held        Called every time a glock holder is acquired
go_dump        Called to print content of object for debugfs file, or on
               error to dump glock to the log.
go_callback    Called if the DLM sends a callback to drop this lock
go_unlocked    Called when a glock is unlocked (dlm_unlock())
go_type        The type of the glock, ``LM_TYPE_*``
go_flags       GLOF_ASPACE is set, if the glock has an address space
               associated with it
============== =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.

Eventually, we hope to make the glock "EX" mode locally shared such that any
local locking will be done with the i_mutex as required rather than via the
glock.

Locking rules for glock operations:

============== ====================== =============================
Operation      GLF_LOCK bit lock held gl_lockref.lock spinlock held
============== ====================== =============================
go_sync        Yes                    No
go_xmote_bh    Yes                    No
go_inval       Yes                    No
go_instantiate No                     No
go_held        No                     No
go_dump        Sometimes              Yes
go_callback    Sometimes (N/A)        Yes
go_unlocked    Yes                    No
============== ====================== =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and do_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per-CPU basis in order to
try and reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

In the case of both the super block and glock statistics,
the same information is gathered in each case. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be very familiar to those used to the calculation
of round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any request where (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion), or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
is counting queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of dlm lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.
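
The smoothing described above can be sketched in a few lines. This is a
minimal illustration in Python, not the kernel implementation. It assumes a
gain of 1/8 for the mean and 1/4 for the variance, which is what the figures
of 8 and 4 samples quoted above suggest, and it uses the smoothed absolute
deviation as the variance estimate, in the style of the TCP round-trip-time
estimator. The function name ``update_stats`` is invented for this example.

```python
def update_stats(mean, var, sample):
    """Return the updated (mean, var) after one sample.

    All values are unscaled integers (nanoseconds in the text above).
    The gains of 1/8 and 1/4 are assumptions made for this sketch.
    """
    delta = sample - mean              # error of the current estimate
    mean += delta >> 3                 # move the mean by 1/8 of the error
    var += (abs(delta) - var) >> 2     # move the variance by 1/4 of its error
    return mean, var


def feed(samples, mean=0, var=0):
    """Apply update_stats() to each sample in turn."""
    for sample in samples:
        mean, var = update_stats(mean, var, sample)
    return mean, var
```

Starting from ``(0, 0)``, a single sample of 800 moves the estimates to
``(100, 200)``, so one outlier shifts the mean by only an eighth of its
distance. This is why a step change in the input takes several samples to
show up in the reported values.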

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.
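
For scripts that read these files, the two paths can be built from the
filesystem name. The helper below is a small Python sketch. The function name
``stats_paths`` and the filesystem name used in the usage note are made up for
the example, and the debugfs mount point is a parameter because it is only
assumed to be /sys/kernel/debug.

```python
import posixpath


def stats_paths(fsname, debugfs_root="/sys/kernel/debug"):
    """Return the sbstats and glstats paths for one gfs2 filesystem.

    `fsname` stands in for <fsname> in the paths above; `debugfs_root`
    is wherever debugfs happens to be mounted.
    """
    base = posixpath.join(debugfs_root, "gfs2", fsname)
    return {
        "sbstats": posixpath.join(base, "sbstats"),
        "glstats": posixpath.join(base, "glstats"),
    }
```

With a hypothetical filesystem name, ``stats_paths("mycluster:myfs")["glstats"]``
gives ``/sys/kernel/debug/gfs2/mycluster:myfs/glstats``.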

The abbreviations used in the output are as follows:

========= ================================================================
srtt      Smoothed round trip time for non-blocking dlm requests
srttvar   Variance estimate for srtt
srttb     Smoothed round trip time for (potentially) blocking dlm requests
srttvarb  Variance estimate for srttb
sirt      Smoothed inter-request time (for dlm requests)
sirtvar   Variance estimate for sirt
dlm       Number of dlm requests made (dcnt in glstats file)
queue     Number of glock requests queued (qcnt in glstats file)
========= ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each CPU (one column per CPU). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.
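
As a rough sketch of how a script might read the mean/variance format, the
Python below pulls the timing pairs and the two counters out of one glstats
line. Several things here are assumptions rather than documented facts: the
timing labels ``rtt``, ``rttb`` and ``irt``, the exact spacing of the tokens,
and the sample line itself, which is made up for illustration. Only the
``dcnt`` and ``qcnt`` names and the mean/variance layout come from the text
above. The function name ``parse_glstats_line`` is likewise invented.

```python
import re

# Timing statistics, assumed to appear as "label:mean/variance".
PAIR = re.compile(r"\b(rtt|rttb|irt):\s*(\d+)/(\d+)")
# Counters, assumed to appear as "label:" followed by a number.
COUNTER = re.compile(r"\b(dcnt|qcnt):\s*(\d+)")


def parse_glstats_line(line):
    """Return a dict of the statistics found in one glstats line."""
    stats = {}
    for label, mean, var in PAIR.findall(line):
        stats[label] = (int(mean), int(var))   # (mean, variance) in ns
    for label, count in COUNTER.findall(line):
        stats[label] = int(count)
    return stats


# Hypothetical line, not copied from a real system.
SAMPLE = "G: n:2/805f4 rtt:1200/300 rttb:5000/900 irt:800000/1000 dcnt: 12 qcnt: 340"
```

Calling ``parse_glstats_line(SAMPLE)`` yields one ``(mean, variance)`` tuple
per timing label plus the two integer counters. A real tool should check the
labels against the output of the running kernel before relying on them.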

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each dlm
reply that is received:

====== =======================================
status The status of the dlm request
flags  The dlm request flags
tdiff  The time taken by this specific request
====== =======================================

(remaining fields as per above list)