.. SPDX-License-Identifier: GPL-2.0

============================
Glock internal locking rules
============================

This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
has two main (internal) locks:

 1. A spinlock (gl_lockref.lock) which protects the internal state such
    as gl_state, gl_target and the list of holders (gl_holders)
 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
    threads from making calls to the DLM, etc. at the same time. If a
    thread takes this lock, it must then call run_queue (usually via the
    workqueue) when it releases it in order to ensure any pending tasks
    are completed.

The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
are queued.

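As a rough illustration of this queueing discipline, the following userspace
Python sketch models gl_holders as a plain list. The names ``Request``,
``COMPATIBLE`` and ``grant_in_order`` are hypothetical, and the compatibility
rule is a simplifying assumption; this is a toy model, not the kernel's
implementation.

.. code-block:: python

    from dataclasses import dataclass

    # Simplifying assumption: SH may share with SH, DF with DF, EX with nothing.
    COMPATIBLE = {("SH", "SH"), ("DF", "DF")}


    @dataclass
    class Request:
        owner: str
        mode: str           # "SH", "DF" or "EX"
        held: bool = False


    def grant_in_order(queue):
        """Grant waiters from the head of the queue, stopping at the first
        request that conflicts with the current holders."""
        held_modes = [r.mode for r in queue if r.held]
        for req in queue:
            if req.held:
                continue
            if all((mode, req.mode) in COMPATIBLE for mode in held_modes):
                req.held = True
                held_modes.append(req.mode)
            else:
                break       # strict ordering: everything behind must wait
        return [r.owner for r in queue if r.held]


    queue = [Request("a", "SH"), Request("b", "SH"),
             Request("c", "EX"), Request("d", "SH")]
    print(grant_in_order(queue))    # ['a', 'b']

Request "d" stays queued even though it is compatible with the current
holders, because "c" is ahead of it. The granted entries therefore remain
contiguous at the head of the list.
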
There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:

==========  =============  ==================================================
Glock mode  DLM lock mode  Description
==========  =============  ==================================================
UN          IV/NL          Unlocked (no DLM lock associated with glock) or NL
SH          PR             Protected read
DF          CW             Concurrent write
EX          EX             Exclusive
==========  =============  ==================================================

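The mapping can be written down as a small lookup table. The sketch below
is illustrative only: the dictionary and function names are hypothetical, UN
is represented by NL for simplicity, and the compatibility sets are the subset
of the standard DLM compatibility matrix that covers the four modes used here.

.. code-block:: python

    # Glock mode -> DLM lock mode (UN is modelled as NL in this sketch).
    GLOCK_TO_DLM = {"UN": "NL", "SH": "PR", "DF": "CW", "EX": "EX"}

    # For each DLM mode, the set of modes another node may hold concurrently.
    DLM_COMPATIBLE = {
        "NL": {"NL", "PR", "CW", "EX"},
        "PR": {"NL", "PR"},
        "CW": {"NL", "CW"},
        "EX": {"NL"},
    }


    def glock_modes_compatible(a, b):
        """True if glock modes a and b can be held at the same time."""
        return GLOCK_TO_DLM[b] in DLM_COMPATIBLE[GLOCK_TO_DLM[a]]


    print(glock_modes_compatible("SH", "SH"))   # True
    print(glock_modes_compatible("SH", "DF"))   # False
    print(glock_modes_compatible("DF", "DF"))   # True
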
Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. A glock is essentially a lock plus some routines which deal
with cache management. The following rules apply for the cache:

==========      ==============   ==========   ==========   ==============
Glock mode      Cache metadata   Cache data   Dirty data   Dirty metadata
==========      ==============   ==========   ==========   ==============
UN              No               No           No           No
DF              Yes              No           No           No
SH              Yes              Yes          No           No
EX              Yes              Yes          Yes          Yes
==========      ==============   ==========   ==========   ==============

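One way to read the table is to ask what a mode change takes away. The
following sketch encodes the table and lists the permissions that are
allowed in the old mode but not in the new one. The names ``CACHE_RULES``,
``FIELDS`` and ``lost_permissions`` are hypothetical, and the interpretation
(that anything lost has to be written back or invalidated before the change)
is an assumption made for illustration.

.. code-block:: python

    FIELDS = ("cache_metadata", "cache_data", "dirty_data", "dirty_metadata")

    # One row per glock mode, in the column order of FIELDS.
    CACHE_RULES = {
        "UN": (False, False, False, False),
        "DF": (True,  False, False, False),
        "SH": (True,  True,  False, False),
        "EX": (True,  True,  True,  True),
    }


    def lost_permissions(old, new):
        """Permissions that mode `old` allows but mode `new` does not."""
        return [field
                for field, was, now in zip(FIELDS, CACHE_RULES[old],
                                           CACHE_RULES[new])
                if was and not now]


    print(lost_permissions("EX", "SH"))   # ['dirty_data', 'dirty_metadata']
    print(lost_permissions("SH", "DF"))   # ['cache_data']
    print(lost_permissions("DF", "SH"))   # []
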
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
all the modes. For example, only inode glocks use the DF mode.

Table of glock operations and per-type constants:

==============     =============================================================
Field              Purpose
==============     =============================================================
go_sync            Called before remote state change (e.g. to sync dirty data)
go_xmote_bh        Called after remote state change (e.g. to refill cache)
go_inval           Called if remote state change requires invalidating the cache
go_instantiate     Called when a glock has been acquired
go_held            Called every time a glock holder is acquired
go_dump            Called to print content of object for debugfs file, or on
                   error to dump glock to the log.
go_callback        Called if the DLM sends a callback to drop this lock
go_unlocked        Called when a glock is unlocked (dlm_unlock())
go_type            The type of the glock, ``LM_TYPE_*``
go_flags           GLOF_ASPACE is set if the glock has an address space
                   associated with it
==============     =============================================================

The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
tends to show up most with shared mmapped files which are being written
to by multiple nodes. Delaying the demotion in response to a
remote callback gives the userspace program time to make
some progress before the pages are unmapped.

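The arithmetic behind the minimum hold time is simple. A minimal sketch,
assuming timestamps in nanoseconds and using hypothetical names (the unit
and the function are chosen for illustration and are not taken from the
GFS2 sources):

.. code-block:: python

    def demote_delay(now_ns, granted_ns, min_hold_ns):
        """How much longer to hold the lock before acting on a remote
        demote request. Zero means the request can be handled at once."""
        elapsed = now_ns - granted_ns
        return max(0, min_hold_ns - elapsed)


    # Granted at t=100, demote request seen at t=150, minimum hold of 200.
    print(demote_delay(150, 100, 200))   # 150
    # The same request arriving at t=500 is no longer delayed.
    print(demote_delay(500, 100, 200))   # 0
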
Eventually, we hope to make the glock "EX" mode locally shared such that any
local locking will be done with the i_mutex as required rather than via the
glock.

Locking rules for glock operations:

==============   ======================    =============================
Operation        GLF_LOCK bit lock held    gl_lockref.lock spinlock held
==============   ======================    =============================
go_sync          Yes                       No
go_xmote_bh      Yes                       No
go_inval         Yes                       No
go_instantiate   No                        No
go_held          No                        No
go_dump          Sometimes                 Yes
go_callback      Sometimes (N/A)           Yes
go_unlocked      Yes                       No
==============   ======================    =============================

.. Note::

   Operations must not drop either the bit lock or the spinlock
   if it is held on entry. go_dump and do_demote_ok must never block.
   Note that go_dump will only be called if the glock's state
   indicates that it is caching up-to-date data.

Glock locking order within GFS2:

 1. i_rwsem (if required)
 2. Rename glock (for rename only)
 3. Inode glock(s)
    (Parents before children, inodes at "same level" with same parent in
    lock number order)
 4. Rgrp glock(s) (for (de)allocation operations)
 5. Transaction glock (via gfs2_trans_begin) for non-read operations
 6. i_rw_mutex (if required)
 7. Page lock (always last, very important!)

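The ordering above can be treated as a rank per lock class. The sketch
below checks a sequence of acquisitions against that ranking. The class
names and the ``first_order_violation`` helper are hypothetical, and the
finer rule for ordering several inode glocks among themselves (parents
before children, then lock number) is deliberately not modelled.

.. code-block:: python

    LOCK_ORDER = ["i_rwsem", "rename glock", "inode glock", "rgrp glock",
                  "transaction glock", "i_rw_mutex", "page lock"]
    RANK = {name: rank for rank, name in enumerate(LOCK_ORDER)}


    def first_order_violation(sequence):
        """Return the first (held, requested) pair taken out of order,
        or None if the whole sequence respects LOCK_ORDER."""
        highest = None
        for name in sequence:
            if highest is not None and RANK[name] < RANK[highest]:
                return (highest, name)
            if highest is None or RANK[name] > RANK[highest]:
                highest = name
        return None


    ok = ["i_rwsem", "inode glock", "inode glock", "rgrp glock", "page lock"]
    bad = ["inode glock", "page lock", "rgrp glock"]
    print(first_order_violation(ok))    # None
    print(first_order_violation(bad))   # ('page lock', 'rgrp glock')
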
There are two glocks per inode. One deals with access to the inode
itself (locking order as above), and the other, known as the iopen
glock, is used in conjunction with the i_nlink field in the inode to
determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per-rgrp basis.
In general we prefer to lock local locks prior to cluster locks.

Glock Statistics
----------------

The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
super block stats are done on a per-CPU basis in order to
reduce the overhead of gathering them. They are also
further divided by glock type. All timings are in nanoseconds.

The same information is gathered for both the super block
and the glock statistics. The super
block timing statistics are used to provide default values for
the glock timing statistics, so that newly created glocks
should have, as far as possible, a sensible starting point.
The per-glock counters are initialised to zero when the
glock is created. The per-glock statistics are lost when
the glock is ejected from memory.

The statistics are divided into three pairs of mean and
variance, plus two counters. The mean/variance pairs are
smoothed exponential estimates and the algorithm used is
one which will be familiar to anyone used to calculating
round trip times in network code. See "TCP/IP Illustrated,
Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement",
p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards.
Unlike the TCP/IP Illustrated case, the mean and variance are
not scaled, but are in units of integer nanoseconds.

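A minimal sketch of such an estimator, in unscaled integer arithmetic, is
shown below. The gains of 1/8 for the mean and 1/4 for the variance are an
assumption taken from the formulation in Stevens and are consistent with the
"8 samples (or 4 for the variance)" figure quoted later in this section. The
function name is hypothetical and the code is not copied from the GFS2
sources. Note that the "variance" in this formulation is really a smoothed
mean deviation.

.. code-block:: python

    def update_stats(mean, var, sample):
        """Fold one sample (in nanoseconds) into a smoothed mean/variance pair.

        All values are plain integers; the shifts implement the assumed
        gains of 1/8 (mean) and 1/4 (variance).
        """
        delta = sample - mean
        mean += delta >> 3                  # mean moves 1/8 of the way
        var += (abs(delta) - var) >> 2      # variance moves 1/4 of the way
        return mean, var


    print(update_stats(0, 0, 800))          # (100, 200)
    print(update_stats(100, 200, 100))      # (100, 150)

    # Feeding a constant sample shows how the estimate creeps towards it.
    mean, var = 0, 0
    for _ in range(8):
        mean, var = update_stats(mean, var, 800)
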
The three pairs of mean/variance measure the following
things:

 1. DLM lock time (non-blocking requests)
 2. DLM lock time (blocking requests)
 3. Inter-request time (again to the DLM)

A non-blocking request is one which will complete right
away, whatever the state of the DLM lock in question. That
currently means any request where (a) the current state of
the lock is exclusive, i.e. a lock demotion, (b) the requested
state is either null or unlocked (again, a demotion), or (c) the
"try lock" flag is set. A blocking request covers all the other
lock requests.

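The three conditions translate directly into a predicate. In this sketch
the state names are the glock and DLM abbreviations used earlier in this
document, and the function and parameter names are hypothetical:

.. code-block:: python

    def is_non_blocking(current_state, requested_state, try_lock=False):
        """Classify a DLM request using conditions (a), (b) and (c) above."""
        return (current_state == "EX"               # (a) demotion from EX
                or requested_state in ("NL", "UN")  # (b) null or unlocked
                or try_lock)                        # (c) "try lock" flag


    print(is_non_blocking("EX", "SH"))                  # True
    print(is_non_blocking("SH", "UN"))                  # True
    print(is_non_blocking("UN", "EX", try_lock=True))   # True
    print(is_non_blocking("SH", "EX"))                  # False
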
There are two counters. The first is there primarily to show
how many lock requests have been made, and thus how much data
has gone into the mean/variance calculations. The other counter
counts the queuing of holders at the top layer of the glock
code. Hopefully that number will be a lot larger than the number
of DLM lock requests issued.

So why gather these statistics? There are several reasons
we'd like to get a better idea of these timings:

1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
   allocation (to base it on lock wait time, rather than blindly
   using a "try lock")

Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
into account after 8 samples (or 4 for the variance) and this
needs to be carefully considered when interpreting the
results.

Knowing both the time it takes a lock request to complete and
the average time between lock requests for a glock means we
can compute the total percentage of the time for which the
node is able to use a glock vs. time that the rest of the
cluster has its share. That will be very useful when setting
the lock min hold time.

Great care has been taken to ensure that we
measure exactly the quantities that we want, as accurately
as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.

Per sb stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/sbstats

Per glock stats can be found here::

    /sys/kernel/debug/gfs2/<fsname>/glstats

This assumes that debugfs is mounted on /sys/kernel/debug and
that <fsname> is replaced with the name of the gfs2 filesystem
in question.

The abbreviations used in the output are as follows:

=========  ================================================================
srtt       Smoothed round trip time for non-blocking DLM requests
srttvar    Variance estimate for srtt
srttb      Smoothed round trip time for (potentially) blocking DLM requests
srttvarb   Variance estimate for srttb
sirt       Smoothed inter-request time (for DLM requests)
sirtvar    Variance estimate for sirt
dlm        Number of DLM requests made (dcnt in glstats file)
queue      Number of glock requests queued (qcnt in glstats file)
=========  ================================================================

The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each CPU (one column per CPU). The glstats file contains
a set of these stats for each glock in a similar format to the glocks file, but
using the format mean/variance for each of the timing stats.

The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some additional information on each DLM
reply that is received:

======   =======================================
status   The status of the DLM request
flags    The DLM request flags
tdiff    The time taken by this specific request
======   =======================================

(remaining fields as per above list)