.. SPDX-License-Identifier: GPL-2.0
.. _xfs_self_describing_metadata:

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. The scalability
of the structures and indexes on disk and of the algorithms for iterating them
is adequate for supporting PB scale filesystems with billions of inodes;
however, it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, the more likely it is that the cause will be lost in the noise.  Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross linked block lists (e.g. sibling
pointers in a btree) end up with loops in them is the key to understanding what
went wrong, but it is impossible to tell what order the blocks were linked into
each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.
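
To make the integrity check concrete, the following is a minimal userspace
sketch of a bitwise CRC32c. It is illustrative only: the kernel uses its own
optimised and hardware accelerated implementation, and ``sketch_crc32c()`` is a
hypothetical name that does not exist in XFS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: bitwise CRC32c over a buffer.
 * 0x82F63B78 is the reflected form of the Castagnoli polynomial.
 */
static uint32_t
sketch_crc32c(
	const void	*buf,
	size_t		len)
{
	const uint8_t	*p = buf;
	uint32_t	crc = 0xFFFFFFFFu;	/* standard initial value */
	size_t		i;
	int		bit;

	assert(buf != NULL || len == 0);

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		/* Shift out one bit at a time, folding in the polynomial. */
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;				/* standard final inversion */
}
```

Any single bit flip in the buffer changes the result, which is the property the
metadata verification relies on. A table driven or instruction based variant
computes the same value much faster; the bitwise form is used here only because
it is short.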

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.
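
As a rough illustration of why both parts are needed, the sketch below checks a
location stamp against the filesystem it was read from and the address it was
read at. The structure and function names are hypothetical and the layout is
simplified; it is not the XFS on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical location stamp: which filesystem, and where in it. */
struct sketch_location {
	uint8_t		uuid[16];	/* filesystem identifier */
	uint64_t	blkno;		/* address the block was written for */
};

/*
 * Return true only if the block belongs to this filesystem and sits at
 * the address it was read from.
 */
static bool
sketch_location_ok(
	const struct sketch_location	*loc,
	const uint8_t			*fs_uuid,
	uint64_t			read_blkno)
{
	assert(loc != NULL && fs_uuid != NULL);

	/* Right block number, wrong filesystem (e.g. wrong LUN). */
	if (memcmp(loc->uuid, fs_uuid, sizeof(loc->uuid)) != 0)
		return false;

	/* Right filesystem, wrong block. */
	return loc->blkno == read_blkno;
}
```

A block number on its own would pass the second test even after a write to the
wrong device, which is the case the filesystem identifier exists to catch.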

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).
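
The misplaced write case can be sketched as follows. This assumes, purely for
illustration, that allocation groups are equal sized, contiguous runs of
``ag_blocks`` blocks in a linear block address space; the function name and the
address arithmetic are hypothetical simplifications, not the XFS
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: does the AG recorded as the owner of a free space
 * btree block match the AG the block was actually found in?
 */
static bool
sketch_ag_owner_ok(
	uint32_t	owner_agno,	/* AG number stored in the block */
	uint64_t	fsblock,	/* where the block was found */
	uint32_t	ag_blocks)	/* assumed size of every AG, in blocks */
{
	assert(ag_blocks != 0);

	return fsblock / ag_blocks == owner_agno;
}
```

With 1000 block AGs, a block found at address 2500 must claim AG 2 as its
owner; if it claims any other AG, it was written to the wrong place.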

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, if a metadata object is still referenced by its owner, we can
determine whether it is supposed to be free space or still allocated by
comparing when the free space btree block that contains the block was last
written with when the metadata object itself was last written.  If the free
space block is more recent than the object and the object's owner, then there is
a very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block gets the Log Sequence
Number (LSN) of the most recent transaction that modified it written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.
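
The comparisons described above might be sketched as follows. The sketch
assumes an LSN that packs a log cycle number into the upper 32 bits and a log
block number into the lower 32 bits, and it treats "same LSN" as a stand-in for
"same checkpoint"; both are assumptions made for illustration, and all function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: cycle in the upper 32 bits, log block in the lower 32. */
static uint32_t sketch_lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t sketch_lsn_block(uint64_t lsn) { return (uint32_t)lsn; }

/* Order two LSNs: negative if a is older, positive if newer, 0 if equal. */
static int
sketch_lsn_cmp(
	uint64_t	a,
	uint64_t	b)
{
	if (sketch_lsn_cycle(a) != sketch_lsn_cycle(b))
		return sketch_lsn_cycle(a) < sketch_lsn_cycle(b) ? -1 : 1;
	if (sketch_lsn_block(a) != sketch_lsn_block(b))
		return sketch_lsn_block(a) < sketch_lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * The free space heuristic: if the free space block was written after
 * both the object and its owner, the owner's reference is probably stale.
 */
static bool
sketch_probably_stale(
	uint64_t	freesp_lsn,
	uint64_t	object_lsn,
	uint64_t	owner_lsn)
{
	return sketch_lsn_cmp(freesp_lsn, object_lsn) > 0 &&
	       sketch_lsn_cmp(freesp_lsn, owner_lsn) > 0;
}

/* Were all of the corrupt blocks stamped with the same LSN? */
static bool
sketch_all_same_lsn(
	const uint64_t	*lsns,
	size_t		count)
{
	size_t		i;

	assert(lsns != NULL && count > 0);

	for (i = 1; i < count; i++)
		if (sketch_lsn_cmp(lsns[i], lsns[0]) != 0)
			return false;
	return true;
}
```

The result of such a comparison is only a hint for the person doing the
analysis; it narrows down where to look, it does not prove what happened.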

Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.
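
The ordering of the read and write steps described above can be modelled in
userspace as shown below. This is a sketch under stated assumptions, not the
kernel code: the header layout, magic number, block size and every ``sketch_``
name are hypothetical, fields are kept in host byte order for brevity, and the
CRC is computed with the CRC field itself treated as zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC	0x534b4554u	/* hypothetical magic number */
#define SKETCH_BLKSZ	512		/* hypothetical block size */

/* Host-endian for brevity; a real on-disk format fixes the byte order. */
struct sketch_hdr {
	uint32_t	magic;
	uint32_t	crc;
	uint64_t	blkno;
	uint64_t	lsn;
};

/* Bitwise CRC32c update (reflected Castagnoli polynomial). */
static uint32_t
sketch_crc_update(uint32_t crc, const uint8_t *p, size_t len)
{
	int	bit;

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC of the whole block with the crc field itself treated as zero. */
static uint32_t
sketch_block_crc(const uint8_t *blk)
{
	static const uint8_t	zero[sizeof(uint32_t)];
	const size_t		off = offsetof(struct sketch_hdr, crc);
	const size_t		end = off + sizeof(zero);
	uint32_t		crc = 0xFFFFFFFFu;

	crc = sketch_crc_update(crc, blk, off);
	crc = sketch_crc_update(crc, zero, sizeof(zero));
	crc = sketch_crc_update(crc, blk + end, SKETCH_BLKSZ - end);
	return ~crc;
}

/* Stateless content checks shared by the read and write paths. */
static bool
sketch_verify(const uint8_t *blk, uint64_t blkno)
{
	struct sketch_hdr	hdr;

	memcpy(&hdr, blk, sizeof(hdr));
	return hdr.magic == SKETCH_MAGIC && hdr.blkno == blkno;
}

/* Read side: magic first, then the CRC, then location and contents. */
static bool
sketch_read_verify(const uint8_t *blk, uint64_t blkno)
{
	struct sketch_hdr	hdr;

	assert(blk != NULL);
	memcpy(&hdr, blk, sizeof(hdr));
	if (hdr.magic != SKETCH_MAGIC)
		return false;
	if (hdr.crc != sketch_block_crc(blk))
		return false;
	return sketch_verify(blk, blkno);
}

/* Write side: verify contents first, then stamp the LSN, then the CRC. */
static bool
sketch_write_verify(uint8_t *blk, uint64_t blkno, uint64_t lsn)
{
	struct sketch_hdr	hdr;

	assert(blk != NULL);
	if (!sketch_verify(blk, blkno))
		return false;
	memcpy(&hdr, blk, sizeof(hdr));
	hdr.lsn = lsn;
	memcpy(blk, &hdr, sizeof(hdr));
	hdr.crc = sketch_block_crc(blk);	/* must cover the new LSN */
	memcpy(blk, &hdr, sizeof(hdr));
	return true;
}
```

The CRC is calculated last on the write side so that it covers the freshly
stamped LSN, and it is checked early on the read side so that the remaining
checks operate on contents that are known to be what was written.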

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
	    __be32  magic;		/* magic number */
	    __be32  crc;		/* CRC, not logged */
	    uuid_t  uuid;		/* filesystem identifier */
	    __be64  owner;		/* parent object */
	    __be64  blkno;		/* location on disk */
	    __be64  lsn;		/* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.
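
The space saving for short btree blocks mentioned in the list above can be
checked with a small C11 sketch. The two structures are hypothetical stand-ins
for the owner and location fields only, using host integer types rather than
the on-disk big endian types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Long form: owner and location as in the generic header. */
struct sketch_long_ident {
	uint64_t	owner;		/* e.g. an inode number */
	uint64_t	blkno;		/* location on disk */
};

/* Short form: AG number as owner, 32 bit block number as location. */
struct sketch_short_ident {
	uint32_t	owner;		/* allocation group number */
	uint32_t	blkno;		/* 32 bit block number */
};

/* Fail the build if the layouts are not the sizes the text assumes. */
static_assert(sizeof(struct sketch_long_ident) == 16, "long form is 16 bytes");
static_assert(sizeof(struct sketch_short_ident) == 8, "short form is 8 bytes");

/* Bytes saved per block by using the short form. */
static size_t
sketch_ident_saving(void)
{
	return sizeof(struct sketch_long_ident) -
	       sizeof(struct sketch_short_ident);
}
```

The compile time assertions are the useful part of this sketch: they turn an
assumption about structure size into something the compiler enforces.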

A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;

	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
					    XFS_FOO_CRC_OFF)) ||
		!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, -EFSCORRUPTED);
	    }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* the self describing fields only exist when CRCs are enabled */
	    if (xfs_sb_version_hascrc(&mp->m_sb)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) {
		    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order. A typical write verifier::

    static void
    xfs_foo_write_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_buf_log_item	*bip = bp->b_fspriv;

	    if (!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, -EFSCORRUPTED);
		    return;
	    }

	    if (!xfs_sb_version_hascrc(&mp->m_sb))
		    return;

	    if (bip) {
		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	    }
	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRCs and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that it contains inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.
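
The per-buffer identification step can be sketched as a loop over the expected
object offsets. The object size, the magic value and the function name below
are hypothetical, and the magic number is compared in host byte order for
brevity; per-object CRC and content checks are deliberately left out because
they happen later, one object at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OBJ_MAGIC	0x4f42u	/* hypothetical 16 bit magic number */
#define SKETCH_OBJ_SIZE		256	/* hypothetical fixed object size */

/*
 * Per-buffer check: confirm only that a magic number sits at the start
 * of every object slot in the buffer.
 */
static bool
sketch_buf_verify(
	const uint8_t	*buf,
	size_t		buflen)
{
	uint16_t	magic;
	size_t		off;

	assert(buf != NULL);

	/* The buffer must hold a whole number of objects. */
	if (buflen == 0 || buflen % SKETCH_OBJ_SIZE != 0)
		return false;

	for (off = 0; off < buflen; off += SKETCH_OBJ_SIZE) {
		memcpy(&magic, buf + off, sizeof(magic));
		if (magic != SKETCH_OBJ_MAGIC)
			return false;
	}
	return true;
}
```

Keeping the buffer level check this shallow means a single damaged object does
not need to be fully parsed before the rest of the buffer can be identified.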

The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.