==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order.  Some
protection schemes also ensure that the I/O is written to the right
place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.  But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.  The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected.  This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector.  The data + integrity metadata is stored
in 520 byte sectors on disk.  Data + IMD are interleaved when
transferred between the controller and target.  The T13 proposal is
similar.
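
The layout above can be sketched in userspace C.  This is only an
illustration under stated assumptions: the field split (16-bit guard
tag, 16-bit application tag, 32-bit reference tag, all big-endian)
follows the T10 DIF tuple format, while ``struct pi_tuple``,
``pi_pack()`` and ``sector_append_pi()`` are hypothetical names that
do not exist in the kernel::

    #include <stdint.h>
    #include <string.h>

    #define DATA_SIZE 512   /* user data bytes per sector */
    #define PI_SIZE   8     /* protection information bytes per sector */

    /* Hypothetical host-order view of one protection information tuple. */
    struct pi_tuple {
        uint16_t guard_tag; /* checksum of the sector data */
        uint16_t app_tag;   /* opaque tag owned by the block device owner */
        uint32_t ref_tag;   /* incrementing counter (low 32 LBA bits in Type 1) */
    };

    /* Serialize a tuple into 8 big-endian bytes. */
    static void pi_pack(const struct pi_tuple *t, uint8_t out[PI_SIZE])
    {
        out[0] = (uint8_t)(t->guard_tag >> 8);
        out[1] = (uint8_t)t->guard_tag;
        out[2] = (uint8_t)(t->app_tag >> 8);
        out[3] = (uint8_t)t->app_tag;
        out[4] = (uint8_t)(t->ref_tag >> 24);
        out[5] = (uint8_t)(t->ref_tag >> 16);
        out[6] = (uint8_t)(t->ref_tag >> 8);
        out[7] = (uint8_t)t->ref_tag;
    }

    /* Build one 520-byte unit: 512 data bytes followed by the tuple. */
    static void sector_append_pi(const uint8_t data[DATA_SIZE],
                                 const struct pi_tuple *t,
                                 uint8_t out[DATA_SIZE + PI_SIZE])
    {
        memcpy(out, data, DATA_SIZE);
        pi_pack(t, out + DATA_SIZE);
    }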

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read.  This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
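
What the controller does with the two buffers can be modelled in a few
lines of userspace C.  The sketch below assumes 512-byte sectors with
8 bytes of protection information each; ``interleave()`` and
``split()`` are hypothetical helper names, and a real controller
performs this step in hardware or firmware::

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define DATA_SIZE 512   /* user data bytes per sector */
    #define PI_SIZE   8     /* protection information bytes per sector */
    #define WIRE_SIZE (DATA_SIZE + PI_SIZE)

    /* Write direction: merge separate data and PI buffers into 520-byte units. */
    static void interleave(const uint8_t *data, const uint8_t *pi,
                           uint8_t *wire, size_t nr_sectors)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            uint8_t *dst = wire + i * WIRE_SIZE;

            memcpy(dst, data + i * DATA_SIZE, DATA_SIZE);
            memcpy(dst + DATA_SIZE, pi + i * PI_SIZE, PI_SIZE);
        }
    }

    /* Read direction: split 520-byte units back into data and PI buffers. */
    static void split(const uint8_t *wire, uint8_t *data, uint8_t *pi,
                      size_t nr_sectors)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            const uint8_t *src = wire + i * WIRE_SIZE;

            memcpy(data + i * DATA_SIZE, src, DATA_SIZE);
            memcpy(pi + i * PI_SIZE, src + DATA_SIZE, PI_SIZE);
        }
    }

Because the data buffer keeps its plain 512-byte stride on the host
side, it can be used as is, without repacking.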

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software.  Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system.  Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa.  This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the data
buffers and the integrity metadata.  These two distinct buffers must
match up for an I/O to complete.
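
To make the cost difference concrete, here is a minimal userspace
sketch of both checksums.  It assumes the CRC uses the T10 DIF
polynomial 0x8BB7 with a zero initial value, and that the IP checksum
is the RFC 1071 Internet checksum.  The function names are
hypothetical, and the bit-by-bit CRC loop is written for clarity
rather than speed::

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-16, polynomial 0x8BB7, initial value 0, no reflection. */
    static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
    {
        uint16_t crc = 0;

        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)(buf[i] << 8);
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 0x8000)
                    crc = (uint16_t)((crc << 1) ^ 0x8BB7);
                else
                    crc = (uint16_t)(crc << 1);
            }
        }
        return crc;
    }

    /*
     * Internet checksum (RFC 1071): one's complement of the one's
     * complement sum of the buffer taken as 16-bit big-endian words.
     */
    static uint16_t ip_checksum(const uint8_t *buf, size_t len)
    {
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
            sum += (uint64_t)((buf[i] << 8) | buf[i + 1]);
        if (i < len)
            sum += (uint64_t)buf[i] << 8;   /* pad odd byte with zero */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

The CRC needs eight shift and conditional XOR steps per byte in this
form, whereas the IP checksum is a single addition per 16-bit word,
which is why it is so much cheaper to compute on the host.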

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device.  However, at the same time this is also the biggest
disadvantage.  It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing.  The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application.  As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace for
user data.  Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as it sees fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.
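
The interleaving can be pictured as spreading one 16-byte tag buffer
across the eight tuples that cover a 4KB block.  The following
userspace sketch assumes 8-byte tuples with the 16-bit application tag
at byte offset 2 (the T10 DIF layout); ``set_block_tags()`` and
``get_block_tags()`` are hypothetical helpers, not block layer
functions::

    #include <stddef.h>
    #include <stdint.h>

    #define PI_SIZE      8  /* bytes of protection information per sector */
    #define APP_TAG_OFF  2  /* assumed offset of the application tag */
    #define APP_TAG_SIZE 2  /* 16 bits of tag space per sector */

    /* Scatter nr_sectors * 2 tag bytes into the per-sector tuples. */
    static void set_block_tags(uint8_t *pi, size_t nr_sectors,
                               const uint8_t *tags)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            uint8_t *app = pi + i * PI_SIZE + APP_TAG_OFF;

            app[0] = tags[i * APP_TAG_SIZE];
            app[1] = tags[i * APP_TAG_SIZE + 1];
        }
    }

    /* Gather the tag bytes back out of the per-sector tuples. */
    static void get_block_tags(const uint8_t *pi, size_t nr_sectors,
                               uint8_t *tags)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            const uint8_t *app = pi + i * PI_SIZE + APP_TAG_OFF;

            tags[i * APP_TAG_SIZE] = app[0];
            tags[i * APP_TAG_SIZE + 1] = app[1];
        }
    }

With eight 512-byte sectors per 4KB block this yields the 8*16 bits
mentioned above.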

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio).  This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.


5. Block Layer Integrity API
============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata.  The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE.  A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached.  It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added.  It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      Complete bio with error if prepare failed for some reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack.  This implies that the pages added using this call
      will be modified during I/O!  The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
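
    The sector number remapping mentioned for bio_integrity_add_page()
    is the reason the attached pages get modified.  As a rough
    userspace model (not the kernel's implementation), assume 8-byte
    tuples with a big-endian 32-bit reference tag at byte offset 4.
    The helper names below are hypothetical::

      #include <stddef.h>
      #include <stdint.h>

      #define PI_SIZE 8  /* bytes of protection information per sector */
      #define REF_OFF 4  /* assumed offset of the reference tag */

      static uint32_t get_be32(const uint8_t *p)
      {
          return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8) | (uint32_t)p[3];
      }

      static void put_be32(uint8_t *p, uint32_t v)
      {
          p[0] = (uint8_t)(v >> 24);
          p[1] = (uint8_t)(v >> 16);
          p[2] = (uint8_t)(v >> 8);
          p[3] = (uint8_t)v;
      }

      /*
       * Rewrite reference tags from the start sector seen by the
       * submitter (virt_start) to the one seen by the device
       * (phys_start).  A tuple whose tag does not hold the expected
       * value is left untouched so the mismatch stays detectable.
       * Returns the number of tuples that were remapped.
       */
      static size_t remap_ref_tags(uint8_t *pi, size_t nr_sectors,
                                   uint32_t virt_start, uint32_t phys_start)
      {
          size_t remapped = 0;

          for (size_t i = 0; i < nr_sectors; i++) {
              uint8_t *ref = pi + i * PI_SIZE + REF_OFF;

              if (get_be32(ref) == virt_start + (uint32_t)i) {
                  put_be32(ref, phys_start + (uint32_t)i);
                  remapped++;
              }
          }
          return remapped;
      }

    In this model the first tuple must carry virt_start, mirroring the
    requirement that the first reference tag equals bip->bip_sector.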


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This is
      part of the userland API so choose it carefully and never change
      it.  The format is STANDARDSBODY-TYPE-VARIANT-CSUM,
      e.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector, i.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.
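
    How the generate and verify callbacks relate to each other can be
    shown with a userspace mock.  Everything below is hypothetical:
    ``struct mock_profile`` is NOT the kernel's struct blk_integrity,
    the callback signatures are invented for the example, and the
    tuple layout (CRC guard, zero application tag, reference tag) is
    an assumption modelled on DIF Type 1::

      #include <stddef.h>
      #include <stdint.h>

      #define SECTOR_SIZE 512
      #define TUPLE_SIZE  8

      /* Hypothetical stand-in for an integrity profile. */
      struct mock_profile {
          const char *name;
          void (*generate_fn)(const uint8_t *data, uint8_t *pi,
                              size_t nr_sectors, uint32_t start);
          int (*verify_fn)(const uint8_t *data, const uint8_t *pi,
                           size_t nr_sectors, uint32_t start);
          size_t tuple_size;
          size_t tag_size;
      };

      /* Bitwise CRC-16, polynomial 0x8BB7, initial value 0. */
      static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
      {
          uint16_t crc = 0;

          for (size_t i = 0; i < len; i++) {
              crc ^= (uint16_t)(buf[i] << 8);
              for (int bit = 0; bit < 8; bit++)
                  crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                       : (uint16_t)(crc << 1);
          }
          return crc;
      }

      /* One tuple per sector: guard = CRC, app tag = 0, ref = start + i. */
      static void mock_generate(const uint8_t *data, uint8_t *pi,
                                size_t nr_sectors, uint32_t start)
      {
          for (size_t i = 0; i < nr_sectors; i++) {
              uint8_t *t = pi + i * TUPLE_SIZE;
              uint16_t crc = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
              uint32_t ref = start + (uint32_t)i;

              t[0] = (uint8_t)(crc >> 8);
              t[1] = (uint8_t)crc;
              t[2] = 0;
              t[3] = 0;
              t[4] = (uint8_t)(ref >> 24);
              t[5] = (uint8_t)(ref >> 16);
              t[6] = (uint8_t)(ref >> 8);
              t[7] = (uint8_t)ref;
          }
      }

      /* Return 0 if all guard and reference tags match, -1 otherwise. */
      static int mock_verify(const uint8_t *data, const uint8_t *pi,
                             size_t nr_sectors, uint32_t start)
      {
          for (size_t i = 0; i < nr_sectors; i++) {
              const uint8_t *t = pi + i * TUPLE_SIZE;
              uint16_t crc = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
              uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);
              uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
                             ((uint32_t)t[6] << 8) | (uint32_t)t[7];

              if (guard != crc || ref != start + (uint32_t)i)
                  return -1;
          }
          return 0;
      }

      static const struct mock_profile mock = {
          .name        = "T10-DIF-TYPE1-CRC",
          .generate_fn = mock_generate,
          .verify_fn   = mock_verify,
          .tuple_size  = TUPLE_SIZE,
          .tag_size    = 0,   /* no application tag space in this mock */
      };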

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>