==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. Some
protection schemes also ensure that the I/O is written to the right
place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.
But these technologies are working in their own isolated domains or at
best between adjacent nodes in the I/O path. The interesting thing
about DIF and the other integrity extensions is that the protection
format is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.
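
As a rough illustration of the layout described above, the following
userspace C sketch models the 8-byte tuple and the interleaving of
separate data and protection buffers into 520 byte sectors. The field
split (a 2-byte guard tag, a 2-byte application tag and a 4-byte
reference tag) is an assumption based on the T10 DIF format as it is
commonly described, and the names dif_tuple, interleave() and split()
are hypothetical. This is not kernel code.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define SECTOR_SIZE 512u  /* data bytes per sector */
    #define PI_SIZE 8u        /* protection information bytes per sector */

    /* Assumed DIF tuple layout; multi-byte fields are big-endian on the wire. */
    struct dif_tuple {
        uint8_t guard_tag[2]; /* checksum of the sector data */
        uint8_t app_tag[2];   /* application tag */
        uint8_t ref_tag[4];   /* reference tag, the incrementing counter */
    };

    _Static_assert(sizeof(struct dif_tuple) == PI_SIZE, "tuple must be 8 bytes");

    /* WRITE direction: merge the two buffers into 520 byte sectors. */
    static void interleave(const uint8_t *data, const struct dif_tuple *pi,
                           size_t nr_sectors, uint8_t *wire)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            memcpy(wire, data + i * SECTOR_SIZE, SECTOR_SIZE);
            memcpy(wire + SECTOR_SIZE, &pi[i], PI_SIZE);
            wire += SECTOR_SIZE + PI_SIZE;
        }
    }

    /* READ direction: split 520 byte sectors back into the two buffers. */
    static void split(const uint8_t *wire, size_t nr_sectors,
                      uint8_t *data, struct dif_tuple *pi)
    {
        for (size_t i = 0; i < nr_sectors; i++) {
            memcpy(data + i * SECTOR_SIZE, wire, SECTOR_SIZE);
            memcpy(&pi[i], wire + SECTOR_SIZE, PI_SIZE);
            wire += SECTOR_SIZE + PI_SIZE;
        }
    }

    int main(void)
    {
        enum { NR = 8 }; /* eight 512 byte sectors, i.e. one 4KB block */
        static uint8_t data[NR * SECTOR_SIZE], data_back[NR * SECTOR_SIZE];
        static uint8_t wire[NR * (SECTOR_SIZE + PI_SIZE)];
        struct dif_tuple pi[NR], pi_back[NR];

        for (size_t i = 0; i < sizeof(data); i++)
            data[i] = (uint8_t)(i * 7);
        memset(pi, 0, sizeof(pi));
        for (size_t i = 0; i < NR; i++)
            pi[i].ref_tag[3] = (uint8_t)i; /* low byte of a counter */

        interleave(data, pi, NR, wire);
        split(wire, NR, data_back, pi_back);

        /* A round trip must return both buffers unchanged. */
        assert(memcmp(data, data_back, sizeof(data)) == 0);
        assert(memcmp(pi, pi_back, sizeof(pi)) == 0);
        printf("%d sectors occupy %zu bytes on the wire\n", NR, sizeof(wire));
        return 0;
    }

Keeping the two buffers apart on the host side and merging them only
at the wire boundary is the same division of labour the text assigns
to the controller.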

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata.
These two distinct buffers must match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions. As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as it sees fit.
Because the tag space is limited, the block interface allows tagging
bigger chunks by way of interleaving. This way, 8*16 bits of
information can be attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.


5.0 Block Layer Integrity API
=============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata. The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE. A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached. It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added. It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      Complete bio with error if prepare failed for some reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio. The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called. For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack. This implies that the pages added using this call
      will be modified during I/O! The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
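
    The remapping of sector numbers mentioned for
    bio_integrity_add_page() can be pictured with the userspace C
    sketch below. It is a model only: the tuple layout (with a 4-byte
    big-endian reference tag holding the low 32 bits of the sector
    number) is an assumption based on the T10 DIF format, and
    pi_tuple, seed_ref_tags() and remap_ref_tags() are hypothetical
    names rather than kernel interfaces. The sketch seeds the
    reference tags from the start sector the submitter sees and then
    rewrites them in place for the sector the device sees, which is
    why the attached pages change during I/O.

    .. code-block:: c

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Assumed 8-byte protection tuple; fields are big-endian. */
        struct pi_tuple {
            uint8_t guard_tag[2];
            uint8_t app_tag[2];
            uint8_t ref_tag[4];
        };

        static uint32_t load_be32(const uint8_t *p)
        {
            return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                   ((uint32_t)p[2] << 8) | (uint32_t)p[3];
        }

        static void store_be32(uint8_t *p, uint32_t v)
        {
            p[0] = (uint8_t)(v >> 24);
            p[1] = (uint8_t)(v >> 16);
            p[2] = (uint8_t)(v >> 8);
            p[3] = (uint8_t)v;
        }

        /* The first reference tag equals the start sector, then counts up. */
        static void seed_ref_tags(struct pi_tuple *pi, size_t n, uint64_t first)
        {
            for (size_t i = 0; i < n; i++)
                store_be32(pi[i].ref_tag, (uint32_t)(first + i));
        }

        /*
         * Rewrite tags that match the old numbering so that they match the
         * new one. Returns how many tuples were modified in place.
         */
        static size_t remap_ref_tags(struct pi_tuple *pi, size_t n,
                                     uint64_t from, uint64_t to)
        {
            size_t changed = 0;

            for (size_t i = 0; i < n; i++) {
                if (load_be32(pi[i].ref_tag) == (uint32_t)(from + i)) {
                    store_be32(pi[i].ref_tag, (uint32_t)(to + i));
                    changed++;
                }
            }
            return changed;
        }

        int main(void)
        {
            struct pi_tuple pi[4];

            memset(pi, 0, sizeof(pi));
            seed_ref_tags(pi, 4, 100);  /* sector 100 as seen by the submitter */

            /* Hypothetical partition offset of 2048 sectors. */
            size_t changed = remap_ref_tags(pi, 4, 100, 100 + 2048);

            assert(changed == 4);
            assert(load_be32(pi[0].ref_tag) == 2148);
            assert(load_be32(pi[3].ref_tag) == 2151);
            printf("first reference tag is now %u\n",
                   (unsigned)load_be32(pi[0].ref_tag));
            return 0;
        }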


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs. This is
      part of the userland API so choose it carefully and never change
      it. The format is standards body-type-variant.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector. I.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector. For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>