1.. SPDX-License-Identifier: GPL-2.0-only 2 3================ 4Design of dm-vdo 5================ 6 7The dm-vdo (virtual data optimizer) target provides inline deduplication, 8compression, zero-block elimination, and thin provisioning. A dm-vdo target 9can be backed by up to 256TB of storage, and can present a logical size of 10up to 4PB. This target was originally developed at Permabit Technology 11Corp. starting in 2009. It was first released in 2013 and has been used in 12production environments ever since. It was made open-source in 2017 after 13Permabit was acquired by Red Hat. This document describes the design of 14dm-vdo. For usage, see vdo.rst in the same directory as this file. 15 16Because deduplication rates fall drastically as the block size increases, a 17vdo target has a maximum block size of 4K. However, it can achieve 18deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can 19reference a single 4K of actual storage. It can achieve compression rates 20of 14:1. All zero blocks consume no storage at all. 21 22Theory of Operation 23=================== 24 25The design of dm-vdo is based on the idea that deduplication is a two-part 26problem. The first is to recognize duplicate data. The second is to avoid 27storing multiple copies of those duplicates. Therefore, dm-vdo has two main 28parts: a deduplication index (called UDS) that is used to discover 29duplicate data, and a data store with a reference counted block map that 30maps from logical block addresses to the actual storage location of the 31data. 32 33Zones and Threading 34------------------- 35 36Due to the complexity of data optimization, the number of metadata 37structures involved in a single write operation to a vdo target is larger 38than most other targets. Furthermore, because vdo must operate on small 39block sizes in order to achieve good deduplication rates, acceptable 40performance can only be achieved through parallelism. Therefore, vdo's 41design attempts to be lock-free. 42 43Most of a vdo's main data structures are designed to be easily divided into 44"zones" such that any given bio must only access a single zone of any zoned 45structure. Safety with minimal locking is achieved by ensuring that during 46normal operation, each zone is assigned to a specific thread, and only that 47thread will access the portion of the data structure in that zone. 48Associated with each thread is a work queue. Each bio is associated with a 49request object (the "data_vio") which will be added to a work queue when 50the next phase of its operation requires access to the structures in the 51zone associated with that queue. 52 53Another way of thinking about this arrangement is that the work queue for 54each zone has an implicit lock on the structures it manages for all its 55operations, because vdo guarantees that no other thread will alter those 56structures. 57 58Although each structure is divided into zones, this division is not 59reflected in the on-disk representation of each data structure. Therefore, 60the number of zones for each structure, and hence the number of threads, 61can be reconfigured each time a vdo target is started. 62 63The Deduplication Index 64----------------------- 65 66In order to identify duplicate data efficiently, vdo was designed to 67leverage some common characteristics of duplicate data. From empirical 68observations, we gathered two key insights. The first is that in most data 69sets with significant amounts of duplicate data, the duplicates tend to 70have temporal locality. When a duplicate appears, it is more likely that 71other duplicates will be detected, and that those duplicates will have been 72written at about the same time. This is why the index keeps records in 73temporal order. The second insight is that new data is more likely to 74duplicate recent data than it is to duplicate older data and in general, 75there are diminishing returns to looking further back in time. Therefore, 76when the index is full, it should cull its oldest records to make space for 77new ones. Another important idea behind the design of the index is that the 78ultimate goal of deduplication is to reduce storage costs. Since there is a 79trade-off between the storage saved and the resources expended to achieve 80those savings, vdo does not attempt to find every last duplicate block. It 81is sufficient to find and eliminate most of the redundancy. 82 83Each block of data is hashed to produce a 16-byte block name. An index 84record consists of this block name paired with the presumed location of 85that data on the underlying storage. However, it is not possible to 86guarantee that the index is accurate. In the most common case, this occurs 87because it is too costly to update the index when a block is over-written 88or discarded. Doing so would require either storing the block name along 89with the blocks, which is difficult to do efficiently in block-based 90storage, or reading and rehashing each block before overwriting it. 91Inaccuracy can also result from a hash collision where two different blocks 92have the same name. In practice, this is extremely unlikely, but because 93vdo does not use a cryptographic hash, a malicious workload could be 94constructed. Because of these inaccuracies, vdo treats the locations in the 95index as hints, and reads each indicated block to verify that it is indeed 96a duplicate before sharing the existing block with a new one. 97 98Records are collected into groups called chapters. New records are added to 99the newest chapter, called the open chapter. This chapter is stored in a 100format optimized for adding and modifying records, and the content of the 101open chapter is not finalized until it runs out of space for new records. 102When the open chapter fills up, it is closed and a new open chapter is 103created to collect new records. 104 105Closing a chapter converts it to a different format which is optimized for 106reading. The records are written to a series of record pages based on the 107order in which they were received. This means that records with temporal 108locality should be on a small number of pages, reducing the I/O required to 109retrieve them. The chapter also compiles an index that indicates which 110record page contains any given name. This index means that a request for a 111name can determine exactly which record page may contain that record, 112without having to load the entire chapter from storage. This index uses 113only a subset of the block name as its key, so it cannot guarantee that an 114index entry refers to the desired block name. It can only guarantee that if 115there is a record for this name, it will be on the indicated page. Closed 116chapters are read-only structures and their contents are never altered in 117any way. 118 119Once enough records have been written to fill up all the available index 120space, the oldest chapter is removed to make space for new chapters. Any 121time a request finds a matching record in the index, that record is copied 122into the open chapter. This ensures that useful block names remain available 123in the index, while unreferenced block names are forgotten over time. 124 125In order to find records in older chapters, the index also maintains a 126higher level structure called the volume index, which contains entries 127mapping each block name to the chapter containing its newest record. This 128mapping is updated as records for the block name are copied or updated, 129ensuring that only the newest record for a given block name can be found. 130An older record for a block name will no longer be found even though it has 131not been deleted from its chapter. Like the chapter index, the volume index 132uses only a subset of the block name as its key and can not definitively 133say that a record exists for a name. It can only say which chapter would 134contain the record if a record exists. The volume index is stored entirely 135in memory and is saved to storage only when the vdo target is shut down. 136 137From the viewpoint of a request for a particular block name, it will first 138look up the name in the volume index. This search will either indicate that 139the name is new, or which chapter to search. If it returns a chapter, the 140request looks up its name in the chapter index. This will indicate either 141that the name is new, or which record page to search. Finally, if it is not 142new, the request will look for its name in the indicated record page. 143This process may require up to two page reads per request (one for the 144chapter index page and one for the request page). However, recently 145accessed pages are cached so that these page reads can be amortized across 146many block name requests. 147 148The volume index and the chapter indexes are implemented using a 149memory-efficient structure called a delta index. Instead of storing the 150entire block name (the key) for each entry, the entries are sorted by name 151and only the difference between adjacent keys (the delta) is stored. 152Because we expect the hashes to be randomly distributed, the size of the 153deltas follows an exponential distribution. Because of this distribution, 154the deltas are expressed using a Huffman code to take up even less space. 155The entire sorted list of keys is called a delta list. This structure 156allows the index to use many fewer bytes per entry than a traditional hash 157table, but it is slightly more expensive to look up entries, because a 158request must read every entry in a delta list to add up the deltas in order 159to find the record it needs. The delta index reduces this lookup cost by 160splitting its key space into many sub-lists, each starting at a fixed key 161value, so that each individual list is short. 162 163The default index size can hold 64 million records, corresponding to about 164256GB of data. This means that the index can identify duplicate data if the 165original data was written within the last 256GB of writes. This range is 166called the deduplication window. If new writes duplicate data that is older 167than that, the index will not be able to find it because the records of the 168older data have been removed. This means that if an application writes a 169200 GB file to a vdo target and then immediately writes it again, the two 170copies will deduplicate perfectly. Doing the same with a 500 GB file will 171result in no deduplication, because the beginning of the file will no 172longer be in the index by the time the second write begins (assuming there 173is no duplication within the file itself). 174 175If an application anticipates a data workload that will see useful 176deduplication beyond the 256GB threshold, vdo can be configured to use a 177larger index with a correspondingly larger deduplication window. (This 178configuration can only be set when the target is created, not altered 179later. It is important to consider the expected workload for a vdo target 180before configuring it.) There are two ways to do this. 181 182One way is to increase the memory size of the index, which also increases 183the amount of backing storage required. Doubling the size of the index will 184double the length of the deduplication window at the expense of doubling 185the storage size and the memory requirements. 186 187The other option is to enable sparse indexing. Sparse indexing increases 188the deduplication window by a factor of 10, at the expense of also 189increasing the storage size by a factor of 10. However with sparse 190indexing, the memory requirements do not increase. The trade-off is 191slightly more computation per request and a slight decrease in the amount 192of deduplication detected. For most workloads with significant amounts of 193duplicate data, sparse indexing will detect 97-99% of the deduplication 194that a standard index will detect. 195 196The vio and data_vio Structures 197------------------------------- 198 199A vio (short for Vdo I/O) is conceptually similar to a bio, with additional 200fields and data to track vdo-specific information. A struct vio maintains a 201pointer to a bio but also tracks other fields specific to the operation of 202vdo. The vio is kept separate from its related bio because there are many 203circumstances where vdo completes the bio but must continue to do work 204related to deduplication or compression. 205 206Metadata reads and writes, and other writes that originate within vdo, use 207a struct vio directly. Application reads and writes use a larger structure 208called a data_vio to track information about their progress. A struct 209data_vio contain a struct vio and also includes several other fields 210related to deduplication and other vdo features. The data_vio is the 211primary unit of application work in vdo. Each data_vio proceeds through a 212set of steps to handle the application data, after which it is reset and 213returned to a pool of data_vios for reuse. 214 215There is a fixed pool of 2048 data_vios. This number was chosen to bound 216the amount of work that is required to recover from a crash. In addition, 217benchmarks have indicated that increasing the size of the pool does not 218significantly improve performance. 219 220The Data Store 221-------------- 222 223The data store is implemented by three main data structures, all of which 224work in concert to reduce or amortize metadata updates across as many data 225writes as possible. 226 227*The Slab Depot* 228 229Most of the vdo volume belongs to the slab depot. The depot contains a 230collection of slabs. The slabs can be up to 32GB, and are divided into 231three sections. Most of a slab consists of a linear sequence of 4K blocks. 232These blocks are used either to store data, or to hold portions of the 233block map (see below). In addition to the data blocks, each slab has a set 234of reference counters, using 1 byte for each data block. Finally each slab 235has a journal. 236 237Reference updates are written to the slab journal. Slab journal blocks are 238written out either when they are full, or when the recovery journal 239requests they do so in order to allow the main recovery journal (see below) 240to free up space. The slab journal is used both to ensure that the main 241recovery journal can regularly free up space, and also to amortize the cost 242of updating individual reference blocks. The reference counters are kept in 243memory and are written out, a block at a time in oldest-dirtied-order, only 244when there is a need to reclaim slab journal space. The write operations 245are performed in the background as needed so they do not add latency to 246particular I/O operations. 247 248Each slab is independent of every other. They are assigned to "physical 249zones" in round-robin fashion. If there are P physical zones, then slab n 250is assigned to zone n mod P. 251 252The slab depot maintains an additional small data structure, the "slab 253summary," which is used to reduce the amount of work needed to come back 254online after a crash. The slab summary maintains an entry for each slab 255indicating whether or not the slab has ever been used, whether all of its 256reference count updates have been persisted to storage, and approximately 257how full it is. During recovery, each physical zone will attempt to recover 258at least one slab, stopping whenever it has recovered a slab which has some 259free blocks. Once each zone has some space, or has determined that none is 260available, the target can resume normal operation in a degraded mode. Read 261and write requests can be serviced, perhaps with degraded performance, 262while the remainder of the dirty slabs are recovered. 263 264*The Block Map* 265 266The block map contains the logical to physical mapping. It can be thought 267of as an array with one entry per logical address. Each entry is 5 bytes, 26836 bits of which contain the physical block number which holds the data for 269the given logical address. The other 4 bits are used to indicate the nature 270of the mapping. Of the 16 possible states, one represents a logical address 271which is unmapped (i.e. it has never been written, or has been discarded), 272one represents an uncompressed block, and the other 14 states are used to 273indicate that the mapped data is compressed, and which of the compression 274slots in the compressed block contains the data for this logical address. 275 276In practice, the array of mapping entries is divided into "block map 277pages," each of which fits in a single 4K block. Each block map page 278consists of a header and 812 mapping entries. Each mapping page is actually 279a leaf of a radix tree which consists of block map pages at each level. 280There are 60 radix trees which are assigned to "logical zones" in round 281robin fashion. (If there are L logical zones, tree n will belong to zone n 282mod L.) At each level, the trees are interleaved, so logical addresses 2830-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so 284on. The interleaving is maintained all the way up to the 60 root nodes. 285Choosing 60 trees results in an evenly distributed number of trees per zone 286for a large number of possible logical zone counts. The storage for the 60 287tree roots is allocated at format time. All other block map pages are 288allocated out of the slabs as needed. This flexible allocation avoids the 289need to pre-allocate space for the entire set of logical mappings and also 290makes growing the logical size of a vdo relatively easy. 291 292In operation, the block map maintains two caches. It is prohibitive to keep 293the entire leaf level of the trees in memory, so each logical zone 294maintains its own cache of leaf pages. The size of this cache is 295configurable at target start time. The second cache is allocated at start 296time, and is large enough to hold all the non-leaf pages of the entire 297block map. This cache is populated as pages are needed. 298 299*The Recovery Journal* 300 301The recovery journal is used to amortize updates across the block map and 302slab depot. Each write request causes an entry to be made in the journal. 303Entries are either "data remappings" or "block map remappings." For a data 304remapping, the journal records the logical address affected and its old and 305new physical mappings. For a block map remapping, the journal records the 306block map page number and the physical block allocated for it. Block map 307pages are never reclaimed or repurposed, so the old mapping is always 0. 308 309Each journal entry is an intent record summarizing the metadata updates 310that are required for a data_vio. The recovery journal issues a flush 311before each journal block write to ensure that the physical data for the 312new block mappings in that block are stable on storage, and journal block 313writes are all issued with the FUA bit set to ensure the recovery journal 314entries themselves are stable. The journal entry and the data write it 315represents must be stable on disk before the other metadata structures may 316be updated to reflect the operation. These entries allow the vdo device to 317reconstruct the logical to physical mappings after an unexpected 318interruption such as a loss of power. 319 320*Write Path* 321 322All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon 323as vdo has done enough work to guarantee that it can complete the write 324eventually. Generally, the data for acknowledged but unflushed write I/O 325can be treated as though it is cached in memory. If an application 326requires data to be stable on storage, it must issue a flush or write the 327data with the FUA bit set like any other asynchronous I/O. Shutting down 328the vdo target will also flush any remaining I/O. 329 330Application write bios follow the steps outlined below. 331 3321. A data_vio is obtained from the data_vio pool and associated with the 333 application bio. If there are no data_vios available, the incoming bio 334 will block until a data_vio is available. This provides back pressure 335 to the application. The data_vio pool is protected by a spin lock. 336 337 The newly acquired data_vio is reset and the bio's data is copied into 338 the data_vio if it is a write and the data is not all zeroes. The data 339 must be copied because the application bio can be acknowledged before 340 the data_vio processing is complete, which means later processing steps 341 will no longer have access to the application bio. The application bio 342 may also be smaller than 4K, in which case the data_vio will have 343 already read the underlying block and the data is instead copied over 344 the relevant portion of the larger block. 345 3462. The data_vio places a claim (the "logical lock") on the logical address 347 of the bio. It is vital to prevent simultaneous modifications of the 348 same logical address, because deduplication involves sharing blocks. 349 This claim is implemented as an entry in a hashtable where the key is 350 the logical address and the value is a pointer to the data_vio 351 currently handling that address. 352 353 If a data_vio looks in the hashtable and finds that another data_vio is 354 already operating on that logical address, it waits until the previous 355 operation finishes. It also sends a message to inform the current 356 lock holder that it is waiting. Most notably, a new data_vio waiting 357 for a logical lock will flush the previous lock holder out of the 358 compression packer (step 8d) rather than allowing it to continue 359 waiting to be packed. 360 361 This stage requires the data_vio to get an implicit lock on the 362 appropriate logical zone to prevent concurrent modifications of the 363 hashtable. This implicit locking is handled by the zone divisions 364 described above. 365 3663. The data_vio traverses the block map tree to ensure that all the 367 necessary internal tree nodes have been allocated, by trying to find 368 the leaf page for its logical address. If any interior tree page is 369 missing, it is allocated at this time out of the same physical storage 370 pool used to store application data. 371 372 a. If any page-node in the tree has not yet been allocated, it must be 373 allocated before the write can continue. This step requires the 374 data_vio to lock the page-node that needs to be allocated. This 375 lock, like the logical block lock in step 2, is a hashtable entry 376 that causes other data_vios to wait for the allocation process to 377 complete. 378 379 The implicit logical zone lock is released while the allocation is 380 happening, in order to allow other operations in the same logical 381 zone to proceed. The details of allocation are the same as in 382 step 4. Once a new node has been allocated, that node is added to 383 the tree using a similar process to adding a new data block mapping. 384 The data_vio journals the intent to add the new node to the block 385 map tree (step 10), updates the reference count of the new block 386 (step 11), and reacquires the implicit logical zone lock to add the 387 new mapping to the parent tree node (step 12). Once the tree is 388 updated, the data_vio proceeds down the tree. Any other data_vios 389 waiting on this allocation also proceed. 390 391 b. In the steady-state case, the block map tree nodes will already be 392 allocated, so the data_vio just traverses the tree until it finds 393 the required leaf node. The location of the mapping (the "block map 394 slot") is recorded in the data_vio so that later steps do not need 395 to traverse the tree again. The data_vio then releases the implicit 396 logical zone lock. 397 3984. If the block is a zero block, skip to step 9. Otherwise, an attempt is 399 made to allocate a free data block. This allocation ensures that the 400 data_vio can write its data somewhere even if deduplication and 401 compression are not possible. This stage gets an implicit lock on a 402 physical zone to search for free space within that zone. 403 404 The data_vio will search each slab in a zone until it finds a free 405 block or decides there are none. If the first zone has no free space, 406 it will proceed to search the next physical zone by taking the implicit 407 lock for that zone and releasing the previous one until it finds a 408 free block or runs out of zones to search. The data_vio will acquire a 409 struct pbn_lock (the "physical block lock") on the free block. The 410 struct pbn_lock also has several fields to record the various kinds of 411 claims that data_vios can have on physical blocks. The pbn_lock is 412 added to a hashtable like the logical block locks in step 2. This 413 hashtable is also covered by the implicit physical zone lock. The 414 reference count of the free block is updated to prevent any other 415 data_vio from considering it free. The reference counters are a 416 sub-component of the slab and are thus also covered by the implicit 417 physical zone lock. 418 4195. If an allocation was obtained, the data_vio has all the resources it 420 needs to complete the write. The application bio can safely be 421 acknowledged at this point. The acknowledgment happens on a separate 422 thread to prevent the application callback from blocking other data_vio 423 operations. 424 425 If an allocation could not be obtained, the data_vio continues to 426 attempt to deduplicate or compress the data, but the bio is not 427 acknowledged because the vdo device may be out of space. 428 4296. At this point vdo must determine where to store the application data. 430 The data_vio's data is hashed and the hash (the "record name") is 431 recorded in the data_vio. 432 4337. The data_vio reserves or joins a struct hash_lock, which manages all of 434 the data_vios currently writing the same data. Active hash locks are 435 tracked in a hashtable similar to the way logical block locks are 436 tracked in step 2. This hashtable is covered by the implicit lock on 437 the hash zone. 438 439 If there is no existing hash lock for this data_vio's record_name, the 440 data_vio obtains a hash lock from the pool, adds it to the hashtable, 441 and sets itself as the new hash lock's "agent." The hash_lock pool is 442 also covered by the implicit hash zone lock. The hash lock agent will 443 do all the work to decide where the application data will be 444 written. If a hash lock for the data_vio's record_name already exists, 445 and the data_vio's data is the same as the agent's data, the new 446 data_vio will wait for the agent to complete its work and then share 447 its result. 448 449 In the rare case that a hash lock exists for the data_vio's hash but 450 the data does not match the hash lock's agent, the data_vio skips to 451 step 8h and attempts to write its data directly. This can happen if two 452 different data blocks produce the same hash, for example. 453 4548. The hash lock agent attempts to deduplicate or compress its data with 455 the following steps. 456 457 a. The agent initializes and sends its embedded deduplication request 458 (struct uds_request) to the deduplication index. This does not 459 require the data_vio to get any locks because the index components 460 manage their own locking. The data_vio waits until it either gets a 461 response from the index or times out. 462 463 b. If the deduplication index returns advice, the data_vio attempts to 464 obtain a physical block lock on the indicated physical address, in 465 order to read the data and verify that it is the same as the 466 data_vio's data, and that it can accept more references. If the 467 physical address is already locked by another data_vio, the data at 468 that address may soon be overwritten so it is not safe to use the 469 address for deduplication. 470 471 c. If the data matches and the physical block can add references, the 472 agent and any other data_vios waiting on it will record this 473 physical block as their new physical address and proceed to step 9 474 to record their new mapping. If there are more data_vios in the hash 475 lock than there are references available, one of the remaining 476 data_vios becomes the new agent and continues to step 8d as if no 477 valid advice was returned. 478 479 d. If no usable duplicate block was found, the agent first checks that 480 it has an allocated physical block (from step 3) that it can write 481 to. If the agent does not have an allocation, some other data_vio in 482 the hash lock that does have an allocation takes over as agent. If 483 none of the data_vios have an allocated physical block, these writes 484 are out of space, so they proceed to step 13 for cleanup. 485 486 e. The agent attempts to compress its data. If the data does not 487 compress, the data_vio will continue to step 8h to write its data 488 directly. 489 490 If the compressed size is small enough, the agent will release the 491 implicit hash zone lock and go to the packer (struct packer) where 492 it will be placed in a bin (struct packer_bin) along with other 493 data_vios. All compression operations require the implicit lock on 494 the packer zone. 495 496 The packer can combine up to 14 compressed blocks in a single 4k 497 data block. Compression is only helpful if vdo can pack at least 2 498 data_vios into a single data block. This means that a data_vio may 499 wait in the packer for an arbitrarily long time for other data_vios 500 to fill out the compressed block. There is a mechanism for vdo to 501 evict waiting data_vios when continuing to wait would cause 502 problems. Circumstances causing an eviction include an application 503 flush, device shutdown, or a subsequent data_vio trying to overwrite 504 the same logical block address. A data_vio may also be evicted from 505 the packer if it cannot be paired with any other compressed block 506 before more compressible blocks need to use its bin. An evicted 507 data_vio will proceed to step 8h to write its data directly. 508 509 f. If the agent fills a packer bin, either because all 14 of its slots 510 are used or because it has no remaining space, it is written out 511 using the allocated physical block from one of its data_vios. Step 512 8d has already ensured that an allocation is available. 513 514 g. Each data_vio sets the compressed block as its new physical address. 515 The data_vio obtains an implicit lock on the physical zone and 516 acquires the struct pbn_lock for the compressed block, which is 517 modified to be a shared lock. Then it releases the implicit physical 518 zone lock and proceeds to step 8i. 519 520 h. Any data_vio evicted from the packer will have an allocation from 521 step 3. It will write its data to that allocated physical block. 522 523 i. After the data is written, if the data_vio is the agent of a hash 524 lock, it will reacquire the implicit hash zone lock and share its 525 physical address with as many other data_vios in the hash lock as 526 possible. Each data_vio will then proceed to step 9 to record its 527 new mapping. 528 529 j. If the agent actually wrote new data (whether compressed or not), 530 the deduplication index is updated to reflect the location of the 531 new data. The agent then releases the implicit hash zone lock. 532 5339. The data_vio determines the previous mapping of the logical address. 534 There is a cache for block map leaf pages (the "block map cache"), 535 because there are usually too many block map leaf nodes to store 536 entirely in memory. If the desired leaf page is not in the cache, the 537 data_vio will reserve a slot in the cache and load the desired page 538 into it, possibly evicting an older cached page. The data_vio then 539 finds the current physical address for this logical address (the "old 540 physical mapping"), if any, and records it. This step requires a lock 541 on the block map cache structures, covered by the implicit logical zone 542 lock. 543 54410. The data_vio makes an entry in the recovery journal containing the 545 logical block address, the old physical mapping, and the new physical 546 mapping. Making this journal entry requires holding the implicit 547 recovery journal lock. The data_vio will wait in the journal until all 548 recovery blocks up to the one containing its entry have been written 549 and flushed to ensure the transaction is stable on storage. 550 55111. Once the recovery journal entry is stable, the data_vio makes two slab 552 journal entries: an increment entry for the new mapping, and a 553 decrement entry for the old mapping. These two operations each require 554 holding a lock on the affected physical slab, covered by its implicit 555 physical zone lock. For correctness during recovery, the slab journal 556 entries in any given slab journal must be in the same order as the 557 corresponding recovery journal entries. Therefore, if the two entries 558 are in different zones, they are made concurrently, and if they are in 559 the same zone, the increment is always made before the decrement in 560 order to avoid underflow. After each slab journal entry is made in 561 memory, the associated reference count is also updated in memory. 562 56312. Once both of the reference count updates are done, the data_vio 564 acquires the implicit logical zone lock and updates the 565 logical-to-physical mapping in the block map to point to the new 566 physical block. At this point the write operation is complete. 567 56813. If the data_vio has a hash lock, it acquires the implicit hash zone 569 lock and releases its hash lock to the pool. 570 571 The data_vio then acquires the implicit physical zone lock and releases 572 the struct pbn_lock it holds for its allocated block. If it had an 573 allocation that it did not use, it also sets the reference count for 574 that block back to zero to free it for use by subsequent data_vios. 575 576 The data_vio then acquires the implicit logical zone lock and releases 577 the logical block lock acquired in step 2. 578 579 The application bio is then acknowledged if it has not previously been 580 acknowledged, and the data_vio is returned to the pool. 581 582*Read Path* 583 584An application read bio follows a much simpler set of steps. It does steps 5851 and 2 in the write path to obtain a data_vio and lock its logical 586address. If there is already a write data_vio in progress for that logical 587address that is guaranteed to complete, the read data_vio will copy the 588data from the write data_vio and return it. Otherwise, it will look up the 589logical-to-physical mapping by traversing the block map tree as in step 3, 590and then read and possibly decompress the indicated data at the indicated 591physical block address. A read data_vio will not allocate block map tree 592nodes if they are missing. If the interior block map nodes do not exist 593yet, the logical block map address must still be unmapped and the read 594data_vio will return all zeroes. A read data_vio handles cleanup and 595acknowledgment as in step 13, although it only needs to release the logical 596lock and return itself to the pool. 597 598*Small Writes* 599 600All storage within vdo is managed as 4KB blocks, but it can accept writes 601as small as 512 bytes. Processing a write that is smaller than 4K requires 602a read-modify-write operation that reads the relevant 4K block, copies the 603new data over the approriate sectors of the block, and then launches a 604write operation for the modified data block. The read and write stages of 605this operation are nearly identical to the normal read and write 606operations, and a single data_vio is used throughout this operation. 607 608*Recovery* 609 610When a vdo is restarted after a crash, it will attempt to recover from the 611recovery journal. During the pre-resume phase of the next start, the 612recovery journal is read. The increment portion of valid entries are played 613into the block map. Next, valid entries are played, in order as required, 614into the slab journals. Finally, each physical zone attempts to replay at 615least one slab journal to reconstruct the reference counts of one slab. 616Once each zone has some free space (or has determined that it has none), 617the vdo comes back online, while the remainder of the slab journals are 618used to reconstruct the rest of the reference counts in the background. 619 620*Read-only Rebuild* 621 622If a vdo encounters an unrecoverable error, it will enter read-only mode. 623This mode indicates that some previously acknowledged data may have been 624lost. The vdo may be instructed to rebuild as best it can in order to 625return to a writable state. However, this is never done automatically due 626to the possibility that data has been lost. During a read-only rebuild, the 627block map is recovered from the recovery journal as before. However, the 628reference counts are not rebuilt from the slab journals. Instead, the 629reference counts are zeroed, the entire block map is traversed, and the 630reference counts are updated from the block mappings. While this may lose 631some data, it ensures that the block map and reference counts are 632consistent with each other. This allows vdo to resume normal operation and 633accept further writes. 634