
.. SPDX-License-Identifier: GPL-2.0-only

Design of dm-vdo
================
The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corporation, was first released in 2013, and has been used in production
environments ever since. It was made open-source in 2017 after Permabit
was acquired by Red Hat. This document describes the design of dm-vdo. For
usage, see vdo.rst in the same directory as this file.
The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index that is used to discover potential duplicate
data, and a data store with a reference counted block map that maps from
logical block addresses to the actual storage location of the data.
Zones and Threading
-------------------
Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.
Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that at any
moment, each zone is assigned to exactly one thread, so that only that
thread will access the portion of the data structure in that zone.

The assignment of zones to threads is not reflected in the on-disk
representation of each data structure. Therefore, the number of zones for
each structure, and hence the number of threads, can be reconfigured each
time a vdo target is started.
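The zoning idea above can be sketched in a few lines. This is an
illustration, not kernel code: all names here are invented, and a simple
modulo partition stands in for vdo's actual zone assignment. The point is
that each zone's state has exactly one owning thread, so no per-item lock
is needed.

```python
# Illustrative sketch: partition work into zones, one owner thread per
# zone, so per-zone state needs no locking. Names are invented.
import queue
import threading

ZONES = 4

def zone_of(block_number: int) -> int:
    """Assign a block to a zone; any stable partition works."""
    return block_number % ZONES

class ZonedCounter:
    def __init__(self):
        self.queues = [queue.Queue() for _ in range(ZONES)]
        # No lock: each slot of counts is written by exactly one thread.
        self.counts = [0] * ZONES

    def worker(self, z):
        while True:
            item = self.queues[z].get()
            if item is None:          # sentinel: zone is done
                return
            self.counts[z] += 1       # only thread z touches counts[z]

    def run(self, blocks):
        threads = [threading.Thread(target=self.worker, args=(z,))
                   for z in range(ZONES)]
        for t in threads:
            t.start()
        for b in blocks:
            self.queues[zone_of(b)].put(b)   # route each block to its zone
        for q in self.queues:
            q.put(None)
        for t in threads:
            t.join()
        return self.counts
```

Because each zone has a single consumer, the increments need no atomic
operations; the queues are the only synchronization points.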
The Deduplication Index
-----------------------
In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality: when a duplicate appears, it is likely that other
duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it discards its oldest records to make room for
newer ones. Finding duplicates is a performance optimization rather than a
requirement for correctness, so missing an occasional duplicate is
acceptable; it is a trade-off between the storage saved and the resources
expended to achieve those savings.
Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate, primarily because it is too costly to
update the index when a block is over-written or discarded. Doing so would
require either storing the block name along with the blocks, which is
difficult to do efficiently in block-based storage, or reading and
rehashing each block before overwriting it. The index is therefore treated
as a hint, and the data at any location it returns is verified before being
used for deduplication.
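The block-name computation can be sketched as follows. The in-kernel
implementation uses a 128-bit non-cryptographic hash; in this illustration
blake2b truncated to a 16-byte digest stands in for it, since the only
properties the sketch needs are determinism and a fixed 16-byte output.

```python
# Illustrative: derive a 16-byte "block name" from a 4K block's contents.
# blake2b with a 16-byte digest stands in for the kernel's actual hash.
import hashlib

BLOCK_SIZE = 4096

def block_name(data: bytes) -> bytes:
    assert len(data) == BLOCK_SIZE
    return hashlib.blake2b(data, digest_size=16).digest()
```

Two blocks with identical contents always produce the same name, which is
what lets the index pair a name with a presumed storage location.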
Closed chapters are read-only structures and their contents are never
altered in any way.

The index as a whole maintains a volume index, mapping each block name to
the chapter containing its newest record. This mapping is updated as
records for a block name are copied or updated, ensuring that only the
newest record for any given block name will be found.
Because the volume index must hold an entry for every block name in the
deduplication window, it uses a memory-efficient structure called a delta
index. Instead of storing the full key for each entry, a delta index stores
each key as its difference (delta) from the preceding key. It keeps lookups
fast by splitting its key space into many sub-lists, each starting at a
fixed key value, so that any lookup only needs to scan one short run of
deltas.
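The delta-index idea can be shown with a toy model. This is a deliberate
simplification of the volume index, not its actual layout: keys live in a
small 16-bit toy key space, and each sub-list is decoded on every
operation, whereas the real structure packs deltas into bit fields and
edits them in place.

```python
# Toy delta index: sorted keys stored as small differences, with the key
# space split into sub-lists that each start at a fixed key value.
NUM_LISTS = 16
KEY_SPACE = 1 << 16  # toy 16-bit key space

class DeltaIndex:
    def __init__(self):
        self.width = KEY_SPACE // NUM_LISTS
        self.lists = [[] for _ in range(NUM_LISTS)]  # lists of deltas

    def _decode(self, idx):
        """Recover the keys of one sub-list by summing its deltas."""
        keys, total = [], idx * self.width  # sub-list's fixed starting key
        for delta in self.lists[idx]:
            total += delta
            keys.append(total)
        return keys

    def add(self, key):
        idx = key // self.width
        keys = sorted(self._decode(idx) + [key])
        prev, deltas = idx * self.width, []
        for k in keys:                 # re-encode as differences
            deltas.append(k - prev)
            prev = k
        self.lists[idx] = deltas

    def contains(self, key):
        return key in self._decode(key // self.width)
```

Because the deltas are small numbers, they can be stored in far fewer bits
than full keys, which is where the memory savings come from.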
A default-sized index can hold records for 256GB of data. This means that
the index can identify duplicate data if the original data was written
within the last 256GB of writes. This range is called the deduplication
window. If new writes duplicate data that is older than that, the index
will not find it, because the records of the older data have been removed.
This means that if an application writes a 200GB file to a vdo target and
then immediately writes it again, the two copies will deduplicate
perfectly, but writing a 500GB file twice in the same way would see little
deduplication, because the start of the file would have aged out of the
index before the second copy is written.

If an application anticipates a data workload that will see useful
deduplication beyond a 256GB window, vdo can be configured to use a larger
index with a correspondingly larger deduplication window.
indexing, the deduplication window can be made ten times larger while the
memory requirements do not increase. The trade-off is that a sparse index
is no longer guaranteed to find every duplicate in its window; for most
workloads with significant amounts of duplicate data, sparse indexing will
detect 97-99% of the deduplication that a standard ("dense") index would
detect.
The vio and data_vio Structures
-------------------------------
A vio is vdo's wrapper around a bio, adding fields and data to track
vdo-specific information. A struct vio maintains a pointer to a bio along
with the other state needed to process it. A struct data_vio tracks each
application read or write; it proceeds through a fixed set of steps to
handle the application data, after which it is reset and returned to a
pool of data_vios for reuse.
The Data Store
--------------
The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible: the block map, the slab depot, and the recovery
journal.
Most of the underlying storage is divided into fixed-size "slabs" of
blocks. These blocks are used either to store data, or to hold portions of
the block map (see below). In addition to the data blocks, each slab has a
set of reference counters, using 1 byte for each data block. Finally, each
slab has a journal which records all changes to the slab's reference
counts.
The reference counters are held in memory and are written out, a block at a
time in oldest-dirtied order, only when there is a need to reclaim slab
journal space. The slabs are assigned to "physical zones" in round-robin
fashion. If there are P physical zones, then slab n is assigned to zone
n mod P.
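The slab addressing described above amounts to two small computations:
which slab a physical block belongs to, and which zone owns that slab. The
sizes below are arbitrary example values, not vdo's actual defaults.

```python
# Sketch of slab addressing: physical blocks are grouped into fixed-size
# slabs, and slabs are assigned to physical zones round-robin, so slab n
# belongs to zone n mod P. Sizes are illustrative, not vdo defaults.
SLAB_BLOCKS = 1 << 15    # blocks per slab (example value)
PHYSICAL_ZONES = 3       # "P" in the text

def slab_of(pbn: int) -> int:
    """Which slab contains this physical block number."""
    return pbn // SLAB_BLOCKS

def zone_of_slab(slab: int) -> int:
    """Round-robin assignment of slabs to physical zones."""
    return slab % PHYSICAL_ZONES
```

The round-robin assignment spreads consecutive slabs evenly across zones,
which balances allocation work among the zone threads.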
The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash.
The block map contains the logical to physical mapping. It can be thought
of as an array with one entry for each logical address. Each entry is 5
bytes, 36 bits of which contain the physical block number which holds the
data for that logical address. The other 4 bits are used to indicate the
state of the mapping. Of the 16 possible states, one represents a logical
address which is unmapped (i.e. it has never been written or has been
discarded), one represents an uncompressed data block, and the remaining
states indicate that the mapped data is compressed, and which of the
compression slots in the compressed block contains the data for this
logical address.
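The 5-byte entry layout can be demonstrated by packing the two fields into
40 bits. The exact on-disk bit order and endianness are implementation
details not stated above; this sketch only shows that a 36-bit physical
block number and a 4-bit state fit exactly in 5 bytes.

```python
# Sketch of a 5-byte block map entry: 36 bits of physical block number
# plus a 4-bit mapping state. Bit order here is arbitrary.
UNMAPPED, UNCOMPRESSED = 0, 1  # example state encodings

def pack_entry(pbn: int, state: int) -> bytes:
    assert pbn < (1 << 36) and state < (1 << 4)
    return ((pbn << 4) | state).to_bytes(5, "little")

def unpack_entry(entry: bytes) -> tuple[int, int]:
    value = int.from_bytes(entry, "little")
    return value >> 4, value & 0xF
```

A 36-bit physical block number is what limits the physical address space,
while the 4-bit state distinguishes unmapped, uncompressed, and the
compressed-slot cases.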
In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree, with block map pages at each level. The logical
address space is striped across the trees: logical addresses 0-811 belong
to tree 0, logical addresses 812-1623 belong to tree 1, and so on, wrapping
back to tree 0 once every tree has a leaf.

The interior nodes of the trees are allocated lazily, as writes require
them. This means that vdo does not need to pre-allocate space for the
entire set of logical mappings and also makes growing the logical size of
a vdo relatively cheap.

The memory for the non-leaf portion of the trees is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map, so non-leaf nodes never need to be evicted from memory.
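The striping of leaf pages across trees reduces to simple arithmetic. The
812-entries-per-page figure comes from the text above; the tree count used
here is an assumed illustrative parameter, not something the text fixes.

```python
# Sketch of block map striping: consecutive 812-entry leaf pages are
# dealt out across the radix trees. TREE_COUNT is an assumed example
# value, not a figure taken from the text above.
ENTRIES_PER_PAGE = 812
TREE_COUNT = 60  # illustrative

def tree_and_slot(lba: int) -> tuple[int, int]:
    """Which tree holds this logical address, and which entry in its page."""
    page = lba // ENTRIES_PER_PAGE
    return page % TREE_COUNT, lba % ENTRIES_PER_PAGE
```

Striping by page rather than by single address keeps each 812-entry run of
consecutive logical addresses together on one leaf page.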
Each entry in the recovery journal is either a "data remapping" or a
"block map remapping." For a data remapping, the entry records the logical
address affected along with its old and new physical mappings. For a block
map remapping, the entry records the block map page being allocated and its
new physical location; block map pages are never reclaimed or repurposed,
so the old mapping is always 0.
vdo issues a flush before each journal block write to ensure that the
physical data referenced by the entries in that block is stable on storage,
and the journal block writes themselves are issued with the FUA bit set to
ensure the recovery journal entries themselves are stable. The journal
entry and the data write it protects must both be stable on disk before any
of the other metadata structures are updated to reflect the operation.
Like other asynchronous block devices with volatile caches, vdo only
guarantees that acknowledged data becomes stable eventually. Generally, the
data for acknowledged but unflushed write I/O may be lost after a crash. If
an application requires data to be stable on storage, it must issue a flush
or write the data with the FUA bit set like any other asynchronous I/O.
Shutting down the vdo target will also flush any remaining I/O.
The newly acquired data_vio is reset and the bio's data is copied into
the data_vio if it is a write and the data is not all zeroes. The data
must be copied because the application bio can be acknowledged before the
data_vio processing is complete, which means later processing steps will
no longer have access to the application bio. The application bio may also
be smaller than 4K, in which case the data_vio will have already read the
underlying block and the data is instead copied over the relevant portion
of the larger block.
Interior nodes of the block map tree are allocated out of the same physical
block pool used to store application data.

a. If any page-node in the tree has not yet been allocated, it must be
   allocated before the write can continue. This step requires the
   data_vio to lock the page-node that needs to be allocated. This lock,
   like the logical block lock, is a hashtable entry that causes other
   data_vios to wait for the allocation to complete.

   Once a data_vio has allocated a block for the page-node, it adds it to
   the tree using a similar process to adding a new data block mapping.
   The data_vio journals an entry recording the new mapping to the parent
   tree node (step 12). Once the tree is updated, the data_vio continues
   down the tree, and any data_vios waiting on the allocation can proceed
   as well.

b. In the steady-state case, the block map tree nodes will already be
   allocated, so the data_vio simply traverses the tree until it finds
   the required leaf node. The location of the mapping (the "block map
   slot") is recorded in the data_vio so that later steps do not need to
   repeat the traversal.
An attempt is then made to allocate a free data block. This allocation
ensures that the data_vio can write its data somewhere even if
deduplication and compression are not possible. Allocating a block updates
its reference count; the reference counters are a sub-component of the
slab and are thus also covered by the implicit physical zone lock.

If the data is all zeroes, there is no need to attempt to deduplicate or
compress the data, but the bio is not acknowledged until the block map has
been updated.
6. At this point vdo must determine where to store the application data.
   The data_vio's data is hashed and the hash (the "record name") is
   recorded in the data_vio.

7. The data_vio reserves or joins a struct hash_lock, which manages all of
   the data_vios currently writing the same data. Active hash locks are
   tracked in a hashtable, similar to the way logical block locks are
   tracked.

   If there is no existing hash lock for this record name, the data_vio
   obtains a hash lock and sets itself as the lock's "agent." The agent
   will do all the work to decide where the application data will be
   written. If a hash lock for the record name already exists
   and the data_vio's data is the same as the agent's data, the new
   data_vio will wait for the agent to complete its work and then share
   its result.

   In the rare case that a hash lock exists for the record name but
   the data does not match the hash lock's agent, the data_vio skips to
   step 8h and attempts to write its data directly. This can happen if two
   different data blocks produce the same hash, for example.
8. The hash lock agent attempts to deduplicate or compress its data with
   the following steps.

   a. The agent queries the deduplication index using the record name,
      which may return the presumed location of existing data with the
      same name.

   b. If the index returns a location, the agent locks that physical
      address in order to read the data and verify that it is the same as
      the data_vio's data, and that it can accept more references. If the
      physical address is already locked by another data_vio, the data at
      that address may soon be overwritten, so it is not safe to use for
      deduplication.

   c. If the data matches and the physical block can add references, the
      agent and any other data_vios waiting on it will record this
      physical block as their new physical address and proceed to step 9
      to record their new mapping. If there are more data_vios in the hash
      lock than there are references available, one of them becomes the
      new agent and continues as if no duplicate had been found.

   e. The agent attempts to compress its data. If the data does not
      compress, the data_vio will continue to step 8h to write its data
      directly. If it does compress, the data_vio is handed to the
      "packer," which attempts to combine compressed data into a single
      data block. Compression is only helpful if vdo can pack at least 2
      data_vios into a single data block. This means that a data_vio may
      wait in the packer for some time; if its data is never packed, the
      data_vio will proceed to step 8h to write its data directly.

   h. If a data_vio needs to write its data directly, it will use the
      data block allocation it obtained in
      step 3. It will write its data to that allocated physical block.

   i. After the data is written, if the data_vio is the agent of a hash
      lock, it allows the other data_vios in the hash lock to deduplicate
      against the newly written block, each proceeding to step 9 to record
      its new mapping.

   j. If the agent actually wrote new data (whether compressed or not),
      it updates the deduplication index with the location of the
      new data. The agent then releases the implicit hash zone lock.
9. The data_vio determines the previous mapping of the logical address
   (the "old physical mapping"), if any, and records it. This step
   requires a lock on the block map page holding the mapping.

   The data_vio then makes an entry in the recovery journal containing the
   logical block address, the old physical mapping, and the new physical
   mapping. Making this journal entry requires holding the implicit
   recovery journal lock.

   Once the recovery journal entry is made, the data_vio makes two slab
   journal entries: an increment entry for the new mapping, and a
   decrement entry for the old mapping. These two operations each require
   a lock on the slab holding the physical block in question.

   Finally, the data_vio updates the
   logical-to-physical mapping in the block map to point to the new
   physical address, after which its locks can be released.
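The net effect of the write path can be modeled in miniature: hash the
data, consult an index of recent names, verify the candidate block, and
either add a reference to an existing block or write a new one, finally
remapping the logical address and releasing the old mapping's reference.
This toy model collapses all of vdo's zones, journals, and locks into
plain dictionaries; blake2b stands in for the real hash.

```python
# Toy model of the write path's net effect. The dictionaries stand in for
# the slab depot, block map, and deduplication index; none of this is
# actual vdo code.
import hashlib

class ToyStore:
    def __init__(self):
        self.blocks = {}      # pbn -> data      ("slab depot")
        self.refcounts = {}   # pbn -> count     (reference counters)
        self.block_map = {}   # lba -> pbn       (block map)
        self.index = {}       # block name -> pbn (dedupe index "advice")
        self.next_pbn = 0

    def write(self, lba, data):
        name = hashlib.blake2b(data, digest_size=16).digest()
        pbn = self.index.get(name)
        if pbn is not None and self.blocks.get(pbn) == data:
            self.refcounts[pbn] += 1          # verified advice: deduplicate
        else:
            pbn = self.next_pbn               # no duplicate: write new data
            self.next_pbn += 1
            self.blocks[pbn] = data
            self.refcounts[pbn] = 1
            self.index[name] = pbn            # record the new location
        old = self.block_map.get(lba)         # old physical mapping, if any
        self.block_map[lba] = pbn             # new logical-to-physical map
        if old is not None:                   # decrement the old mapping
            self.refcounts[old] -= 1
            if self.refcounts[old] == 0:
                del self.blocks[old], self.refcounts[old]

    def read(self, lba):
        pbn = self.block_map.get(lba)
        return None if pbn is None else self.blocks[pbn]
```

Writing the same 4K of data to two logical addresses stores it once with a
reference count of two, which is the "avoid storing multiple copies" half
of the two-part problem described at the top of this document.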
If the logical address being read is locked by a write data_vio, the read
will copy the data from the write data_vio and return it. Otherwise, it
will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the data at the indicated
physical block address.
A write bio smaller than 4K is processed as
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are each processed like a normal data_vio operation.
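The read-modify-write sequence for a partial write is simple enough to
sketch directly. The block and sector sizes match those in the text; the
callback-based read/write interface is invented for the illustration.

```python
# Sketch of the partial-write path: read the containing 4K block, splice
# the new sectors over it, then write the whole modified block back.
BLOCK_SIZE = 4096
SECTOR_SIZE = 512

def read_modify_write(read_block, write_block, block_no, sector, data):
    """Apply a sub-block write of `data` starting at the given sector."""
    assert len(data) % SECTOR_SIZE == 0
    block = bytearray(read_block(block_no))      # read stage
    off = sector * SECTOR_SIZE
    block[off:off + len(data)] = data            # modify in memory
    write_block(block_no, bytes(block))          # write stage
```

Both the read and the write here would be full 4K data_vio operations in
vdo; only the in-memory splice is specific to the partial-write case.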
changes that have not yet been written out can be replayed from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read and its entries are replayed, restoring the
metadata to a consistent state.
*Read-only Rebuild*
If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost.
The vdo may be instructed to attempt a rebuild, despite
the possibility that data has been lost. During a read-only rebuild, the
vdo reconstructs its metadata from whatever remains valid on storage.
Although a read-only rebuild may lose
some data, it ensures that the block map and reference counts are
consistent with each other.