.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device.  The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC and sequence
  numbers.
* *Multi-tree indexing* (indexing trees sharded by logical address) for
  high pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.


Constructor
===========

::

    pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``               Any DAX-capable block device (``/dev/pmem0``…).
                            All metadata *and* cached blocks are stored here.

``backing_dev``             The slow block device to be cached.

``cache_mode``              Optional. Only ``writeback`` is accepted at the
                            moment.

``data_crc``                Optional; defaults to ``false``.

                            * ``true``  – store a CRC32 for every cached entry
                              and verify it on reads
                            * ``false`` – skip CRC (faster)
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).

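
The table line can also be assembled by a small helper, which keeps the
optional-argument count next to the arguments it counts.  The following is
a minimal sketch rather than part of dm-pcache: ``pcache_table`` is a
hypothetical function name, and the sketch always passes both optional
key/value pairs (hence the count of 4, as in the example above).

.. code-block:: shell

   # Build a dm-pcache table line (pcache_table is a hypothetical helper).
   # Usage: pcache_table <sectors> <cache_dev> <backing_dev> [true|false]
   pcache_table() {
       sectors=$1 cache_dev=$2 backing_dev=$3 data_crc=${4:-false}

       case $data_crc in
           true|false) ;;
           *) echo "data_crc must be true or false" >&2; return 1 ;;
       esac

       # Four optional words follow the count: two key/value pairs.
       printf '0 %s pcache %s %s 4 cache_mode writeback data_crc %s\n' \
           "$sectors" "$cache_dev" "$backing_dev" "$data_crc"
   }

   # Print the table for a backing device of 2097152 sectors (1 GiB).
   pcache_table 2097152 /dev/pmem0 /dev/sdb true

On a real system the sector count would come from
``blockdev --getsz /dev/sdb`` and the printed line would be passed to
``dmsetup create pcache_sdb --table``.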

Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  * Bit 0 – DATA_CRC enabled
                                 * Bit 1 – INIT_DONE (cache initialised)
                                 * Bits 2-5 – cache mode (0 == write-back)

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================

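
For scripting, the status line can be split into named fields.  The sketch
below is illustrative only: ``parse_pcache_status`` is a hypothetical
helper, the sample line is made up, and the code assumes that
``cache_flags`` is printed in a form shell arithmetic accepts (decimal or
``0x``-prefixed hex).  ``dmsetup status`` prefixes every target's status
with ``<start> <length> <target_type>``, so the pcache fields begin at the
fourth word.

.. code-block:: shell

   # Split a dm-pcache status line into named fields.
   parse_pcache_status() {
       # Word-split the line into positional parameters.
       set -- $1
       [ "$#" -ge 12 ] || { echo "unexpected status line" >&2; return 1; }
       shift 3                     # drop "<start> <length> pcache"

       flags=$6
       printf 'seg_total=%s cache_segs=%s segs_used=%s gc_percent=%s\n' \
           "$2" "$3" "$4" "$5"
       printf 'key_head=%s dirty_tail=%s key_tail=%s\n' "$7" "$8" "$9"
       # Bit 0: DATA_CRC, bit 1: INIT_DONE, bits 2-5: cache mode.
       printf 'data_crc=%s init_done=%s cache_mode=%s\n' \
           "$(( $flags & 1 ))" \
           "$(( ($flags >> 1) & 1 ))" \
           "$(( ($flags >> 2) & 15 ))"
   }

   # Hypothetical sample; real input would be "$(dmsetup status pcache_sdb)".
   parse_pcache_status "0 2097152 pcache 0 64 62 10 70 3 0:4096 0:2048 0:0"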

Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>

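
Because only values from 0 to 90 are accepted, a wrapper script may want
to check the value before sending the message.  A minimal sketch follows;
``valid_gc_percent`` and ``set_gc_percent`` are hypothetical helper names,
not part of dm-pcache or dmsetup.

.. code-block:: shell

   # Succeed only for an integer in the documented 0-90 range.
   valid_gc_percent() {
       case $1 in
           ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
       esac
       [ "$1" -le 90 ]
   }

   # Validate, then send the gc_percent message to the given dm device.
   set_gc_percent() {
       dev=$1 percent=$2
       valid_gc_percent "$percent" || {
           echo "gc_percent must be an integer in 0-90" >&2
           return 1
       }
       dmsetup message "$dev" 0 gc_percent "$percent"
   }

For example, ``set_gc_percent pcache_sdb 80`` sends the message, while
``set_gc_percent pcache_sdb 95`` fails before dmsetup is invoked.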

Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
``backing_dev``         Any block device (SSD/HDD/loop/LVM, etc.).
``cache_dev``           DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the backing device and where it
  lives inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* *key_tail* and *dirty_tail* delimit the live/dead and clean/dirty ksets,
  respectively.
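
As a quick sizing aid, the fixed segment size gives an upper bound on how
many segments a pmem device can hold.  The sketch below is an estimate
only: ``max_segments`` is a hypothetical helper, and the real
``seg_total`` may be lower because part of the device holds metadata.

.. code-block:: shell

   # Segment size used by dm-pcache: 16 MiB.
   SEG_SIZE=$((16 * 1024 * 1024))

   # Upper bound on the number of segments in a device of <bytes> bytes.
   max_segments() {
       echo $(( $1 / SEG_SIZE ))
   }

   # A 1 GiB pmem device holds at most 64 segments.
   max_segments $((1024 * 1024 * 1024))

On a real device the size could be taken from
``blockdev --getsize64 /dev/pmem0``.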

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*.  A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.
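
Write-back progress can be observed from user space by comparing
*dirty_tail* with *key_head* in the status line.  The sketch below rests
on an assumption that this document does not state explicitly: when both
positions are equal, no key-set is waiting for write-back.
``pcache_writeback_idle`` is a hypothetical helper name.

.. code-block:: shell

   # Succeed when dirty_tail has caught up with key_head.
   # Returns 2 if the line does not look like a pcache status line.
   pcache_writeback_idle() {
       set -- $1
       [ "$#" -ge 12 ] || return 2
       # Word 10 is <key_head_seg>:<key_head_off>,
       # word 11 is <dirty_tail_seg>:<dirty_tail_off>.
       [ "${10}" = "${11}" ]
   }

   # Hypothetical sample; real input would be "$(dmsetup status pcache_sdb)".
   if pcache_writeback_idle "0 2097152 pcache 0 64 62 10 70 3 0:4096 0:2048 0:0"
   then
       echo "no write-back pending"
   else
       echo "write-back pending"
   fi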

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``.  It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
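
The trigger condition can be evaluated from the status fields.  This is a
sketch using shell integer arithmetic; ``gc_should_start`` is a
hypothetical helper, and the driver's own rounding may differ from the
truncating division used here.

.. code-block:: shell

   # Succeed when segs_used >= seg_total * gc_percent / 100.
   gc_should_start() {
       segs_used=$1 seg_total=$2 gc_percent=$3
       [ "$segs_used" -ge $(( seg_total * gc_percent / 100 )) ]
   }

   # With 64 segments and gc_percent=70 the threshold is 64 * 70 / 100 = 44.
   if gc_should_start 45 64 70; then
       echo "GC would run"
   else
       echo "GC idle"
   fi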

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached
data range when it is inserted and stores it in the on-media key.  Reads
validate the CRC before copying to the caller.

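
The store-then-verify pattern can be illustrated in user space.  The
sketch below is only an analogy: it uses ``cksum(1)`` as a stand-in
checksum, which is not claimed to match the CRC32 the driver computes, and
``crc_of`` and ``verify_read`` are hypothetical helper names.

.. code-block:: shell

   # Checksum of a string; cksum prints "<crc> <byte count>", keep the CRC.
   crc_of() {
       printf '%s' "$1" | cksum | cut -d ' ' -f 1
   }

   # "Read" path: recompute the checksum and compare it with the stored one.
   verify_read() {
       [ "$(crc_of "$1")" = "$2" ]
   }

   # "Insert" path: remember the checksum alongside the data.
   data='cached block contents'
   stored_crc=$(crc_of "$data")

   verify_read "$data" "$stored_crc" && echo "CRC ok"
   verify_read "corrupted contents" "$stored_crc" || echo "CRC mismatch"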

Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and
  initialisation is aborted.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.


Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC...)
  are planned.
* Table reload is currently not supported.
* Discard is not supported yet; support is planned.


Example workflow
================

.. code-block:: shell

   # 1.  Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2.  Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3.  Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4.  Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5.  Shut down
   umount /mnt
   dmsetup remove pcache_sdb


``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!
