=============================
BTT - Block Translation Table
=============================


1. Introduction
===============

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
do exactly this, but they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
its heart is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
================

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout::


    Backing Store     +------->  Arena
  +---------------+   |   +------------------+
  |               |   |   | Arena info block |
  |    Arena 0    +---+   |       4K         |
  |     512G      |       +------------------+
  |               |       |                  |
  +---------------+       |                  |
  |               |       |                  |
  |    Arena 1    |       |   Data Blocks    |
  |     512G      |       |                  |
  |               |       |                  |
  +---------------+       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |               |       |                  |
  |               |       |                  |
  +---------------+       +------------------+
                          |                  |
                          |     BTT Map      |
                          |                  |
                          |                  |
                          +------------------+
                          |                  |
                          |     BTT Flog     |
                          |                  |
                          +------------------+
                          | Info block copy  |
                          |       4K         |
                          +------------------+


3. Theory of Operation
======================


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.

======== =============================================================
Bit      Description
======== =============================================================
31 - 30  Error and Zero flags - Used in the following way:

           == ==  ====================================================
           31 30  Description
           == ==  ====================================================
           0  0   Initial state. Reads return zeroes; Premap = Postmap
           0  1   Zero state: Reads return zeroes
           1  0   Error state: Reads fail; Writes clear 'E' bit
           1  1   Normal Block - has valid postmap
           == ==  ====================================================

29 - 0   Mappings to internal 'postmap' blocks
======== =============================================================
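
As a minimal sketch of how such an entry could be decoded, the helpers below
split a 32-bit map entry into its two flag bits and its 30-bit block number,
following the table above literally. The macro, enum and function names are
illustrative assumptions and are not taken from the driver.

.. code-block:: c

    #include <stdint.h>

    /* Layout as given in the table above; all names here are illustrative. */
    #define MAP_FLAG_SHIFT 30
    #define MAP_LBA_MASK   0x3fffffffu   /* bits 29 - 0 */

    /* Bits 31 - 30 read as a two-bit value, in the order the table lists them. */
    enum map_state {
        MAP_INITIAL = 0,   /* 0 0 */
        MAP_ZERO    = 1,   /* 0 1 */
        MAP_ERROR   = 2,   /* 1 0 */
        MAP_NORMAL  = 3    /* 1 1 */
    };

    static enum map_state map_entry_state(uint32_t entry)
    {
        return (enum map_state)(entry >> MAP_FLAG_SHIFT);
    }

    /*
     * Postmap ABA for a given premap ABA. In the initial state the entry
     * carries no mapping yet and premap == postmap, so the premap ABA is
     * returned unchanged.
     */
    static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap_aba)
    {
        if (map_entry_state(entry) == MAP_INITIAL)
            return premap_aba;
        return entry & MAP_LBA_MASK;
    }

A caller would check ``map_entry_state()`` first, and only use the postmap ABA
for the initial and normal states.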


Some of the terminology that will be subsequently used:

============    ================================================================
External LBA    LBA as made visible to upper layers.
ABA             Arena Block Address - Block offset/number within an arena
Premap ABA      The block offset into an arena, which was decided upon by range
                checking the External LBA
Postmap ABA     The block number in the "Data Blocks" area obtained after
                indirection from the map
nfree           The number of free blocks that are maintained at any given time.
                This is the number of concurrent writes that can happen to the
                arena.
============    ================================================================


For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find that the entry for
premap block 'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA
is 64.
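
The range check in this example can be sketched as a division and a remainder.
This is only an illustration: it assumes every arena contributes the same
number of external blocks, and the struct and function names are hypothetical.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical result type: which arena, and the offset inside it. */
    struct aba_location {
        uint32_t arena;        /* index of the arena the block falls into */
        uint64_t premap_aba;   /* block offset within that arena */
    };

    /*
     * Assumes all arenas expose blocks_per_arena external blocks
     * (blocks_per_arena must be non-zero). If arenas differ in size, a walk
     * over per-arena block counts would be needed instead.
     */
    static struct aba_location lba_to_premap(uint64_t external_lba,
                                             uint64_t blocks_per_arena)
    {
        struct aba_location loc;

        loc.arena = (uint32_t)(external_lba / blocks_per_arena);
        loc.premap_aba = external_lba % blocks_per_arena;
        return loc;
    }

With the numbers from the example (and counting in units of 1G for brevity),
``lba_to_premap(768, 512)`` yields arena index 1, i.e. the second arena, and a
premap ABA of 256.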


b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

========  =====================================================================
lba       The premap ABA that is being written to
old_map   The old postmap ABA - after 'this' write completes, this will be a
          free block.
new_map   The new postmap ABA. The map will be updated to reflect this
          lba->postmap_aba mapping, but we log it here in case we have to
          recover.
seq       Sequence number to mark which of the 2 sections of this flog entry is
          valid/newest. It cycles between 01->10->11->01 (binary) under normal
          operation, with 00 indicating an uninitialized state.
lba'      alternate lba entry
old_map'  alternate old postmap entry
new_map'  alternate new postmap entry
seq'      alternate sequence number.
========  =====================================================================

Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:

a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
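
The entry layout and the update rule can be sketched as follows. This is an
illustration under stated assumptions, not the driver's code: the struct and
function names are made up, and the flushes and ordering barriers that real
persistent memory needs are only noted in a comment.

.. code-block:: c

    #include <stdint.h>

    /* One 'section' of a flog entry; field names follow the table above. */
    struct flog_section {
        uint32_t lba;
        uint32_t old_map;
        uint32_t new_map;
        uint32_t seq;
    };

    /* Two sections (32 bytes), padded to 64 bytes. */
    struct flog_entry {
        struct flog_section sec[2];
        uint8_t pad[32];
    };

    /* 01 -> 10 -> 11 -> 01; 00 is reserved for "uninitialized". */
    static uint32_t next_seq(uint32_t seq)
    {
        return seq == 3 ? 1 : seq + 1;
    }

    /*
     * Index (0 or 1) of the newest section, or -1 if the two sequence
     * numbers are equal (both 00, or an inconsistent entry).
     */
    static int newest_section(const struct flog_entry *e)
    {
        uint32_t a = e->sec[0].seq, b = e->sec[1].seq;

        if (a == b)
            return -1;
        if (a == 0)
            return 1;
        if (b == 0)
            return 0;
        return next_seq(a) == b ? 1 : 0;
    }

    /* Overwrite the older section; the sequence number is stored last. */
    static void flog_write(struct flog_entry *e, uint32_t lba,
                           uint32_t old_map, uint32_t new_map)
    {
        int cur = newest_section(e);
        int victim = cur < 0 ? 0 : 1 - cur;
        uint32_t seq = cur < 0 ? 1 : next_seq(e->sec[cur].seq);

        e->sec[victim].lba = lba;
        e->sec[victim].old_map = old_map;
        e->sec[victim].new_map = new_map;
        /* On real media the stores above must be durable before this one. */
        e->sec[victim].seq = seq;
    }

Because the sequence number is written last, a torn update leaves the
previously newest section intact and still marked as the newest.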


c. The concept of lanes
-----------------------

While 'nfree' describes the number of IOs an arena can process
concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
process::

	nlanes = min(nfree, num_cpus)

A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.
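
A small sketch of the lane arithmetic follows. The formula for ``nlanes`` is
the one given above; mapping a CPU to a lane with a modulo is an assumption
made for illustration, as are all the function names.

.. code-block:: c

    #include <stdint.h>

    /* nlanes = min(nfree, num_cpus), as stated above. */
    static uint32_t compute_nlanes(uint32_t nfree, uint32_t num_cpus)
    {
        return nfree < num_cpus ? nfree : num_cpus;
    }

    /* Assumed CPU-to-lane mapping: wrap the CPU number around nlanes. */
    static uint32_t lane_for_cpu(uint32_t cpu, uint32_t nlanes)
    {
        return cpu % nlanes;
    }

    /* Lanes can be shared, and so need a lock, only when CPUs outnumber them. */
    static int lanes_need_lock(uint32_t num_cpus, uint32_t nlanes)
    {
        return num_cpus > nlanes;
    }

With 256 free blocks and 8 CPUs each CPU gets a lane of its own and no locking
is needed; with 512 CPUs, two CPUs map to each of the 256 lanes.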


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number], the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.
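
The RTT protocol described above can be sketched with C11 atomics. The table
size, the ``RTT_EMPTY`` marker and the function names are hypothetical and
chosen for illustration only.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdint.h>

    #define NFREE     4            /* illustrative; one RTT slot per lane */
    #define RTT_EMPTY UINT32_MAX   /* hypothetical "no read in flight" marker */

    static _Atomic uint32_t rtt[NFREE];

    static void rtt_init(void)
    {
        for (uint32_t i = 0; i < NFREE; i++)
            atomic_store(&rtt[i], RTT_EMPTY);
    }

    /* Reader: publish the postmap ABA before reading the block. */
    static void rtt_enter(uint32_t lane, uint32_t postmap_aba)
    {
        atomic_store(&rtt[lane], postmap_aba);
    }

    /* Reader: clear the slot once the read is complete. */
    static void rtt_leave(uint32_t lane)
    {
        atomic_store(&rtt[lane], RTT_EMPTY);
    }

    /* Writer: is any reader still using this free block? */
    static int rtt_block_busy(uint32_t postmap_aba)
    {
        for (uint32_t i = 0; i < NFREE; i++)
            if (atomic_load(&rtt[i]) == postmap_aba)
                return 1;
        return 0;
    }

A writer would loop on ``rtt_block_busy(free_block)`` until it returns 0, and
only then start writing to the block.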


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps::

	free[lane] = map[premap_aba]
	map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
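
The locked sequence can be sketched as below, with a simple C11 ``atomic_flag``
spinlock standing in for whatever lock type an implementation uses. The array
sizes and the ``map_swap`` name are assumptions made for the example.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdint.h>

    #define NFREE   4    /* illustrative sizes */
    #define NBLOCKS 16

    static atomic_flag map_locks[NFREE] = {
        ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
    };
    static uint32_t map[NBLOCKS];
    static uint32_t free_list[NFREE];

    /*
     * Perform the two-step sequence under the lock selected by
     * (premap_aba modulo nfree), so two writers to the same premap ABA
     * cannot both read the same old mapping.
     */
    static void map_swap(uint32_t lane, uint32_t premap_aba, uint32_t postmap_aba)
    {
        atomic_flag *lock = &map_locks[premap_aba % NFREE];

        while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
            ;   /* spin until the lock is free */

        free_list[lane] = map[premap_aba];
        map[premap_aba] = postmap_aba;

        atomic_flag_clear_explicit(lock, memory_order_release);
    }

Writers to different premap ABAs contend only when their addresses share the
same remainder modulo nfree.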


f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:

- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)
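
The rules above reduce to a single comparison per lane. In the sketch below the
struct and function names are illustrative, and the ``map`` array is assumed to
hold postmap ABAs with the flag bits already masked off.

.. code-block:: c

    #include <stdint.h>

    /* The most recent section of one lane's flog entry (illustrative type). */
    struct log_entry {
        uint32_t lba;       /* premap ABA */
        uint32_t old_map;   /* old postmap ABA */
        uint32_t new_map;   /* new postmap ABA */
    };

    /*
     * Free block for one lane. If the map already points at new_map, the
     * write completed and old_map is free. Otherwise the map update never
     * happened, so new_map is still free.
     */
    static uint32_t free_block_from_log(const struct log_entry *e,
                                        const uint32_t *map)
    {
        if (map[e->lba] == e->new_map)
            return e->old_map;
        return e->new_map;
    }

Running this over all 'nfree' entries yields the in-memory free list, one block
per lane.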


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1.  Convert external LBA to arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Read map to get the entry for this pre-map ABA
4.  Enter post-map ABA into RTT[lane]
5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6.  If ERROR flag set in map, end IO with EIO (go to step 8)
7.  Read data from this block
8.  Remove post-map ABA entry from RTT[lane]
9.  Release lane (and lane_lock)

Write:

1.  Convert external LBA to Arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Use lane to index into in-memory free list and obtain a new block, next flog
    index, next sequence number
4.  Scan the RTT to check if free block is present, and spin/wait if it is.
5.  Write data to this free block
6.  Read map to get the existing post-map ABA entry for this pre-map ABA
7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8.  Write new post-map ABA into map.
9.  Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)
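
The core of the write flow (steps 3 and 5 to 9) can be illustrated with a toy,
single-threaded, in-memory arena. Everything here is a simplification made for
the example: the sizes and names are hypothetical, lanes, locks and the RTT are
omitted, and each lane has a single flog section instead of two.

.. code-block:: c

    #include <stdint.h>
    #include <string.h>

    #define NBLOCKS 8    /* external (premap) blocks */
    #define NFREE   2    /* spare blocks, one per lane */
    #define BLKSZ   16   /* toy block size in bytes */

    struct toy_arena {
        uint8_t  data[NBLOCKS + NFREE][BLKSZ];
        uint32_t map[NBLOCKS];      /* premap ABA -> postmap ABA */
        uint32_t free_blk[NFREE];   /* in-memory free list, per lane */
        struct { uint32_t lba, old_map, new_map, seq; } flog[NFREE];
    };

    static void toy_init(struct toy_arena *a)
    {
        memset(a, 0, sizeof(*a));
        for (uint32_t i = 0; i < NBLOCKS; i++)
            a->map[i] = i;                  /* identity mapping */
        for (uint32_t i = 0; i < NFREE; i++)
            a->free_blk[i] = NBLOCKS + i;   /* spares start out free */
    }

    static void toy_write(struct toy_arena *a, uint32_t lane,
                          uint32_t premap_aba, const uint8_t *buf)
    {
        uint32_t new_blk = a->free_blk[lane];         /* step 3 */
        uint32_t seq = a->flog[lane].seq;
        uint32_t old_blk;

        memcpy(a->data[new_blk], buf, BLKSZ);         /* step 5 */
        old_blk = a->map[premap_aba];                 /* step 6 */
        a->flog[lane].lba = premap_aba;               /* step 7 */
        a->flog[lane].old_map = old_blk;
        a->flog[lane].new_map = new_blk;
        a->flog[lane].seq = seq == 3 ? 1 : seq + 1;   /* 00 -> 01, then cycles */
        a->map[premap_aba] = new_blk;                 /* step 8 */
        a->free_blk[lane] = old_blk;                  /* step 9 */
    }

    static const uint8_t *toy_read(const struct toy_arena *a, uint32_t premap_aba)
    {
        return a->data[a->map[premap_aba]];
    }

Note that the old data block is never modified by the write: until the map
entry is switched in step 8, a reader of the same premap ABA still sees the
complete old contents.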


4. Error Handling
=================

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:

- Info block checksum does not match (and recovering from the copy also fails)
- The internal available blocks are not all uniquely and entirely addressed by
  the sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions is encountered, the arena is put into a
read-only state using a flag in the info block.


5. Usage
========

The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1].

For example, the ndctl command line to set up a BTT with a 4k sector size is::

    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k

See ndctl create-namespace --help for more options.

[1]: https://github.com/pmem/ndctl