=============================
BTT - Block Translation Table
=============================


1. Introduction
===============

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata.
At its heart is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
================

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout::


    Backing Store     +------->  Arena
  +---------------+   |   +------------------+
  |               |   |   | Arena info block |
  |    Arena 0    +---+   |       4K         |
  |     512G      |       +------------------+
  |               |       |                  |
  +---------------+       |                  |
  |               |       |                  |
  |    Arena 1    |       |   Data Blocks    |
  |     512G      |       |                  |
  |               |       |                  |
  +---------------+       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |               |       |                  |
  |               |       |                  |
  +---------------+       +------------------+
                          |                  |
                          |     BTT Map      |
                          |                  |
                          |                  |
                          +------------------+
                          |                  |
                          |     BTT Flog     |
                          |                  |
                          +------------------+
                          | Info block copy  |
                          |       4K         |
                          +------------------+


3. Theory of Operation
======================


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining 30 bits form the internal block number.
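
As a minimal sketch of this split, the helpers below separate a 32-bit map
entry into its two flag bits and its 30-bit block number, and put them back
together. All names here are hypothetical and chosen for illustration; byte
order conversion of the stored value is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the 2 + 30 bit split comes from the text. */
#define MAP_FLAG_SHIFT	30
#define MAP_BLOCK_MASK	((UINT32_C(1) << MAP_FLAG_SHIFT) - 1)

/* Bits 31-30: the two flag bits, returned as a value in 0..3. */
static uint32_t map_entry_flags(uint32_t entry)
{
	return entry >> MAP_FLAG_SHIFT;
}

/* Bits 29-0: the internal block number. */
static uint32_t map_entry_block(uint32_t entry)
{
	return entry & MAP_BLOCK_MASK;
}

/* Combine flag bits and a block number into one 32-bit map entry. */
static uint32_t map_entry_make(uint32_t flags, uint32_t block)
{
	assert(flags <= 3 && block <= MAP_BLOCK_MASK);
	return (flags << MAP_FLAG_SHIFT) | block;
}
```

With both flag bits set and block number 64, ``map_entry_make(3, 64)`` yields
``0xC0000040``, and the two accessors recover 3 and 64 from it.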

======== =============================================================
Bit      Description
======== =============================================================
31 - 30  Error and Zero flags - Used in the following way:

           == ==  ====================================================
           31 30  Description
           == ==  ====================================================
           0  0   Initial state. Reads return zeroes; Premap = Postmap
           0  1   Zero state: Reads return zeroes
           1  0   Error state: Reads fail; Writes clear 'E' bit
           1  1   Normal Block - has valid postmap
           == ==  ====================================================

29 - 0   Mappings to internal 'postmap' blocks
======== =============================================================


Some of the terminology that will be subsequently used:

============  ================================================================
External LBA  LBA as made visible to upper layers.
ABA           Arena Block Address - Block offset/number within an arena
Premap ABA    The block offset into an arena, which was decided upon by range
              checking the External LBA
Postmap ABA   The block number in the "Data Blocks" area obtained after
              indirection from the map
nfree         The number of free blocks that are maintained at any given time.
              This is the number of concurrent writes that can happen to the
              arena.
============  ================================================================


For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find out the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.


b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog.
'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

========  =====================================================================
lba       The premap ABA that is being written to
old_map   The old postmap ABA - after 'this' write completes, this will be a
          free block.
new_map   The new postmap ABA. The map will be updated to reflect this
          lba->postmap_aba mapping, but we log it here in case we have to
          recover.
seq       Sequence number to mark which of the 2 sections of this flog entry is
          valid/newest. It cycles between 01->10->11->01 (binary) under normal
          operation, with 00 indicating an uninitialized state.
lba'      alternate lba entry
old_map'  alternate old postmap entry
new_map'  alternate new postmap entry
seq'      alternate sequence number.
========  =====================================================================

Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:

a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.


c. The concept of lanes
-----------------------

While 'nfree' describes the number of concurrent IOs an arena can process,
'nlanes' is the number of IOs the BTT device as a whole can process::

    nlanes = min(nfree, num_cpus)

A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block.
A writer started writing to the same external LBA, and finished the write,
updating the map for that external LBA to point to its new postmap ABA. At this
point the internal, postmap block that the reader is (still) reading has been
inserted into the list of free blocks. If another write comes in for the same
LBA, it can grab this free block, and start writing to it, causing the reader to
read incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
the postmap ABA it is reading into rtt[lane_number], and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits until
the reader clears the RTT entry, and only then starts writing to it.


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA.
There can be a race in the following sequence of steps::

    free[lane] = map[premap_aba]
    map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).


f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:

- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1.  Convert external LBA to arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Read map to get the entry for this pre-map ABA
4.  Enter post-map ABA into RTT[lane]
5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6.  If ERROR flag set in map, end IO with EIO (go to step 8)
7.  Read data from this block
8.  Remove post-map ABA entry from RTT[lane]
9.  Release lane (and lane_lock)

Write:

1.  Convert external LBA to Arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Use lane to index into in-memory free list and obtain a new block, next flog
    index, next sequence number
4.  Scan the RTT to check if free block is present, and spin/wait if it is.
5.  Write data to this free block
6.  Read map to get the existing post-map ABA entry for this pre-map ABA
7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8.  Write new post-map ABA into map.
9.  Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)


4. Error Handling
=================

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:

- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
  sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions are encountered, the arena is put into a
read-only state using a flag in the info block.


5. Usage
========

The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1].

For example, the ndctl command line to set up a BTT with a 4k sector size is::

    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k

See ndctl create-namespace --help for more options.

[1]: https://github.com/pmem/ndctl
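
As an illustration of the flog rules in sections 3b and 3f, the following is a
rough sketch of how the newer of the two sections of a flog entry could be
picked from the sequence numbers, and how the free block for a lane could be
derived from it. The struct and function names are hypothetical. Treating a
lone initialized section as the newest one, and passing the map as a plain
array of postmap ABAs with the flag bits already stripped, are assumptions
made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* One of the two sections of a flog entry; field names follow section 3b. */
struct flog_section {
	uint32_t lba;		/* premap ABA being written */
	uint32_t old_map;	/* postmap ABA before the write */
	uint32_t new_map;	/* postmap ABA after the write */
	uint32_t seq;		/* 1 -> 2 -> 3 -> 1; 0 means uninitialized */
};

/* Advance a valid sequence number along the 01 -> 10 -> 11 -> 01 cycle. */
static uint32_t flog_next_seq(uint32_t seq)
{
	assert(seq >= 1 && seq <= 3);
	return seq == 3 ? 1 : seq + 1;
}

/*
 * Return the index (0 or 1) of the newer section, or -1 if the pair cannot
 * be ordered. Treating a lone initialized section as newest is an assumption.
 */
static int flog_newest(const struct flog_section s[2])
{
	if (s[0].seq > 3 || s[1].seq > 3 || s[0].seq == s[1].seq)
		return -1;
	if (s[0].seq == 0)
		return 1;
	if (s[1].seq == 0)
		return 0;
	/* The newer section holds the successor of the older one's seq. */
	return flog_next_seq(s[0].seq) == s[1].seq ? 1 : 0;
}

/*
 * Apply the reconstruction rules to one lane. 'map' is a hypothetical
 * in-memory array of postmap ABAs with the flag bits already stripped.
 */
static uint32_t flog_free_block(const struct flog_section *e,
				const uint32_t *map)
{
	if (map[e->lba] == e->new_map)
		return e->old_map;	/* map was updated: old block is free */
	return e->new_map;		/* update never landed: new block is free */
}
```

The second branch of ``flog_free_block`` corresponds to the power-fail case:
the flog entry was written but the map still points at the old block, so the
block that was being written is the one returned to the free list.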