1ae4a0502SMauro Carvalho Chehab============================= 2ae4a0502SMauro Carvalho ChehabBTT - Block Translation Table 3ae4a0502SMauro Carvalho Chehab============================= 4ae4a0502SMauro Carvalho Chehab 5ae4a0502SMauro Carvalho Chehab 6ae4a0502SMauro Carvalho Chehab1. Introduction 7ae4a0502SMauro Carvalho Chehab=============== 8ae4a0502SMauro Carvalho Chehab 9ae4a0502SMauro Carvalho ChehabPersistent memory based storage is able to perform IO at byte (or more 10ae4a0502SMauro Carvalho Chehabaccurately, cache line) granularity. However, we often want to expose such 11ae4a0502SMauro Carvalho Chehabstorage as traditional block devices. The block drivers for persistent memory 12ae4a0502SMauro Carvalho Chehabwill do exactly this. However, they do not provide any atomicity guarantees. 13ae4a0502SMauro Carvalho ChehabTraditional SSDs typically provide protection against torn sectors in hardware, 14ae4a0502SMauro Carvalho Chehabusing stored energy in capacitors to complete in-flight block writes, or perhaps 15ae4a0502SMauro Carvalho Chehabin firmware. We don't have this luxury with persistent memory - if a write is in 16ae4a0502SMauro Carvalho Chehabprogress, and we experience a power failure, the block will contain a mix of old 17ae4a0502SMauro Carvalho Chehaband new data. Applications may not be prepared to handle such a scenario. 18ae4a0502SMauro Carvalho Chehab 19ae4a0502SMauro Carvalho ChehabThe Block Translation Table (BTT) provides atomic sector update semantics for 20ae4a0502SMauro Carvalho Chehabpersistent memory devices, so that applications that rely on sector writes not 21ae4a0502SMauro Carvalho Chehabbeing torn can continue to do so. The BTT manifests itself as a stacked block 22ae4a0502SMauro Carvalho Chehabdevice, and reserves a portion of the underlying storage for its metadata. At 23ae4a0502SMauro Carvalho Chehabthe heart of it, is an indirection table that re-maps all the blocks on the 24ae4a0502SMauro Carvalho Chehabvolume. It can be thought of as an extremely simple file system that only 25ae4a0502SMauro Carvalho Chehabprovides atomic sector updates. 26ae4a0502SMauro Carvalho Chehab 27ae4a0502SMauro Carvalho Chehab 28ae4a0502SMauro Carvalho Chehab2. Static Layout 29ae4a0502SMauro Carvalho Chehab================ 30ae4a0502SMauro Carvalho Chehab 31ae4a0502SMauro Carvalho ChehabThe underlying storage on which a BTT can be laid out is not limited in any way. 32ae4a0502SMauro Carvalho ChehabThe BTT, however, splits the available space into chunks of up to 512 GiB, 33ae4a0502SMauro Carvalho Chehabcalled "Arenas". 34ae4a0502SMauro Carvalho Chehab 35ae4a0502SMauro Carvalho ChehabEach arena follows the same layout for its metadata, and all references in an 36ae4a0502SMauro Carvalho Chehabarena are internal to it (with the exception of one field that points to the 37ae4a0502SMauro Carvalho Chehabnext arena). The following depicts the "On-disk" metadata layout:: 38ae4a0502SMauro Carvalho Chehab 39ae4a0502SMauro Carvalho Chehab 40ae4a0502SMauro Carvalho Chehab Backing Store +-------> Arena 41ae4a0502SMauro Carvalho Chehab +---------------+ | +------------------+ 42ae4a0502SMauro Carvalho Chehab | | | | Arena info block | 43ae4a0502SMauro Carvalho Chehab | Arena 0 +---+ | 4K | 44ae4a0502SMauro Carvalho Chehab | 512G | +------------------+ 45ae4a0502SMauro Carvalho Chehab | | | | 46ae4a0502SMauro Carvalho Chehab +---------------+ | | 47ae4a0502SMauro Carvalho Chehab | | | | 48ae4a0502SMauro Carvalho Chehab | Arena 1 | | Data Blocks | 49ae4a0502SMauro Carvalho Chehab | 512G | | | 50ae4a0502SMauro Carvalho Chehab | | | | 51ae4a0502SMauro Carvalho Chehab +---------------+ | | 52ae4a0502SMauro Carvalho Chehab | . | | | 53ae4a0502SMauro Carvalho Chehab | . | | | 54ae4a0502SMauro Carvalho Chehab | . | | | 55ae4a0502SMauro Carvalho Chehab | | | | 56ae4a0502SMauro Carvalho Chehab | | | | 57ae4a0502SMauro Carvalho Chehab +---------------+ +------------------+ 58ae4a0502SMauro Carvalho Chehab | | 59ae4a0502SMauro Carvalho Chehab | BTT Map | 60ae4a0502SMauro Carvalho Chehab | | 61ae4a0502SMauro Carvalho Chehab | | 62ae4a0502SMauro Carvalho Chehab +------------------+ 63ae4a0502SMauro Carvalho Chehab | | 64ae4a0502SMauro Carvalho Chehab | BTT Flog | 65ae4a0502SMauro Carvalho Chehab | | 66ae4a0502SMauro Carvalho Chehab +------------------+ 67ae4a0502SMauro Carvalho Chehab | Info block copy | 68ae4a0502SMauro Carvalho Chehab | 4K | 69ae4a0502SMauro Carvalho Chehab +------------------+ 70ae4a0502SMauro Carvalho Chehab 71ae4a0502SMauro Carvalho Chehab 72ae4a0502SMauro Carvalho Chehab3. Theory of Operation 73ae4a0502SMauro Carvalho Chehab====================== 74ae4a0502SMauro Carvalho Chehab 75ae4a0502SMauro Carvalho Chehab 76ae4a0502SMauro Carvalho Chehaba. The BTT Map 77ae4a0502SMauro Carvalho Chehab-------------- 78ae4a0502SMauro Carvalho Chehab 79ae4a0502SMauro Carvalho ChehabThe map is a simple lookup/indirection table that maps an LBA to an internal 80ae4a0502SMauro Carvalho Chehabblock. Each map entry is 32 bits. The two most significant bits are special 81ae4a0502SMauro Carvalho Chehabflags, and the remaining form the internal block number. 82ae4a0502SMauro Carvalho Chehab 83ae4a0502SMauro Carvalho Chehab======== ============================================================= 84ae4a0502SMauro Carvalho ChehabBit Description 85ae4a0502SMauro Carvalho Chehab======== ============================================================= 86*eddeed12SMauro Carvalho Chehab31 - 30 Error and Zero flags - Used in the following way:: 87ae4a0502SMauro Carvalho Chehab 88ae4a0502SMauro Carvalho Chehab == == ==================================================== 89ae4a0502SMauro Carvalho Chehab 31 30 Description 90ae4a0502SMauro Carvalho Chehab == == ==================================================== 91ae4a0502SMauro Carvalho Chehab 0 0 Initial state. Reads return zeroes; Premap = Postmap 92ae4a0502SMauro Carvalho Chehab 0 1 Zero state: Reads return zeroes 93ae4a0502SMauro Carvalho Chehab 1 0 Error state: Reads fail; Writes clear 'E' bit 94ae4a0502SMauro Carvalho Chehab 1 1 Normal Block – has valid postmap 95ae4a0502SMauro Carvalho Chehab == == ==================================================== 96ae4a0502SMauro Carvalho Chehab 97ae4a0502SMauro Carvalho Chehab29 - 0 Mappings to internal 'postmap' blocks 98ae4a0502SMauro Carvalho Chehab======== ============================================================= 99ae4a0502SMauro Carvalho Chehab 100ae4a0502SMauro Carvalho Chehab 101ae4a0502SMauro Carvalho ChehabSome of the terminology that will be subsequently used: 102ae4a0502SMauro Carvalho Chehab 103ae4a0502SMauro Carvalho Chehab============ ================================================================ 104ae4a0502SMauro Carvalho ChehabExternal LBA LBA as made visible to upper layers. 105ae4a0502SMauro Carvalho ChehabABA Arena Block Address - Block offset/number within an arena 106ae4a0502SMauro Carvalho ChehabPremap ABA The block offset into an arena, which was decided upon by range 107ae4a0502SMauro Carvalho Chehab checking the External LBA 108ae4a0502SMauro Carvalho ChehabPostmap ABA The block number in the "Data Blocks" area obtained after 109ae4a0502SMauro Carvalho Chehab indirection from the map 110ae4a0502SMauro Carvalho Chehabnfree The number of free blocks that are maintained at any given time. 111ae4a0502SMauro Carvalho Chehab This is the number of concurrent writes that can happen to the 112ae4a0502SMauro Carvalho Chehab arena. 113ae4a0502SMauro Carvalho Chehab============ ================================================================ 114ae4a0502SMauro Carvalho Chehab 115ae4a0502SMauro Carvalho Chehab 116ae4a0502SMauro Carvalho ChehabFor example, after adding a BTT, we surface a disk of 1024G. We get a read for 117ae4a0502SMauro Carvalho Chehabthe external LBA at 768G. This falls into the second arena, and of the 512G 118ae4a0502SMauro Carvalho Chehabworth of blocks that this arena contributes, this block is at 256G. Thus, the 119ae4a0502SMauro Carvalho Chehabpremap ABA is 256G. We now refer to the map, and find out the mapping for block 120ae4a0502SMauro Carvalho Chehab'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. 121ae4a0502SMauro Carvalho Chehab 122ae4a0502SMauro Carvalho Chehab 123ae4a0502SMauro Carvalho Chehabb. The BTT Flog 124ae4a0502SMauro Carvalho Chehab--------------- 125ae4a0502SMauro Carvalho Chehab 126ae4a0502SMauro Carvalho ChehabThe BTT provides sector atomicity by making every write an "allocating write", 127ae4a0502SMauro Carvalho Chehabi.e. Every write goes to a "free" block. A running list of free blocks is 128ae4a0502SMauro Carvalho Chehabmaintained in the form of the BTT flog. 'Flog' is a combination of the words 129ae4a0502SMauro Carvalho Chehab"free list" and "log". The flog contains 'nfree' entries, and an entry contains: 130ae4a0502SMauro Carvalho Chehab 131ae4a0502SMauro Carvalho Chehab======== ===================================================================== 132ae4a0502SMauro Carvalho Chehablba The premap ABA that is being written to 133ae4a0502SMauro Carvalho Chehabold_map The old postmap ABA - after 'this' write completes, this will be a 134ae4a0502SMauro Carvalho Chehab free block. 135ae4a0502SMauro Carvalho Chehabnew_map The new postmap ABA. The map will up updated to reflect this 136ae4a0502SMauro Carvalho Chehab lba->postmap_aba mapping, but we log it here in case we have to 137ae4a0502SMauro Carvalho Chehab recover. 138ae4a0502SMauro Carvalho Chehabseq Sequence number to mark which of the 2 sections of this flog entry is 139ae4a0502SMauro Carvalho Chehab valid/newest. It cycles between 01->10->11->01 (binary) under normal 140ae4a0502SMauro Carvalho Chehab operation, with 00 indicating an uninitialized state. 141ae4a0502SMauro Carvalho Chehablba' alternate lba entry 142ae4a0502SMauro Carvalho Chehabold_map' alternate old postmap entry 143ae4a0502SMauro Carvalho Chehabnew_map' alternate new postmap entry 144ae4a0502SMauro Carvalho Chehabseq' alternate sequence number. 145ae4a0502SMauro Carvalho Chehab======== ===================================================================== 146ae4a0502SMauro Carvalho Chehab 147ae4a0502SMauro Carvalho ChehabEach of the above fields is 32-bit, making one entry 32 bytes. Entries are also 148ae4a0502SMauro Carvalho Chehabpadded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are 149ae4a0502SMauro Carvalho Chehabdone such that for any entry being written, it: 150ae4a0502SMauro Carvalho Chehaba. overwrites the 'old' section in the entry based on sequence numbers 151ae4a0502SMauro Carvalho Chehabb. writes the 'new' section such that the sequence number is written last. 152ae4a0502SMauro Carvalho Chehab 153ae4a0502SMauro Carvalho Chehab 154ae4a0502SMauro Carvalho Chehabc. The concept of lanes 155ae4a0502SMauro Carvalho Chehab----------------------- 156ae4a0502SMauro Carvalho Chehab 157ae4a0502SMauro Carvalho ChehabWhile 'nfree' describes the number of concurrent IOs an arena can process 158ae4a0502SMauro Carvalho Chehabconcurrently, 'nlanes' is the number of IOs the BTT device as a whole can 159ae4a0502SMauro Carvalho Chehabprocess:: 160ae4a0502SMauro Carvalho Chehab 161ae4a0502SMauro Carvalho Chehab nlanes = min(nfree, num_cpus) 162ae4a0502SMauro Carvalho Chehab 163ae4a0502SMauro Carvalho ChehabA lane number is obtained at the start of any IO, and is used for indexing into 164ae4a0502SMauro Carvalho Chehaball the on-disk and in-memory data structures for the duration of the IO. If 165ae4a0502SMauro Carvalho Chehabthere are more CPUs than the max number of available lanes, than lanes are 166ae4a0502SMauro Carvalho Chehabprotected by spinlocks. 167ae4a0502SMauro Carvalho Chehab 168ae4a0502SMauro Carvalho Chehab 169ae4a0502SMauro Carvalho Chehabd. In-memory data structure: Read Tracking Table (RTT) 170ae4a0502SMauro Carvalho Chehab------------------------------------------------------ 171ae4a0502SMauro Carvalho Chehab 172ae4a0502SMauro Carvalho ChehabConsider a case where we have two threads, one doing reads and the other, 173ae4a0502SMauro Carvalho Chehabwrites. We can hit a condition where the writer thread grabs a free block to do 174ae4a0502SMauro Carvalho Chehaba new IO, but the (slow) reader thread is still reading from it. In other words, 175ae4a0502SMauro Carvalho Chehabthe reader consulted a map entry, and started reading the corresponding block. A 176ae4a0502SMauro Carvalho Chehabwriter started writing to the same external LBA, and finished the write updating 177ae4a0502SMauro Carvalho Chehabthe map for that external LBA to point to its new postmap ABA. At this point the 178ae4a0502SMauro Carvalho Chehabinternal, postmap block that the reader is (still) reading has been inserted 179ae4a0502SMauro Carvalho Chehabinto the list of free blocks. If another write comes in for the same LBA, it can 180ae4a0502SMauro Carvalho Chehabgrab this free block, and start writing to it, causing the reader to read 181ae4a0502SMauro Carvalho Chehabincorrect data. To prevent this, we introduce the RTT. 182ae4a0502SMauro Carvalho Chehab 183ae4a0502SMauro Carvalho ChehabThe RTT is a simple, per arena table with 'nfree' entries. Every reader inserts 184ae4a0502SMauro Carvalho Chehabinto rtt[lane_number], the postmap ABA it is reading, and clears it after the 185ae4a0502SMauro Carvalho Chehabread is complete. Every writer thread, after grabbing a free block, checks the 186ae4a0502SMauro Carvalho ChehabRTT for its presence. If the postmap free block is in the RTT, it waits till the 187ae4a0502SMauro Carvalho Chehabreader clears the RTT entry, and only then starts writing to it. 188ae4a0502SMauro Carvalho Chehab 189ae4a0502SMauro Carvalho Chehab 190ae4a0502SMauro Carvalho Chehabe. In-memory data structure: map locks 191ae4a0502SMauro Carvalho Chehab-------------------------------------- 192ae4a0502SMauro Carvalho Chehab 193ae4a0502SMauro Carvalho ChehabConsider a case where two writer threads are writing to the same LBA. There can 194ae4a0502SMauro Carvalho Chehabbe a race in the following sequence of steps:: 195ae4a0502SMauro Carvalho Chehab 196ae4a0502SMauro Carvalho Chehab free[lane] = map[premap_aba] 197ae4a0502SMauro Carvalho Chehab map[premap_aba] = postmap_aba 198ae4a0502SMauro Carvalho Chehab 199ae4a0502SMauro Carvalho ChehabBoth threads can update their respective free[lane] with the same old, freed 200ae4a0502SMauro Carvalho Chehabpostmap_aba. This has made the layout inconsistent by losing a free entry, and 201ae4a0502SMauro Carvalho Chehabat the same time, duplicating another free entry for two lanes. 202ae4a0502SMauro Carvalho Chehab 203ae4a0502SMauro Carvalho ChehabTo solve this, we could have a single map lock (per arena) that has to be taken 204ae4a0502SMauro Carvalho Chehabbefore performing the above sequence, but we feel that could be too contentious. 205ae4a0502SMauro Carvalho ChehabInstead we use an array of (nfree) map_locks that is indexed by 206ae4a0502SMauro Carvalho Chehab(premap_aba modulo nfree). 207ae4a0502SMauro Carvalho Chehab 208ae4a0502SMauro Carvalho Chehab 209ae4a0502SMauro Carvalho Chehabf. Reconstruction from the Flog 210ae4a0502SMauro Carvalho Chehab------------------------------- 211ae4a0502SMauro Carvalho Chehab 212ae4a0502SMauro Carvalho ChehabOn startup, we analyze the BTT flog to create our list of free blocks. We walk 213ae4a0502SMauro Carvalho Chehabthrough all the entries, and for each lane, of the set of two possible 214ae4a0502SMauro Carvalho Chehab'sections', we always look at the most recent one only (based on the sequence 215ae4a0502SMauro Carvalho Chehabnumber). The reconstruction rules/steps are simple: 216ae4a0502SMauro Carvalho Chehab 217ae4a0502SMauro Carvalho Chehab- Read map[log_entry.lba]. 218ae4a0502SMauro Carvalho Chehab- If log_entry.new matches the map entry, then log_entry.old is free. 219ae4a0502SMauro Carvalho Chehab- If log_entry.new does not match the map entry, then log_entry.new is free. 220ae4a0502SMauro Carvalho Chehab (This case can only be caused by power-fails/unsafe shutdowns) 221ae4a0502SMauro Carvalho Chehab 222ae4a0502SMauro Carvalho Chehab 223ae4a0502SMauro Carvalho Chehabg. Summarizing - Read and Write flows 224ae4a0502SMauro Carvalho Chehab------------------------------------- 225ae4a0502SMauro Carvalho Chehab 226ae4a0502SMauro Carvalho ChehabRead: 227ae4a0502SMauro Carvalho Chehab 228ae4a0502SMauro Carvalho Chehab1. Convert external LBA to arena number + pre-map ABA 229ae4a0502SMauro Carvalho Chehab2. Get a lane (and take lane_lock) 230ae4a0502SMauro Carvalho Chehab3. Read map to get the entry for this pre-map ABA 231ae4a0502SMauro Carvalho Chehab4. Enter post-map ABA into RTT[lane] 232ae4a0502SMauro Carvalho Chehab5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) 233ae4a0502SMauro Carvalho Chehab6. If ERROR flag set in map, end IO with EIO (go to step 8) 234ae4a0502SMauro Carvalho Chehab7. Read data from this block 235ae4a0502SMauro Carvalho Chehab8. Remove post-map ABA entry from RTT[lane] 236ae4a0502SMauro Carvalho Chehab9. Release lane (and lane_lock) 237ae4a0502SMauro Carvalho Chehab 238ae4a0502SMauro Carvalho ChehabWrite: 239ae4a0502SMauro Carvalho Chehab 240ae4a0502SMauro Carvalho Chehab1. Convert external LBA to Arena number + pre-map ABA 241ae4a0502SMauro Carvalho Chehab2. Get a lane (and take lane_lock) 242ae4a0502SMauro Carvalho Chehab3. Use lane to index into in-memory free list and obtain a new block, next flog 243ae4a0502SMauro Carvalho Chehab index, next sequence number 244ae4a0502SMauro Carvalho Chehab4. Scan the RTT to check if free block is present, and spin/wait if it is. 245ae4a0502SMauro Carvalho Chehab5. Write data to this free block 246ae4a0502SMauro Carvalho Chehab6. Read map to get the existing post-map ABA entry for this pre-map ABA 247ae4a0502SMauro Carvalho Chehab7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] 248ae4a0502SMauro Carvalho Chehab8. Write new post-map ABA into map. 249ae4a0502SMauro Carvalho Chehab9. Write old post-map entry into the free list 250ae4a0502SMauro Carvalho Chehab10. Calculate next sequence number and write into the free list entry 251ae4a0502SMauro Carvalho Chehab11. Release lane (and lane_lock) 252ae4a0502SMauro Carvalho Chehab 253ae4a0502SMauro Carvalho Chehab 254ae4a0502SMauro Carvalho Chehab4. Error Handling 255ae4a0502SMauro Carvalho Chehab================= 256ae4a0502SMauro Carvalho Chehab 257ae4a0502SMauro Carvalho ChehabAn arena would be in an error state if any of the metadata is corrupted 258ae4a0502SMauro Carvalho Chehabirrecoverably, either due to a bug or a media error. The following conditions 259ae4a0502SMauro Carvalho Chehabindicate an error: 260ae4a0502SMauro Carvalho Chehab 261ae4a0502SMauro Carvalho Chehab- Info block checksum does not match (and recovering from the copy also fails) 262ae4a0502SMauro Carvalho Chehab- All internal available blocks are not uniquely and entirely addressed by the 263ae4a0502SMauro Carvalho Chehab sum of mapped blocks and free blocks (from the BTT flog). 264ae4a0502SMauro Carvalho Chehab- Rebuilding free list from the flog reveals missing/duplicate/impossible 265ae4a0502SMauro Carvalho Chehab entries 266ae4a0502SMauro Carvalho Chehab- A map entry is out of bounds 267ae4a0502SMauro Carvalho Chehab 268ae4a0502SMauro Carvalho ChehabIf any of these error conditions are encountered, the arena is put into a read 269ae4a0502SMauro Carvalho Chehabonly state using a flag in the info block. 270ae4a0502SMauro Carvalho Chehab 271ae4a0502SMauro Carvalho Chehab 272ae4a0502SMauro Carvalho Chehab5. Usage 273ae4a0502SMauro Carvalho Chehab======== 274ae4a0502SMauro Carvalho Chehab 275ae4a0502SMauro Carvalho ChehabThe BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem 276ae4a0502SMauro Carvalho Chehab(pmem, or blk mode). The easiest way to set up such a namespace is using the 277ae4a0502SMauro Carvalho Chehab'ndctl' utility [1]: 278ae4a0502SMauro Carvalho Chehab 279ae4a0502SMauro Carvalho ChehabFor example, the ndctl command line to setup a btt with a 4k sector size is:: 280ae4a0502SMauro Carvalho Chehab 281ae4a0502SMauro Carvalho Chehab ndctl create-namespace -f -e namespace0.0 -m sector -l 4k 282ae4a0502SMauro Carvalho Chehab 283ae4a0502SMauro Carvalho ChehabSee ndctl create-namespace --help for more options. 284ae4a0502SMauro Carvalho Chehab 285ae4a0502SMauro Carvalho Chehab[1]: https://github.com/pmem/ndctl 286