===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

linux-nvdimm@lists.01.org

Version 13

.. contents:

  Glossary
  Overview
    Supporting Documents
    Git Trees
  LIBNVDIMM PMEM and BLK
  Why BLK?
    PMEM vs BLK
      BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
  Example NVDIMM Platform
  LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
    LIBNDCTL: Context
      libndctl: instantiate a new library context example
    LIBNVDIMM/LIBNDCTL: Bus
      libnvdimm: control class device in /sys/class
      libnvdimm: bus
      libndctl: bus enumeration example
    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
      libnvdimm: DIMM (NMEM)
      libndctl: DIMM enumeration example
    LIBNVDIMM/LIBNDCTL: Region
      libnvdimm: region
      libndctl: region enumeration example
      Why Not Encode the Region Type into the Region Name?
      How Do I Determine the Major Type of a Region?
    LIBNVDIMM/LIBNDCTL: Namespace
      libnvdimm: namespace
      libndctl: namespace enumeration example
      libndctl: namespace creation example
      Why the Term "namespace"?
    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
      libnvdimm: btt layout
      libndctl: btt creation example
    Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent. A
  block device composed of PMEM is capable of DAX. A PMEM address range
  may span an interleave of several DIMMs.

BLK:
  A set of one or more programmable memory mapped apertures provided
  by a DIMM to access its media. This indirection precludes the
  performance benefit of interleaving, but enables DIMM-bounded failure
  modes.

DPA:
  DIMM Physical Address: a DIMM-relative offset. With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address. BLK capacity always has a 1:1 relationship
  with a single-DIMM's DPA range.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: ACPI method to control a specific
  device - in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes. The BTT is an indirection
  table with atomic update semantics to front a PMEM/BLK block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) storage between PMEM and BLK. It also partitions
  BLK storage to host BTTs with different parameters per BLK-partition.
  Note that traditional partition tables, GPT/MBR, are layered on top of a
  BLK or PMEM device.
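As a userspace illustration of the DAX entry above, here is a minimal
sketch of mapping a file on a DAX-capable filesystem and storing to the
media directly. The mount point and file name are hypothetical, and
error handling is abbreviated::

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static int dax_example(void)
  {
          /* hypothetical pre-existing file on a DAX-mounted filesystem */
          int fd = open("/mnt/pmem/example", O_RDWR);
          char *p;

          if (fd < 0)
                  return -1;

          /* loads/stores now reach the media without the page cache */
          p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  close(fd);
                  return -1;
          }

          strcpy(p, "persistent data");
          /* flush the store out of the CPU caches to make it durable */
          msync(p, 4096, MS_SYNC);

          munmap(p, 4096);
          close(fd);
          return 0;
  }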
Overview
========

The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access. These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities needed to support this ACPI 6 definition
for NVDIMM resources. The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK. When that occurs a LABEL is needed to reserve DPA
for exclusive access via one mode at a time.

Supporting Documents
--------------------

ACPI 6:
  https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
  https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
  https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
  https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
  https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
LIBNDCTL:
  https://github.com/pmem/ndctl.git
PMEM:
  https://github.com/01org/prd


LIBNVDIMM PMEM and BLK
======================

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss. Now, the NFIT
specification standardizes not only the description of PMEM, but also
BLK and platform message-passing entry points for control and
configuration.

For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
device driver:

  1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
     range is contiguous in system memory and may be interleaved (hardware
     memory controller striped) across multiple DIMMs. When interleaved the
     platform may optionally provide details of which DIMMs are participating
     in the interleave.

     Note that while LIBNVDIMM describes system-physical-address ranges that may
     alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
     alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
     distinction. The different device-types are an implementation detail
     that userspace can exploit to implement policies like "only interface
     with address ranges from certain DIMMs". It is worth noting that when
     aliasing is present and a DIMM lacks a label, then no block device can
     be created by default as userspace needs to do at least one allocation
     of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
     registered, can be immediately attached to nd_pmem.

  2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
     defined apertures. A set of apertures will access just one DIMM.
     Multiple windows (apertures) allow multiple concurrent accesses, much like
     tagged-command-queuing, and would likely be used by different threads or
     different CPUs.

     The NFIT specification defines a standard format for a BLK-aperture, but
     the spec also allows for vendor specific layouts, and non-NFIT BLK
     implementations may have other designs for BLK I/O. For this reason
     "nd_blk" calls back into platform-specific code to perform the I/O.

     One such implementation is defined in the "Driver Writer's Guide" and "DSM
     Interface Example"; a conceptual sketch of the aperture indirection
     follows below.
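To make the aperture indirection concrete, here is a conceptual sketch
of a BLK read. Every name below is hypothetical, not the actual nd_blk
data structures; a real implementation follows the register layout
defined by the NFIT or the vendor's DSM Interface Example::

  #include <linux/io.h>
  #include <linux/minmax.h>
  #include <linux/types.h>

  /* hypothetical aperture description, for illustration only */
  struct blk_window {
          void __iomem *ctl;      /* control register: holds target DPA */
          void __iomem *aperture; /* fixed-size memory mapped data window */
          size_t win_size;
  };

  static void blk_aperture_read(struct blk_window *w, void *buf,
                  u64 dpa, size_t len)
  {
          while (len) {
                  size_t copy = min(len, w->win_size);

                  /* 1/ steer the window at the target media address */
                  writeq(dpa, w->ctl);
                  /* 2/ data then flows through the aperture */
                  memcpy_fromio(buf, w->aperture, copy);
                  dpa += copy;
                  buf += copy;
                  len -= copy;
          }
  }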
Why BLK?
========

While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (recovery,
availability, and serviceability) model. An access to a corrupted
system-physical-address causes a CPU exception while an access
to a corrupted address through a BLK-aperture causes that block window
to raise an error status in a register. The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.

Also, if an administrator ever wants to replace memory it is easier to
service a system at DIMM module boundaries. Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.

PMEM vs BLK
-----------

BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem. They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture. Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results. For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions. For simplicity a DIMM is allowed one PMEM "region" per
interleave set in which it is a member. The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.

BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the few reasons to allow multiple BLK namespaces per REGION is so
that each BLK-namespace can be configured with a BTT with a unique
atomic sector size. While a PMEM device can host a BTT, the LABEL
specification does not provide for a sector size to be specified for a
PMEM namespace.

This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX. However, for the cases
where an application or filesystem still needs atomic sector update
guarantees it can register a BTT on a PMEM device or partition. See
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".


Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::

                                  (a)               (b)          DIMM   BLK-REGION
               +-------------------+--------+--------+--------+
     +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
     | imc0 +--+- - - region0- - - +--------+        +--------+
     +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
        |      +-------------------+--------v        v--------+
     +--+---+                               |                 |
     | cpu0 |                                     region1
     +--+---+                               |                 |
        |      +----------------------------^        ^--------+
     +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
     | imc1 +--+----------------------------|        +--------+
     +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
               +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket. Each unique interface (BLK or PMEM) to DPA space is identified
by a region device with a dynamically assigned id (REGION0 - REGION5).

  1. The first portion of DIMM0 and DIMM1 is interleaved as REGION0. A
     single PMEM namespace is created in the REGION0-SPA-range that spans most
     of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
     interleaved system-physical-address range is reclaimed as BLK-aperture
     accessed space starting at DPA-offset (a) into each DIMM. In that
     reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
     REGION3 where "blk2.0" and "blk3.0" are just human readable names that
     could be set to any user-desired name in the LABEL.

  2. In the last portion of DIMM0 and DIMM1 we have an interleaved
     system-physical-address range, REGION1, that spans those two DIMMs as
     well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
     named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (one
     for each DIMM in the interleave set): "blk2.1", "blk3.1", "blk4.0", and
     "blk5.0".

  3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
     interleaved system-physical-address range (i.e. the DPA addresses past
     offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
     Note that this example shows that BLK-aperture namespaces don't need to
     be contiguous in DPA-space.

This bus is provided by the kernel under the device
/sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
tools/testing/nvdimm is loaded. This not only tests LIBNVDIMM but the
acpi_nfit.ko driver as well.
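The region-to-DIMM associations in the diagram can also be recovered at
runtime by walking each region's constituent <DIMM, DPA offset, length>
mappings (described further in the Region section below). A sketch;
ndctl_mapping_get_offset() and ndctl_mapping_get_length() are assumed to
be available in the installed libndctl, the remaining calls appear
elsewhere in this document::

  #include <stdio.h>
  #include <ndctl/libndctl.h>

  static void dump_region_mappings(struct ndctl_bus *bus)
  {
          struct ndctl_region *region;
          struct ndctl_mapping *map;

          ndctl_region_foreach(bus, region) {
                  printf("region%u:\n", ndctl_region_get_id(region));
                  /* one <DIMM, DPA-start-offset, length> tuple per line */
                  ndctl_mapping_foreach(region, map) {
                          struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

                          printf("\tdimm-handle %#x: %#llx + %#llx\n",
                                          ndctl_dimm_get_handle(dimm),
                                          ndctl_mapping_get_offset(map),
                                          ndctl_mapping_get_length(map));
                  }
          }
  }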
LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API. The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:

  https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

  struct ndctl_ctx *ctx;

  if (ndctl_new(&ctx) == 0)
          return ctx;
  else
          return NULL;
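The context is reference counted. A full lifecycle sketch, pairing the
allocation above with its release (ndctl_unref() drops the reference
taken by ndctl_new())::

  #include <ndctl/libndctl.h>

  int main(void)
  {
          struct ndctl_ctx *ctx;

          if (ndctl_new(&ctx) != 0)
                  return 1;

          /* ... enumerate busses / dimms / regions / namespaces ... */

          ndctl_unref(ctx);
          return 0;
  }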
LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it. The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

  /sys/class/nd/ndctl0
  |-- dev
  |-- device -> ../../../ndbus0
  |-- subsystem -> ../../../../../../../class/nd


LIBNVDIMM: bus
--------------

::

  struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
          struct nvdimm_bus_descriptor *nfit_desc);

::

  /sys/devices/platform/nfit_test.0/ndbus0
  |-- commands
  |-- nd
  |-- nfit
  |-- nmem0
  |-- nmem1
  |-- nmem2
  |-- nmem3
  |-- power
  |-- provider
  |-- region0
  |-- region1
  |-- region2
  |-- region3
  |-- region4
  |-- region5
  |-- uevent
  `-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus from Example NVDIMM Platform::

  static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
                  const char *provider)
  {
          struct ndctl_bus *bus;

          ndctl_bus_foreach(ctx, bus)
                  if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
                          return bus;

          return NULL;
  }

  bus = get_bus_by_provider(ctx, "nfit_test.0");


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs. If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

  struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
          const struct attribute_group **groups, unsigned long flags,
          unsigned long *dsm_mask);

::

  /sys/devices/platform/nfit_test.0/ndbus0
  |-- nmem0
  |   |-- available_slots
  |   |-- commands
  |   |-- dev
  |   |-- devtype
  |   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
  |   |-- modalias
  |   |-- nfit
  |   |   |-- device
  |   |   |-- format
  |   |   |-- handle
  |   |   |-- phys_id
  |   |   |-- rev_id
  |   |   |-- serial
  |   |   `-- vendor
  |   |-- state
  |   |-- subsystem -> ../../../../../bus/nd
  |   `-- uevent
  |-- nmem1
  [..]
LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle", a 32-bit value where:

  - Bits 3:0 DIMM number within the memory channel
  - Bits 7:4 memory channel number
  - Bits 11:8 memory controller ID
  - Bits 15:12 socket ID (within scope of a Node controller if node
    controller is present)
  - Bits 27:16 Node Controller ID
  - Bits 31:28 Reserved

::

  static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
                  unsigned int handle)
  {
          struct ndctl_dimm *dimm;

          ndctl_dimm_foreach(bus, dimm)
                  if (ndctl_dimm_get_handle(dimm) == handle)
                          return dimm;

          return NULL;
  }

  #define DIMM_HANDLE(n, s, i, c, d) \
          (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
           | ((c & 0xf) << 4) | (d & 0xf))

  dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
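Going the other direction, unpacking a handle read from sysfs into its
fields can aid debugging. A sketch in plain C, mirroring the bit layout
above::

  #include <stdio.h>

  static void decode_dimm_handle(unsigned int handle)
  {
          printf("node: %u socket: %u imc: %u channel: %u dimm: %u\n",
                          (handle >> 16) & 0xfff,  /* node controller id */
                          (handle >> 12) & 0xf,    /* socket id */
                          (handle >> 8) & 0xf,     /* memory controller id */
                          (handle >> 4) & 0xf,     /* channel number */
                          handle & 0xf);           /* dimm number */
  }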
LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each PMEM range or BLK-aperture
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
sets on the "nfit_test.0" bus. The primary role of a region is to be a
container of "mappings". A mapping is a tuple of <DIMM,
DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for these REGION devices. This driver
is responsible for reconciling the aliased DPA mappings across all
regions, parsing the LABEL, if present, and then emitting NAMESPACE
devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
nd_blk device driver to consume.

In addition to the generic attributes of "mapping"s, "interleave_ways",
and "size", the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.

LIBNVDIMM: region::

  struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
          struct nd_region_desc *ndr_desc);
  struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
          struct nd_region_desc *ndr_desc);

::

  /sys/devices/platform/nfit_test.0/ndbus0
  |-- region0
  |   |-- available_size
  |   |-- btt0
  |   |-- btt_seed
  |   |-- devtype
  |   |-- driver -> ../../../../../bus/nd/drivers/nd_region
  |   |-- init_namespaces
  |   |-- mapping0
  |   |-- mapping1
  |   |-- mappings
  |   |-- modalias
  |   |-- namespace0.0
  |   |-- namespace_seed
  |   |-- numa_node
  |   |-- nfit
  |   |   `-- spa_index
  |   |-- nstype
  |   |-- set_cookie
  |   |-- size
  |   |-- subsystem -> ../../../../../bus/nd
  |   `-- uevent
  |-- region1
  [..]

LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
BLK::

  static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
                  unsigned int spa_index)
  {
          struct ndctl_region *region;

          ndctl_region_foreach(bus, region) {
                  if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
                          continue;
                  if (ndctl_region_get_spa_index(region) == spa_index)
                          return region;
          }
          return NULL;
  }

  static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
                  unsigned int handle)
  {
          struct ndctl_region *region;

          ndctl_region_foreach(bus, region) {
                  struct ndctl_mapping *map;

                  if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
                          continue;
                  ndctl_mapping_foreach(region, map) {
                          struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

                          if (ndctl_dimm_get_handle(dimm) == handle)
                                  return region;
                  }
          }
          return NULL;
  }


Why Not Encode the Region Type into the Region Name?
----------------------------------------------------

At first glance it seems that, since the NFIT defines just PMEM and BLK
interface types, we should simply name REGION devices with something
derived from those type names. However, the ND subsystem explicitly
keeps the REGION name generic and expects userspace to always consider
the region-attributes for four reasons:

  1. There are already more than two REGION and "namespace" types. For
     PMEM there are two subtypes. As mentioned previously we have PMEM where
     the constituent DIMM devices are known, and anonymous PMEM. For BLK
     regions the NFIT specification already anticipates vendor specific
     implementations. The exact distinction of what a region contains is in
     the region-attributes not the region-name or the region-devtype.

  2. A region with zero child-namespaces is a possible configuration. For
     example, the NFIT allows for a DCR to be published without a
     corresponding BLK-aperture. This equates to a DIMM that can only accept
     control/configuration messages, but no I/O through a descendant block
     device. Again, this "type" is advertised in the attributes ('mappings'
     == 0) and the name does not tell you much.

  3. What if a third major interface type arises in the future? Outside
     of vendor specific implementations, it's not difficult to envision a
     third class of interface type beyond BLK and PMEM. With a generic name
     for the REGION level of the device-hierarchy old userspace
     implementations can still make sense of new kernel advertised
     region-types. Userspace can always rely on the generic region
     attributes like "mappings", "size", etc and the expected child devices
     named "namespace". This generic format of the device-model hierarchy
     allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
     future-proof.

  4. There are more robust mechanisms for determining the major type of a
     region than a device name. See the next section, How Do I Determine the
     Major Type of a Region?

How Do I Determine the Major Type of a Region?
----------------------------------------------

Outside of the blanket recommendation of "use libndctl", or simply
looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
"nstype" integer attribute, here are some other options.

1. module alias lookup
^^^^^^^^^^^^^^^^^^^^^^

  The whole point of region/namespace device type differentiation is to
  decide which block-device driver will attach to a given LIBNVDIMM namespace.
  One can simply use the modalias to look up the resulting module. It's
  important to note that this method is robust in the presence of a
  vendor-specific driver down the road. If a vendor-specific
  implementation wants to supplant the standard nd_blk driver it can with
  minimal impact to the rest of LIBNVDIMM.

  In fact, a vendor may also want to have a vendor-specific region-driver
  (outside of nd_region). For example, if a vendor defined its own LABEL
  format it would need its own region driver to parse that LABEL and emit
  the resulting namespaces. The output from module resolution is more
  accurate than a region-name or region-devtype.
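  For example, with the "nd:t2" modalias shown in the udev output of the
  next section, the lookup can be done from the command line with
  "modprobe --resolve-alias nd:t2", or programmatically with libkmod. A
  sketch of the latter::

    #include <stdio.h>
    #include <libkmod.h>

    static void resolve_driver(const char *alias /* e.g. "nd:t2" */)
    {
            struct kmod_ctx *kmod = kmod_new(NULL, NULL);
            struct kmod_list *list = NULL, *itr;

            if (!kmod)
                    return;
            if (kmod_module_new_from_lookup(kmod, alias, &list) == 0) {
                    kmod_list_foreach(itr, list) {
                            struct kmod_module *mod = kmod_module_get_module(itr);

                            printf("%s -> %s\n", alias, kmod_module_get_name(mod));
                            kmod_module_unref(mod);
                    }
                    kmod_module_unref_list(list);
            }
            kmod_unref(kmod);
    }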
2. udev
^^^^^^^

  The kernel "devtype" is registered in the udev database::

    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
    P: /devices/platform/nfit_test.0/ndbus0/region0
    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
    E: DEVTYPE=nd_pmem
    E: MODALIAS=nd:t2
    E: SUBSYSTEM=nd

    # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
    P: /devices/platform/nfit_test.0/ndbus0/region4
    E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
    E: DEVTYPE=nd_blk
    E: MODALIAS=nd:t3
    E: SUBSYSTEM=nd

  ...and is available as a region attribute, but keep in mind that the
  "devtype" does not indicate sub-type variations, so scripts should
  really be examining the other attributes.

3. type specific attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

  As it currently stands a BLK-aperture region will never have an
  "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
  BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
  that does not allow I/O. A PMEM region with a "mappings" value of zero
  is a simple system-physical-address range.
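A sketch of this classification in C, reading the attributes straight
out of sysfs (the region path would come from enumerating
/sys/bus/nd/devices; error handling is abbreviated)::

  #include <stdio.h>

  /* an NFIT-defined PMEM region carries an nfit/spa_index attribute */
  static int is_nfit_pmem_region(const char *region_path)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "%s/nfit/spa_index", region_path);
          f = fopen(path, "r");
          if (!f)
                  return 0;
          fclose(f);
          return 1;
  }

  /* "mappings" == 0 means no I/O capable descendants, per above */
  static int region_mappings(const char *region_path)
  {
          char path[256];
          int mappings = -1;
          FILE *f;

          snprintf(path, sizeof(path), "%s/mappings", region_path);
          f = fopen(path, "r");
          if (f) {
                  if (fscanf(f, "%d", &mappings) != 1)
                          mappings = -1;
                  fclose(f);
          }
          return mappings;
  }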
LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices. The arrival of a "namespace"
device currently triggers either the nd_blk or nd_pmem driver to load
and register a disk/block device.

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the three major types of NAMESPACE where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), namespace2.0 represents a BLK namespace (note that it has a
'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
namespace (note that it has no 'uuid' attribute because it does not
support a LABEL)::

  /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
  |-- alt_name
  |-- devtype
  |-- dpa_extents
  |-- force_raw
  |-- modalias
  |-- numa_node
  |-- resource
  |-- size
  |-- subsystem -> ../../../../../../bus/nd
  |-- type
  |-- uevent
  `-- uuid
  /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
  |-- alt_name
  |-- devtype
  |-- dpa_extents
  |-- force_raw
  |-- modalias
  |-- numa_node
  |-- sector_size
  |-- size
  |-- subsystem -> ../../../../../../bus/nd
  |-- type
  |-- uevent
  `-- uuid
  /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
  |-- block
  |   `-- pmem0
  |-- devtype
  |-- driver -> ../../../../../../bus/nd/drivers/pmem
  |-- force_raw
  |-- modalias
  |-- numa_node
  |-- resource
  |-- size
  |-- subsystem -> ../../../../../../bus/nd
  |-- type
  `-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their parent region; see the example
below. These indexes are mostly static from boot to boot, but the
subsystem makes no guarantees in this regard. For a static namespace
identifier use its 'uuid' attribute.

::

  static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
                  unsigned int id)
  {
          struct ndctl_namespace *ndns;

          ndctl_namespace_foreach(region, ndns)
                  if (ndctl_namespace_get_id(ndns) == id)
                          return ndns;

          return NULL;
  }

LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it. For the most part the setting of namespace attributes
can occur in any order, the only constraint is that 'uuid' must be set
before 'size'. This enables the kernel to track DPA allocations
internally with a static identifier::

  static int configure_namespace(struct ndctl_region *region,
                  struct ndctl_namespace *ndns,
                  struct namespace_parameters *parameters)
  {
          char devname[50];

          snprintf(devname, sizeof(devname), "namespace%d.%d",
                          ndctl_region_get_id(region), parameters->id);

          ndctl_namespace_set_alt_name(ndns, devname);
          /* 'uuid' must be set prior to setting size! */
          ndctl_namespace_set_uuid(ndns, parameters->uuid);
          ndctl_namespace_set_size(ndns, parameters->size);
          /* unlike pmem namespaces, blk namespaces have a sector size */
          if (parameters->lbasize)
                  ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
          return ndctl_namespace_enable(ndns);
  }
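The "finding an idle namespace" step can mirror the get_idle_btt()
example later in this document. A sketch, assuming an idle (seed)
namespace is one that is disabled and reports a zero size, and that
ndctl_namespace_is_enabled() is available in the installed libndctl::

  static struct ndctl_namespace *get_idle_namespace(struct ndctl_region *region)
  {
          struct ndctl_namespace *ndns;

          ndctl_namespace_foreach(region, ndns)
                  if (!ndctl_namespace_is_enabled(ndns)
                                  && ndctl_namespace_get_size(ndns) == 0)
                          return ndns;

          return NULL;
  }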
Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

  1. Why not "volume" for instance? "volume" ran the risk of confusing
     ND (the libnvdimm subsystem) with a volume manager like device-mapper.

  2. The term originated to describe the sub-devices that can be created
     within an NVME controller (see the nvme specification:
     https://www.nvmexpress.org/specifications/), and NFIT namespaces are
     meant to parallel the capabilities and configurability of
     NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a stacked
block device driver that fronts either the whole block device or a
partition of a block device emitted by either a PMEM or BLK NAMESPACE.

LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device, which is the
seed device. To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem or
nd_blk driver depending on the region type::

  /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
  |-- namespace
  |-- delete
  |-- devtype
  |-- modalias
  |-- numa_node
  |-- sector_size
  |-- subsystem -> ../../../../../bus/nd
  |-- uevent
  `-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces an idle BTT device is automatically created per
region. Each time this "seed" btt device is configured and enabled a new
seed is created. Creating a BTT configuration involves two steps:
finding an idle BTT and assigning it to consume a PMEM or BLK
namespace::

  static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
  {
          struct ndctl_btt *btt;

          ndctl_btt_foreach(region, btt)
                  if (!ndctl_btt_is_enabled(btt)
                                  && !ndctl_btt_is_configured(btt))
                          return btt;

          return NULL;
  }

  static int configure_btt(struct ndctl_region *region,
                  struct btt_parameters *parameters)
  {
          struct ndctl_btt *btt = get_idle_btt(region);

          ndctl_btt_set_uuid(btt, parameters->uuid);
          ndctl_btt_set_sector_size(btt, parameters->sector_size);
          ndctl_btt_set_namespace(btt, parameters->ndns);
          /* turn off raw mode device */
          ndctl_namespace_disable(parameters->ndns);
          /* turn on btt access */
          return ndctl_btt_enable(btt);
  }

Once instantiated a new inactive btt seed device will appear underneath
the region.

Once a "namespace" is removed from a BTT that instance of the BTT device
will be deleted or otherwise reset to default values. This deletion is
only at the device model level. In order to destroy a BTT the "info
block" needs to be destroyed. Note that to destroy a BTT the media
needs to be written in raw mode. By default, the kernel will autodetect
the presence of a BTT and disable raw mode. This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.
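A sketch of that destruction sequence. It assumes the BTT info block
lives at the start of the namespace (consult the BTT design document for
the authoritative layout), that the raw namespace surfaces as the
hypothetical block device /dev/pmem0, and that
ndctl_namespace_set_raw_mode() takes a boolean second argument — all
assumptions to verify against your configuration::

  #include <fcntl.h>
  #include <unistd.h>
  #include <ndctl/libndctl.h>

  static int destroy_btt_info(struct ndctl_namespace *ndns, const char *raw_dev)
  {
          char zeros[4096] = { 0 };
          int fd, rc = 0;

          /* re-probe the namespace with BTT autodetect suppressed */
          ndctl_namespace_disable(ndns);
          ndctl_namespace_set_raw_mode(ndns, 1);
          ndctl_namespace_enable(ndns);

          /* overwrite the info block through the raw device, e.g. /dev/pmem0 */
          fd = open(raw_dev, O_WRONLY);
          if (fd < 0)
                  return -1;
          if (pwrite(fd, zeros, sizeof(zeros), 0) != sizeof(zeros))
                  rc = -1;
          close(fd);
          return rc;
  }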
Summary LIBNDCTL Diagram
------------------------

For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::

                +---+
                |CTX|    +---------+   +--------------+  +---------------+
                +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
                  |    | +---------+   +--------------+  +---------------+
    +-------+     |    | +---------+   +--------------+  +---------------+
    | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
    +-------+ |   |    | +---------+   +--------------+  +---------------+
    | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
    +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0"  |
    | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
    +-------+ |        |             +-> NAMESPACE2.1 +--> ND5 "blk2.1"  | BTT2 |
    | DIMM3 <-+        |               +--------------+  +----------------------+
    +-------+          | +---------+   +--------------+  +---------------+
                       +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0"  |
                       | +---------+ | +--------------+  +----------------------+
                       |             +-> NAMESPACE3.1 +--> ND3 "blk3.1"  | BTT1 |
                       |               +--------------+  +----------------------+
                       | +---------+   +--------------+  +---------------+
                       +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0"  |
                       | +---------+   +--------------+  +---------------+
                       | +---------+   +--------------+  +----------------------+
                       +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0"  | BTT0 |
                         +---------+   +--------------+  +---------------+------+