.. SPDX-License-Identifier: GPL-2.0

====================
CXL Driver Operation
====================

The devices described in this section are present in ::

  /sys/bus/cxl/devices/
  /dev/cxl/

The :code:`cxl-cli` utility, maintained as part of the NDCTL project, may
be used to script interactions with these devices.

Drivers
=======
The CXL driver is split into a number of drivers.

* cxl_core - fundamental init interface and core object creation
* cxl_port - initializes root and provides port enumeration interface
* cxl_acpi - initializes root decoders and interacts with ACPI data
* cxl_p/mem - initializes memory devices
* cxl_pci - uses cxl_port to enumerate the actual fabric hierarchy

Driver Devices
==============
Here is an example from a single-socket system with 4 host bridges. Two host
bridges have a single memory device attached, and the devices are interleaved
into a single memory region. The memory region has been converted to dax. ::

  # ls /sys/bus/cxl/devices/
  dax_region0  decoder3.0  decoder6.0  mem0   port3
  decoder0.0   decoder4.0  decoder6.1  mem1   port4
  decoder1.0   decoder5.0  endpoint5   port1  region0
  decoder2.0   decoder5.1  endpoint6   port2  root0


.. kernel-render:: DOT
   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Digraph of CXL fabric with a host-bridge interleave memory region

   digraph foo {
     "root0" -> "port1";
     "root0" -> "port3";
     "root0" -> "decoder0.0";
     "port1" -> "endpoint5";
     "port3" -> "endpoint6";
     "port1" -> "decoder1.0";
     "port3" -> "decoder3.0";
     "endpoint5" -> "decoder5.0";
     "endpoint6" -> "decoder6.0";
     "decoder0.0" -> "region0";
     "decoder0.0" -> "decoder1.0";
     "decoder0.0" -> "decoder3.0";
     "decoder1.0" -> "decoder5.0";
     "decoder3.0" -> "decoder6.0";
     "decoder5.0" -> "region0";
     "decoder6.0" -> "region0";
     "region0" -> "dax_region0";
     "dax_region0" -> "dax0.0";
   }

This section explores the devices present in this configuration. Other
configurations are covered in depth in the example configurations below.

Base Devices
------------
Most devices in a CXL fabric are a `port` of some kind (because each
device mostly routes requests from one device to the next, rather than
providing a direct service).

Root
~~~~
The `CXL Root` is a logical object created by the `cxl_acpi` driver during
:code:`cxl_acpi_probe` - if the :code:`ACPI0017` `Compute Express Link
Root Object` Device Class is found.

The Root contains links to:

* `Host Bridge Ports` defined by CHBS in the :doc:`CEDT<../platform/acpi/cedt>`

* `Downstream Ports` typically connected to `Host Bridge Ports`.

* `Root Decoders` defined by CFMWS in the :doc:`CEDT<../platform/acpi/cedt>`

::

  # ls /sys/bus/cxl/devices/root0
  decoder0.0          dport0  dport5    port2  subsystem
  decoders_committed  dport1  modalias  port3  uevent
  devtype             dport4  port1     port4  uport

  # cat /sys/bus/cxl/devices/root0/devtype
  cxl_port

  # cat port1/devtype
  cxl_port

  # cat decoder0.0/devtype
  cxl_decoder_root

The root is the first `logical port` in the CXL fabric, as presented by the
Linux CXL driver.
The `CXL root` is a special type of `switch port`, in that it
only has downstream port connections.

Port
~~~~
A `port` object is better described as a `switch port`. It may represent a
host bridge to the root or an actual switch port on a switch. A `switch port`
contains one or more decoders used to route memory requests to downstream
ports, which may be connected to another `switch port` or an `endpoint port`.

::

  # ls /sys/bus/cxl/devices/port1
  decoder1.0          dport0    driver     parent_dport  uport
  decoders_committed  dport113  endpoint5  subsystem
  devtype             dport2    modalias   uevent

  # cat devtype
  cxl_port

  # cat decoder1.0/devtype
  cxl_decoder_switch

  # cat endpoint5/devtype
  cxl_port

CXL `Host Bridges` in the fabric are probed during :code:`cxl_acpi_probe` at
the time the `CXL Root` is probed. This allows for the immediate logical
connection between the root and host bridge.

* The root has a downstream port connection to a host bridge.

* The host bridge has an upstream port connection to the root.

* The host bridge has one or more downstream port connections to switch
  or endpoint ports.

A `Host Bridge` is a special type of CXL `switch port`. It is explicitly
defined in the ACPI specification via the `ACPI0016` ID. `Host Bridge` ports
will be probed at :code:`cxl_acpi_probe` time, while similar ports on an
actual switch will be probed later. Otherwise, switch and host bridge ports
look very similar - they both contain switch decoders which route accesses
between upstream and downstream ports.

Endpoint
~~~~~~~~
An `endpoint` is a terminal port in the fabric. This is a `logical device`,
and may be one of many `logical devices` presented by a memory device. It
is still considered a type of `port` in the fabric.


An `endpoint` contains `endpoint decoders` and the device's Coherent Device
Attribute Table (which describes the device's capabilities). ::

  # ls /sys/bus/cxl/devices/endpoint5
  CDAT        decoders_committed  modalias      uevent
  decoder5.0  devtype             parent_dport  uport
  decoder5.1  driver              subsystem

  # cat /sys/bus/cxl/devices/endpoint5/devtype
  cxl_port

  # cat /sys/bus/cxl/devices/endpoint5/decoder5.0/devtype
  cxl_decoder_endpoint


Memory Device (memdev)
~~~~~~~~~~~~~~~~~~~~~~
A `memdev` is probed and added by the `cxl_pci` driver in :code:`cxl_pci_probe`
and is managed by the `cxl_mem` driver. It primarily provides the `IOCTL`
interface to a memory device, via :code:`/dev/cxl/memN`, and exposes various
device configuration data. ::

  # ls /sys/bus/cxl/devices/mem0
  dev       firmware_version    payload_max  security   uevent
  driver    label_storage_size  pmem         serial
  firmware  numa_node           ram          subsystem

A Memory Device is a discrete base object that is not a port. While the
physical device it belongs to may also host an `endpoint`, the relationship
between an `endpoint` and a `memdev` is not captured in sysfs.

Port Relationships
~~~~~~~~~~~~~~~~~~
In the example described above, there are four host bridges attached to the
root, and two of the host bridges have one endpoint attached.

.. kernel-render:: DOT
   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Digraph of CXL fabric with a host-bridge interleave memory region

   digraph foo {
     "root0" -> "port1";
     "root0" -> "port2";
     "root0" -> "port3";
     "root0" -> "port4";
     "port1" -> "endpoint5";
     "port3" -> "endpoint6";
   }

Decoders
--------
A `Decoder` is short for a CXL Host-Managed Device Memory (HDM) Decoder.
It is a device that routes accesses through the CXL fabric to an endpoint, and
at the endpoint translates a `Host Physical Address` to a `Device Physical
Address`.

The CXL 3.1 specification heavily implies that only endpoint decoders should
engage in translation of `Host Physical Address` to `Device Physical Address`.
::

  8.2.4.20 CXL HDM Decoder Capability Structure

  IMPLEMENTATION NOTE
  CXL Host Bridge and Upstream Switch Port Decode Flow

  IMPLEMENTATION NOTE
  Device Decode Logic

These notes imply that there are two logical groups of decoders.

* Routing Decoder - a decoder which routes accesses but does not translate
  addresses from HPA to DPA.

* Translating Decoder - a decoder which translates accesses from HPA to DPA
  for an endpoint to service.

The CXL drivers distinguish 3 decoder types: root, switch, and endpoint. Only
endpoint decoders are Translating Decoders; all others are Routing Decoders.

.. note:: PLATFORM VENDORS BE AWARE

   Linux makes a strong assumption that endpoint decoders are the only decoder
   in the fabric that actively translates HPA to DPA. Linux assumes routing
   decoders pass the HPA unchanged to the next decoder in the fabric.

   It is therefore assumed that any given decoder in the fabric will have an
   address range that is a subset of its upstream port decoder. Any deviation
   from this scheme is undefined per the specification. Linux prioritizes
   spec-defined / architectural behavior.

Decoders may have one or more `Downstream Targets` if configured to interleave
memory accesses. This will be presented in sysfs via the :code:`target_list`
parameter.

Root Decoder
~~~~~~~~~~~~
A `Root Decoder` is a logical construct of the physical address and interleave
configurations present in the CFMWS field of the :doc:`CEDT
<../platform/acpi/cedt>`.
Linux presents this information as a decoder present in the `CXL Root`. We
consider this a `Root Decoder`, though technically it exists on the boundary
of the CXL specification and platform-specific CXL root implementations.

Linux considers these logical decoders a type of `Routing Decoder`. They are
the first decoders in the CXL fabric to receive a memory access from the
platform's memory controllers.

`Root Decoders` are created during :code:`cxl_acpi_probe`. One root decoder
is created per CFMWS entry in the :doc:`CEDT <../platform/acpi/cedt>`.

The :code:`target_list` parameter is filled by the CFMWS target fields. Targets
of a root decoder are `Host Bridges`, which means interleave done at the root
decoder level is an `Inter-Host-Bridge Interleave`.

Only root decoders are capable of `Inter-Host-Bridge Interleave`.

Such interleaves must be configured by the platform and described in the ACPI
CEDT CFMWS, as the target CXL host bridge UIDs in the CFMWS must match the CXL
host bridge UIDs in the CHBS field of the :doc:`CEDT
<../platform/acpi/cedt>` and the UID field of CXL Host Bridges defined in
the :doc:`DSDT <../platform/acpi/dsdt>`.

Interleave settings in a root decoder describe how to interleave accesses among
the *immediate downstream targets*, not the entire interleave set.


The memory range described in the root decoder is used to

1) Create a memory region (:code:`region0` in this example), and

2) Associate the region with an IO Memory Resource (:code:`kernel/resource.c`)

::

  # ls /sys/bus/cxl/devices/decoder0.0/
  cap_pmem           devtype                 region0
  cap_ram            interleave_granularity  size
  cap_type2          interleave_ways         start
  cap_type3          locked                  subsystem
  create_ram_region  modalias                target_list
  delete_region      qos_class               uevent

  # cat /sys/bus/cxl/devices/decoder0.0/region0/resource
  0xc050000000

The IO Memory Resource is created during early boot when the CFMWS region is
identified in the EFI Memory Map or E820 table (on x86).

Root decoders are defined as a separate devtype, but are also a type
of `Switch Decoder` due to having downstream targets. ::

  # cat /sys/bus/cxl/devices/decoder0.0/devtype
  cxl_decoder_root

Switch Decoder
~~~~~~~~~~~~~~
Any non-root, routing (non-translating) decoder is considered a `Switch
Decoder`, and will present with the type :code:`cxl_decoder_switch`. Both
`Host Bridge` and `CXL Switch` (device) decoders are of type
:code:`cxl_decoder_switch`. ::

  # ls /sys/bus/cxl/devices/decoder1.0/
  devtype                 locked    size       target_list
  interleave_granularity  modalias  start      target_type
  interleave_ways         region    subsystem  uevent

  # cat /sys/bus/cxl/devices/decoder1.0/devtype
  cxl_decoder_switch

  # cat /sys/bus/cxl/devices/decoder1.0/region
  region0

A `Switch Decoder` has associations between a region defined by a root
decoder and downstream target ports. Interleaving done within a switch decoder
is a multi-downstream-port interleave (or `Intra-Host-Bridge Interleave` for
host bridges).

Interleave settings in a switch decoder describe how to interleave accesses
among the *immediate downstream targets*, not the entire interleave set.


Switch decoders are created during :code:`cxl_switch_port_probe` in the
:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
registers.

Switch decoder programming is validated during probe if the platform programs
them during boot (see `Auto Decoders` below), or on commit if programmed at
runtime (see `Runtime Programming` below).


Endpoint Decoder
~~~~~~~~~~~~~~~~
Any decoder attached to a *terminal* point in the CXL fabric (an `Endpoint`) is
considered an `Endpoint Decoder`. Endpoint decoders are of type
:code:`cxl_decoder_endpoint`. ::

  # ls /sys/bus/cxl/devices/decoder5.0
  devtype                 locked    start
  dpa_resource            modalias  subsystem
  dpa_size                mode      target_type
  interleave_granularity  region    uevent
  interleave_ways         size

  # cat /sys/bus/cxl/devices/decoder5.0/devtype
  cxl_decoder_endpoint

  # cat /sys/bus/cxl/devices/decoder5.0/region
  region0

An `Endpoint Decoder` has an association with a region defined by a root
decoder and describes the device-local resource associated with this region.

Unlike root and switch decoders, endpoint decoders translate `Host Physical` to
`Device Physical` address ranges. The interleave settings on an endpoint
therefore describe the entire *interleave set*.

`Device Physical Address` regions must be committed in-order. For example, the
DPA region starting at 0x80000000 cannot be committed before the DPA region
starting at 0x0.

As of Linux v6.15, Linux does not support *imbalanced* interleave setups: all
endpoints in an interleave set are expected to have the same interleave
settings (granularity and ways must be the same).

Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
registers.

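
The translation role of an endpoint decoder can be illustrated with a small
Python model. This is a simplified sketch, not the driver's implementation: it
assumes plain modulo interleave with power-of-two ways, and the helper names
(:code:`endpoint_position`, :code:`hpa_to_dpa`) and the 2-way / 256-byte
settings are assumptions chosen for illustration. Only the region base address
is taken from the :code:`region0` resource shown earlier.

.. code-block:: python

   def endpoint_position(hpa, region_start, ways, granularity):
       """Position in the interleave set that owns the chunk containing hpa."""
       return ((hpa - region_start) // granularity) % ways


   def hpa_to_dpa(hpa, region_start, ways, granularity, dpa_base=0):
       """Translate an HPA to a DPA on the endpoint that owns it."""
       offset = hpa - region_start
       # Number of complete passes over the whole interleave set.
       stripe = offset // (ways * granularity)
       return dpa_base + stripe * granularity + offset % granularity


   REGION_START = 0xC050000000  # region0 resource from the example system
   WAYS, GRANULARITY = 2, 256   # assumed settings, for illustration only

   for delta in (0, 256, 512, 513):
       hpa = REGION_START + delta
       pos = endpoint_position(hpa, REGION_START, WAYS, GRANULARITY)
       dpa = hpa_to_dpa(hpa, REGION_START, WAYS, GRANULARITY)
       print(f"hpa={hpa:#x} position={pos} dpa={dpa:#x}")

Because the translation depends on both the ways and the granularity of the
whole set, an endpoint decoder in this model cannot compute a DPA from the
settings of its immediate parent alone. Which physical device corresponds to
each position depends on target ordering, which this sketch does not model.
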

Decoder Relationships
~~~~~~~~~~~~~~~~~~~~~
In the example described above, there is one root decoder which routes memory
accesses over two host bridges. Each host bridge has a decoder which routes
accesses to its single endpoint target. Each endpoint has a decoder which
translates HPA to DPA and services the memory request.

The driver validates relationships between ports by decoder programming, so
we can think of decoders being related in a similarly hierarchical fashion to
ports.

.. kernel-render:: DOT
   :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
   :caption: Digraph of CXL root, switch, and endpoint decoders.

   digraph foo {
     "root0" -> "decoder0.0";
     "decoder0.0" -> "decoder1.0";
     "decoder0.0" -> "decoder3.0";
     "decoder1.0" -> "decoder5.0";
     "decoder3.0" -> "decoder6.0";
   }

Regions
-------

Memory Region
~~~~~~~~~~~~~
A `Memory Region` is a logical construct that connects a set of CXL ports in
the fabric to an IO Memory Resource. It is ultimately used to expose the memory
on these devices to the DAX subsystem via a `DAX Region`.

An example RAM region: ::

  # ls /sys/bus/cxl/devices/region0/
  access0      devtype                 modalias  subsystem  uuid
  access1      driver                  mode      target0
  commit       interleave_granularity  resource  target1
  dax_region0  interleave_ways         size      uevent

A memory region can be constructed during endpoint probe, if decoders were
programmed by BIOS/EFI (see `Auto Decoders`), or by creating a region manually
via a `Root Decoder`'s :code:`create_ram_region` or :code:`create_pmem_region`
interfaces.

The interleave settings in a `Memory Region` describe the configuration of the
`Interleave Set` - and are what can be expected to be seen in the endpoint
interleave settings.

.. kernel-render:: DOT
   :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
   :caption: Regions are created based on root decoder configurations. Endpoint decoders
             must be programmed with the same interleave settings as the region.

   digraph foo {
     "root0" -> "decoder0.0";
     "decoder0.0" -> "region0";
     "region0" -> "decoder5.0";
     "region0" -> "decoder6.0";
   }

DAX Region
~~~~~~~~~~
A `DAX Region` is used to convert a CXL `Memory Region` to a DAX device. A
DAX device may then be accessed directly via a file descriptor interface, or
converted to System RAM via the DAX kmem driver. See the DAX driver section
for more details. ::

  # ls /sys/bus/cxl/devices/dax_region0/
  dax0.0      devtype  modalias   uevent
  dax_region  driver   subsystem

Mailbox Interfaces
------------------
A mailbox command interface for each device is exposed in ::

  /dev/cxl/mem0
  /dev/cxl/mem1

These mailboxes may receive any specification-defined command. Raw commands
(custom commands) can only be sent to these interfaces if the build config
:code:`CXL_MEM_RAW_COMMANDS` is set. This is considered a debug and/or
development interface, not an officially supported mechanism for creation
of vendor-specific commands (see the `fwctl` subsystem for that).

Decoder Programming
===================

Runtime Programming
-------------------
During probe, the only decoders *required* to be programmed are `Root Decoders`.
In reality, `Root Decoders` are a logical construct to describe the memory
region and interleave configuration at the host bridge level - as described
in the ACPI CEDT CFMWS.

All other `Switch` and `Endpoint` decoders may be programmed by the user
at runtime - if the platform supports such configurations.

This interaction is what creates a `Software Defined Memory` environment.


See the :code:`cxl-cli` documentation for more information about how to
configure CXL decoders at runtime.

Auto Decoders
-------------
Auto Decoders are decoders programmed by BIOS/EFI at boot time, and are
almost always locked (cannot be changed). This is done by a platform
which may have a static configuration - or certain quirks which may prevent
dynamic runtime changes to the decoders (such as requiring additional
controller programming within the CPU complex outside the scope of CXL).

Auto Decoders are probed automatically as long as the devices and memory
regions they are associated with probe without issue. When probing Auto
Decoders, the driver's primary responsibility is to ensure the fabric is
sane - as if validating runtime programmed regions and decoders.

If Linux cannot validate auto-decoder configuration, the memory will not
be surfaced as a DAX device - and therefore not be exposed to the page
allocator - effectively stranding it.

Interleave
----------

The Linux CXL driver supports `Cross-Link First` interleave. This dictates
how interleave is programmed at each decoder step, as the driver validates
the relationships between a decoder and its parent.

For example, in a `Cross-Link First` interleave setup with 16 endpoints
attached to 4 host bridges, Linux expects the following ways/granularity
across the root, host bridge, and endpoints respectively.

.. flat-table:: 4x4 cross-link first interleave settings

  * - decoder
    - ways
    - granularity

  * - root
    - 4
    - 256

  * - host bridge
    - 4
    - 1024

  * - endpoint
    - 16
    - 256

At the root, a given access will be routed to the
:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, the
access will be routed to the :code:`((HPA / 1024) % 4)th` target endpoint.
Each endpoint translates based on the entire 16 device interleave set.

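
The routing arithmetic for this 4x4 example can be sketched in Python. This is
an illustrative model only, not driver code: the :code:`route` and
:code:`path` helpers are hypothetical names, and the ways and granularity
values are the ones listed for the root and host bridge decoders in this
example.

.. code-block:: python

   def route(hpa, ways, granularity):
       """Index of the downstream target selected by a routing decoder."""
       return (hpa // granularity) % ways


   ROOT_WAYS, ROOT_GRANULARITY = 4, 256
   HB_WAYS, HB_GRANULARITY = 4, 1024

   # Cross-Link First: the host bridge granularity is the root granularity
   # multiplied by the root ways.
   assert HB_GRANULARITY == ROOT_GRANULARITY * ROOT_WAYS


   def path(hpa):
       """Return (host bridge index, endpoint index) chosen for an HPA."""
       host_bridge = route(hpa, ROOT_WAYS, ROOT_GRANULARITY)
       endpoint = route(hpa, HB_WAYS, HB_GRANULARITY)
       return host_bridge, endpoint


   # Walk sixteen consecutive 256-byte chunks through the two routing levels.
   pairs = [path(chunk * 256) for chunk in range(16)]
   for chunk, (host_bridge, endpoint) in enumerate(pairs):
       print(f"chunk {chunk:2d}: host bridge {host_bridge}, endpoint {endpoint}")

In this model, consecutive chunks rotate across host bridges first and only
then advance to the next endpoint behind each bridge, so sixteen consecutive
chunks reach sixteen distinct (host bridge, endpoint) pairs before the pattern
repeats.
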

Unbalanced interleave sets are not supported - decoders at a similar point
in the hierarchy (e.g. all host bridge decoders) must have the same ways and
granularity configuration.

At Root
~~~~~~~
Root decoder interleave is defined by the CFMWS field of the :doc:`CEDT
<../platform/acpi/cedt>`. The CEDT may actually define multiple CFMWS
configurations to describe the same physical capacity, with the intent to allow
users to decide at runtime whether to online memory as interleaved or
non-interleaved. ::

  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Window base address      : 0000000100000000
  Window size              : 0000000100000000
  Interleave Members (2^n) : 00
  Interleave Arithmetic    : 00
  First Target             : 00000007

  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Window base address      : 0000000200000000
  Window size              : 0000000100000000
  Interleave Members (2^n) : 00
  Interleave Arithmetic    : 00
  First Target             : 00000006

  Subtable Type            : 01 [CXL Fixed Memory Window Structure]
  Window base address      : 0000000300000000
  Window size              : 0000000200000000
  Interleave Members (2^n) : 01
  Interleave Arithmetic    : 00
  First Target             : 00000007
  Next Target              : 00000006

In this example, the CFMWS defines two discrete non-interleaved 4GB regions,
one for each host bridge, and one interleaved 8GB region that targets both.
This would result in 3 root decoders presenting in the root. ::

  # ls /sys/bus/cxl/devices/root0/decoder*
  decoder0.0  decoder0.1  decoder0.2

  # cat /sys/bus/cxl/devices/decoder0.0/{target_list,start,size}
  7
  0x100000000
  0x100000000

  # cat /sys/bus/cxl/devices/decoder0.1/{target_list,start,size}
  6
  0x200000000
  0x100000000

  # cat /sys/bus/cxl/devices/decoder0.2/{target_list,start,size}
  7,6
  0x300000000
  0x200000000

These decoders are not runtime programmable.
They are used to generate a
`Memory Region` to bring this memory online with runtime programmed settings
at the `Switch` and `Endpoint` decoders.

At Host Bridge or Switch
~~~~~~~~~~~~~~~~~~~~~~~~
`Host Bridge` and `Switch` decoders are programmable via the following fields:

- :code:`start` - the HPA region associated with the memory region
- :code:`size` - the size of the region
- :code:`target_list` - the list of downstream ports
- :code:`interleave_ways` - the number of downstream ports to interleave across
- :code:`interleave_granularity` - the granularity to interleave at

Linux expects the :code:`interleave_granularity` of switch decoders to be
derived from their upstream port connections. In `Cross-Link First` interleave
configurations, the :code:`interleave_granularity` of a decoder is equal to
:code:`parent_interleave_granularity * parent_interleave_ways`.

At Endpoint
~~~~~~~~~~~
`Endpoint Decoders` are programmed similarly to Host Bridge and Switch
decoders, with the exception that the ways and granularity are defined by the
interleave set (i.e. the interleave settings defined by the associated
`Memory Region`).

- :code:`start` - the HPA region associated with the memory region
- :code:`size` - the size of the region
- :code:`interleave_ways` - the number of endpoints in the interleave set
- :code:`interleave_granularity` - the granularity to interleave at

These settings are used by endpoint decoders to *Translate* memory requests
from HPA to DPA. This is why they must be aware of the entire interleave set.

Linux does not support unbalanced interleave configurations. As a result, all
endpoints in an interleave set must have the same ways and granularity.

Example Configurations
======================
.. toctree::
   :maxdepth: 1

   example-configurations/single-device.rst
   example-configurations/hb-interleave.rst
   example-configurations/intra-hb-interleave.rst
   example-configurations/multi-interleave.rst