1.. SPDX-License-Identifier: GPL-2.0
2
3====================
4CXL Driver Operation
5====================
6
7The devices described in this section are present in ::
8
9  /sys/bus/cxl/devices/
10  /dev/cxl/
11
The :code:`cxl-cli` library, maintained as part of the NDCTL project, may
be used to script interactions with these devices.
14
15Drivers
16=======
17The CXL driver is split into a number of drivers.
18
19* cxl_core  - fundamental init interface and core object creation
20* cxl_port  - initializes root and provides port enumeration interface.
21* cxl_acpi  - initializes root decoders and interacts with ACPI data.
22* cxl_p/mem - initializes memory devices
23* cxl_pci   - uses cxl_port to enumerate the actual fabric hierarchy.
24
25Driver Devices
26==============
27Here is an example from a single-socket system with 4 host bridges. Two host
28bridges have a single memory device attached, and the devices are interleaved
29into a single memory region. The memory region has been converted to dax. ::
30
31  # ls /sys/bus/cxl/devices/
32    dax_region0  decoder3.0  decoder6.0  mem0   port3
33    decoder0.0   decoder4.0  decoder6.1  mem1   port4
34    decoder1.0   decoder5.0  endpoint5   port1  region0
35    decoder2.0   decoder5.1  endpoint6   port2  root0
36
37
38.. kernel-render:: DOT
39   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Digraph of CXL fabric with a host-bridge interleave memory region
41
42   digraph foo {
43     "root0" -> "port1";
44     "root0" -> "port3";
45     "root0" -> "decoder0.0";
46     "port1" -> "endpoint5";
47     "port3" -> "endpoint6";
48     "port1" -> "decoder1.0";
49     "port3" -> "decoder3.0";
50     "endpoint5" -> "decoder5.0";
51     "endpoint6" -> "decoder6.0";
52     "decoder0.0" -> "region0";
53     "decoder0.0" -> "decoder1.0";
54     "decoder0.0" -> "decoder3.0";
55     "decoder1.0" -> "decoder5.0";
56     "decoder3.0" -> "decoder6.0";
57     "decoder5.0" -> "region0";
58     "decoder6.0" -> "region0";
59     "region0" -> "dax_region0";
60     "dax_region0" -> "dax0.0";
61   }
62
This section explores the devices present in this configuration. Further
configurations are covered in depth in the example configurations below.
65
66Base Devices
67------------
Most devices in a CXL fabric are a `port` of some kind (because each
device mostly routes requests from one device to the next, rather than
providing a direct service).
71
72Root
73~~~~
The `CXL Root` is a logical object created by the `cxl_acpi` driver during
:code:`cxl_acpi_probe` - if the :code:`ACPI0017` `Compute Express Link
Root Object` Device Class is found.
77
78The Root contains links to:
79
80* `Host Bridge Ports` defined by CHBS in the :doc:`CEDT<../platform/acpi/cedt>`
81
82* `Downstream Ports` typically connected to `Host Bridge Ports`.
83
* `Root Decoders` defined by CFMWS in the :doc:`CEDT<../platform/acpi/cedt>`
85
86::
87
88  # ls /sys/bus/cxl/devices/root0
89    decoder0.0          dport0  dport5    port2  subsystem
90    decoders_committed  dport1  modalias  port3  uevent
91    devtype             dport4  port1     port4  uport
92
93  # cat /sys/bus/cxl/devices/root0/devtype
94    cxl_port
95
96  # cat port1/devtype
97    cxl_port
98
99  # cat decoder0.0/devtype
100    cxl_decoder_root
101
The root is the first `logical port` in the CXL fabric, as presented by the
Linux CXL driver.  The `CXL root` is a special type of `switch port`, in that
it only has downstream port connections.
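
As a hedged illustration, the :code:`devtype` attributes shown above can be
collected with a few lines of Python.  The helper name :code:`cxl_devtypes`
is hypothetical and not part of any CXL tooling; it only assumes the sysfs
layout shown in this section.

.. code-block:: python

   import os


   def cxl_devtypes(base="/sys/bus/cxl/devices"):
       """Map each device name under ``base`` to the text of its devtype file."""
       devtypes = {}
       if not os.path.isdir(base):
           # No CXL bus on this system (or a different sysfs mount point).
           return devtypes
       for name in sorted(os.listdir(base)):
           try:
               with open(os.path.join(base, name, "devtype")) as f:
                   devtypes[name] = f.read().strip()
           except OSError:
               # Skip entries without a readable devtype attribute.
               continue
       return devtypes

On the example system, :code:`cxl_devtypes()["root0"]` would be expected to
read :code:`cxl_port` and :code:`cxl_devtypes()["decoder0.0"]` to read
:code:`cxl_decoder_root`, matching the output above.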
105
106Port
107~~~~
A `port` object is better described as a `switch port`.  It may represent a
host bridge to the root or an actual switch port on a switch. A `switch port`
contains one or more decoders used to route memory requests to downstream
ports, which may be connected to another `switch port` or an `endpoint port`.
112
113::
114
115  # ls /sys/bus/cxl/devices/port1
116    decoder1.0          dport0    driver     parent_dport  uport
117    decoders_committed  dport113  endpoint5  subsystem
118    devtype             dport2    modalias   uevent
119
120  # cat devtype
121    cxl_port
122
123  # cat decoder1.0/devtype
124    cxl_decoder_switch
125
126  # cat endpoint5/devtype
127    cxl_port
128
CXL `Host Bridges` in the fabric are probed during :code:`cxl_acpi_probe` at
the time the `CXL Root` is probed.  This allows for an immediate logical
connection between the root and host bridge.
132
133* The root has a downstream port connection to a host bridge
134
135* The host bridge has an upstream port connection to the root.
136
137* The host bridge has one or more downstream port connections to switch
138  or endpoint ports.
139
A `Host Bridge` is a special type of CXL `switch port`. It is explicitly
defined in the ACPI specification via the `ACPI0016` ID.  `Host Bridge` ports
are probed at :code:`cxl_acpi_probe` time, while similar ports on an actual
switch are probed later.  Otherwise, switch and host bridge ports look very
similar - they both contain switch decoders which route accesses between
upstream and downstream ports.
146
147Endpoint
148~~~~~~~~
149An `endpoint` is a terminal port in the fabric.  This is a `logical device`,
150and may be one of many `logical devices` presented by a memory device. It
151is still considered a type of `port` in the fabric.
152
153An `endpoint` contains `endpoint decoders` and the device's Coherent Device
154Attribute Table (which describes the device's capabilities). ::
155
156  # ls /sys/bus/cxl/devices/endpoint5
157    CDAT        decoders_committed  modalias      uevent
158    decoder5.0  devtype             parent_dport  uport
159    decoder5.1  driver              subsystem
160
161  # cat /sys/bus/cxl/devices/endpoint5/devtype
162    cxl_port
163
164  # cat /sys/bus/cxl/devices/endpoint5/decoder5.0/devtype
165    cxl_decoder_endpoint
166
167
168Memory Device (memdev)
169~~~~~~~~~~~~~~~~~~~~~~
170A `memdev` is probed and added by the `cxl_pci` driver in :code:`cxl_pci_probe`
171and is managed by the `cxl_mem` driver. It primarily provides the `IOCTL`
172interface to a memory device, via :code:`/dev/cxl/memN`, and exposes various
173device configuration data. ::
174
175  # ls /sys/bus/cxl/devices/mem0
176    dev       firmware_version    payload_max  security   uevent
177    driver    label_storage_size  pmem         serial
178    firmware  numa_node           ram          subsystem
179
180A Memory Device is a discrete base object that is not a port.  While the
181physical device it belongs to may also host an `endpoint`, the relationship
182between an `endpoint` and a `memdev` is not captured in sysfs.
183
184Port Relationships
185~~~~~~~~~~~~~~~~~~
186In our example described above, there are four host bridges attached to the
187root, and two of the host bridges have one endpoint attached.
188
189.. kernel-render:: DOT
190   :alt: Digraph of CXL fabric describing host-bridge interleaving
   :caption: Digraph of CXL fabric with a host-bridge interleave memory region
192
193   digraph foo {
194     "root0"    -> "port1";
195     "root0"    -> "port2";
196     "root0"    -> "port3";
197     "root0"    -> "port4";
198     "port1" -> "endpoint5";
199     "port3" -> "endpoint6";
200   }
201
202Decoders
203--------
A `Decoder` is short for a CXL Host-Managed Device Memory (HDM) Decoder. It is
a device that routes accesses through the CXL fabric to an endpoint, and at
the endpoint translates a `Host Physical Address` to a `Device Physical
Address`.
207
208The CXL 3.1 specification heavily implies that only endpoint decoders should
209engage in translation of `Host Physical Address` to `Device Physical Address`.
210::
211
212  8.2.4.20 CXL HDM Decoder Capability Structure
213
214  IMPLEMENTATION NOTE
215  CXL Host Bridge and Upstream Switch Port Decode Flow
216
217  IMPLEMENTATION NOTE
218  Device Decode Logic
219
220These notes imply that there are two logical groups of decoders.
221
222* Routing Decoder - a decoder which routes accesses but does not translate
223  addresses from HPA to DPA.
224
225* Translating Decoder - a decoder which translates accesses from HPA to DPA
226  for an endpoint to service.
227
The CXL drivers distinguish 3 decoder types: root, switch, and endpoint. Only
endpoint decoders are Translating Decoders; all others are Routing Decoders.
230
231.. note:: PLATFORM VENDORS BE AWARE
232
233   Linux makes a strong assumption that endpoint decoders are the only decoder
234   in the fabric that actively translates HPA to DPA.  Linux assumes routing
235   decoders pass the HPA unchanged to the next decoder in the fabric.
236
   It is therefore assumed that any given decoder in the fabric will have an
   address range that is a subset of its upstream port decoder. Any deviation
   from this scheme is undefined per the specification.  Linux prioritizes
   spec-defined / architectural behavior.
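
The subset assumption can be expressed as a simple range check.  This is a
minimal sketch, not the driver's validation logic; :code:`range_within` is a
hypothetical helper, and its inputs correspond to the :code:`start` and
:code:`size` attributes that decoders expose in sysfs.

.. code-block:: python

   def range_within(child_start, child_size, parent_start, parent_size):
       """True if [child_start, child_start + child_size) lies inside the parent."""
       child_end = child_start + child_size
       parent_end = parent_start + parent_size
       return parent_start <= child_start and child_end <= parent_end


   # Illustrative values only: a parent window and two candidate child ranges.
   PARENT = (0x300000000, 0x200000000)
   inside = range_within(0x300000000, 0x200000000, *PARENT)
   outside = range_within(0x200000000, 0x200000000, *PARENT)

Here :code:`inside` is true because the child covers exactly the parent
range, while :code:`outside` is false because the child starts below it.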
241
242Decoders may have one or more `Downstream Targets` if configured to interleave
243memory accesses.  This will be presented in sysfs via the :code:`target_list`
244parameter.
245
246Root Decoder
247~~~~~~~~~~~~
A `Root Decoder` is a logical construct of the physical address and interleave
configurations present in the CFMWS field of the :doc:`CEDT
<../platform/acpi/cedt>`.
251Linux presents this information as a decoder present in the `CXL Root`.  We
252consider this a `Root Decoder`, though technically it exists on the boundary
253of the CXL specification and platform-specific CXL root implementations.
254
Linux considers these logical decoders a type of `Routing Decoder`. They are
the first decoders in the CXL fabric to receive a memory access from the
platform's memory controllers.
258
259`Root Decoders` are created during :code:`cxl_acpi_probe`.  One root decoder
260is created per CFMWS entry in the :doc:`CEDT <../platform/acpi/cedt>`.
261
262The :code:`target_list` parameter is filled by the CFMWS target fields. Targets
263of a root decoder are `Host Bridges`, which means interleave done at the root
264decoder level is an `Inter-Host-Bridge Interleave`.
265
266Only root decoders are capable of `Inter-Host-Bridge Interleave`.
267
268Such interleaves must be configured by the platform and described in the ACPI
269CEDT CFMWS, as the target CXL host bridge UIDs in the CFMWS must match the CXL
270host bridge UIDs in the CHBS field of the :doc:`CEDT
271<../platform/acpi/cedt>` and the UID field of CXL Host Bridges defined in
272the :doc:`DSDT <../platform/acpi/dsdt>`.
273
274Interleave settings in a root decoder describe how to interleave accesses among
275the *immediate downstream targets*, not the entire interleave set.
276
277The memory range described in the root decoder is used to
278
2791) Create a memory region (:code:`region0` in this example), and
280
2812) Associate the region with an IO Memory Resource (:code:`kernel/resource.c`)
282
283::
284
285  # ls /sys/bus/cxl/devices/decoder0.0/
286    cap_pmem           devtype                 region0
287    cap_ram            interleave_granularity  size
288    cap_type2          interleave_ways         start
289    cap_type3          locked                  subsystem
290    create_ram_region  modalias                target_list
291    delete_region      qos_class               uevent
292
293  # cat /sys/bus/cxl/devices/decoder0.0/region0/resource
294    0xc050000000
295
296The IO Memory Resource is created during early boot when the CFMWS region is
297identified in the EFI Memory Map or E820 table (on x86).
298
299Root decoders are defined as a separate devtype, but are also a type
300of `Switch Decoder` due to having downstream targets. ::
301
302  # cat /sys/bus/cxl/devices/decoder0.0/devtype
303    cxl_decoder_root
304
305Switch Decoder
306~~~~~~~~~~~~~~
Any non-root routing decoder is considered a `Switch Decoder`, and will
present with the type :code:`cxl_decoder_switch`. Both `Host Bridge` and `CXL
Switch` (device) decoders are of type :code:`cxl_decoder_switch`. ::
310
311  # ls /sys/bus/cxl/devices/decoder1.0/
312    devtype                 locked    size       target_list
313    interleave_granularity  modalias  start      target_type
314    interleave_ways         region    subsystem  uevent
315
316  # cat /sys/bus/cxl/devices/decoder1.0/devtype
317    cxl_decoder_switch
318
319  # cat /sys/bus/cxl/devices/decoder1.0/region
320    region0
321
322A `Switch Decoder` has associations between a region defined by a root
323decoder and downstream target ports.  Interleaving done within a switch decoder
324is a multi-downstream-port interleave (or `Intra-Host-Bridge Interleave` for
325host bridges).
326
327Interleave settings in a switch decoder describe how to interleave accesses
328among the *immediate downstream targets*, not the entire interleave set.
329
Switch decoders are created during :code:`cxl_switch_port_probe` in the
:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
registers.
333
334Switch decoder programming is validated during probe if the platform programs
335them during boot (See `Auto Decoders` below), or on commit if programmed at
336runtime (See `Runtime Programming` below).
337
338
339Endpoint Decoder
340~~~~~~~~~~~~~~~~
Any decoder attached to a *terminal* point in the CXL fabric (an `Endpoint`) is
considered an `Endpoint Decoder`. Endpoint decoders are of type
:code:`cxl_decoder_endpoint`. ::
344
345  # ls /sys/bus/cxl/devices/decoder5.0
346    devtype                 locked    start
347    dpa_resource            modalias  subsystem
348    dpa_size                mode      target_type
349    interleave_granularity  region    uevent
350    interleave_ways         size
351
352  # cat /sys/bus/cxl/devices/decoder5.0/devtype
353    cxl_decoder_endpoint
354
355  # cat /sys/bus/cxl/devices/decoder5.0/region
356    region0
357
358An `Endpoint Decoder` has an association with a region defined by a root
359decoder and describes the device-local resource associated with this region.
360
361Unlike root and switch decoders, endpoint decoders translate `Host Physical` to
362`Device Physical` address ranges.  The interleave settings on an endpoint
363therefore describe the entire *interleave set*.
364
365`Device Physical Address` regions must be committed in-order. For example, the
366DPA region starting at 0x80000000 cannot be committed before the DPA region
367starting at 0x0.
368
As of Linux v6.15, Linux does not support *imbalanced* interleave setups: all
endpoints in an interleave set are expected to have the same interleave
settings (granularity and ways must be the same).
372
Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
registers.
375
376Decoder Relationships
377~~~~~~~~~~~~~~~~~~~~~
In our example described above, there is one root decoder which routes memory
accesses over two host bridges.  Each host bridge has a decoder which routes
accesses to its single endpoint target.  Each endpoint has a decoder which
translates HPA to DPA and services the memory request.
382
383The driver validates relationships between ports by decoder programming, so
384we can think of decoders being related in a similarly hierarchical fashion to
385ports.
386
387.. kernel-render:: DOT
388   :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
   :caption: Digraph of CXL root, switch, and endpoint decoders.
390
391   digraph foo {
392     "root0"    -> "decoder0.0";
393     "decoder0.0" -> "decoder1.0";
394     "decoder0.0" -> "decoder3.0";
395     "decoder1.0" -> "decoder5.0";
396     "decoder3.0" -> "decoder6.0";
397   }
398
399Regions
400-------
401
402Memory Region
403~~~~~~~~~~~~~
404A `Memory Region` is a logical construct that connects a set of CXL ports in
405the fabric to an IO Memory Resource.  It is ultimately used to expose the memory
406on these devices to the DAX subsystem via a `DAX Region`.
407
408An example RAM region: ::
409
410  # ls /sys/bus/cxl/devices/region0/
411    access0      devtype                 modalias  subsystem  uuid
412    access1      driver                  mode      target0
413    commit       interleave_granularity  resource  target1
414    dax_region0  interleave_ways         size      uevent
415
416A memory region can be constructed during endpoint probe, if decoders were
417programmed by BIOS/EFI (see `Auto Decoders`), or by creating a region manually
418via a `Root Decoder`'s :code:`create_ram_region` or :code:`create_pmem_region`
419interfaces.
420
The interleave settings in a `Memory Region` describe the configuration of the
`Interleave Set` - and are what the endpoint decoders' interleave settings
are expected to show.
424
425.. kernel-render:: DOT
426   :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
427   :caption: Regions are created based on root decoder configurations. Endpoint decoders
428             must be programmed with the same interleave settings as the region.
429
430   digraph foo {
431     "root0"    -> "decoder0.0";
432     "decoder0.0" -> "region0";
433     "region0" -> "decoder5.0";
434     "region0" -> "decoder6.0";
435   }
436
437DAX Region
438~~~~~~~~~~
439A `DAX Region` is used to convert a CXL `Memory Region` to a DAX device. A
440DAX device may then be accessed directly via a file descriptor interface, or
441converted to System RAM via the DAX kmem driver.  See the DAX driver section
442for more details. ::
443
444  # ls /sys/bus/cxl/devices/dax_region0/
445    dax0.0      devtype  modalias   uevent
446    dax_region  driver   subsystem
447
448Mailbox Interfaces
449------------------
450A mailbox command interface for each device is exposed in ::
451
452  /dev/cxl/mem0
453  /dev/cxl/mem1
454
455These mailboxes may receive any specification-defined command. Raw commands
456(custom commands) can only be sent to these interfaces if the build config
457:code:`CXL_MEM_RAW_COMMANDS` is set.  This is considered a debug and/or
458development interface, not an officially supported mechanism for creation
459of vendor-specific commands (see the `fwctl` subsystem for that).
460
461Decoder Programming
462===================
463
464Runtime Programming
465-------------------
466During probe, the only decoders *required* to be programmed are `Root Decoders`.
467In reality, `Root Decoders` are a logical construct to describe the memory
468region and interleave configuration at the host bridge level - as described
469in the ACPI CEDT CFMWS.
470
471All other `Switch` and `Endpoint` decoders may be programmed by the user
472at runtime - if the platform supports such configurations.
473
474This interaction is what creates a `Software Defined Memory` environment.
475
476See the :code:`cxl-cli` documentation for more information about how to
477configure CXL decoders at runtime.
478
479Auto Decoders
480-------------
481Auto Decoders are decoders programmed by BIOS/EFI at boot time, and are
482almost always locked (cannot be changed).  This is done by a platform
483which may have a static configuration - or certain quirks which may prevent
484dynamic runtime changes to the decoders (such as requiring additional
485controller programming within the CPU complex outside the scope of CXL).
486
Auto Decoders are probed automatically as long as the devices and memory
regions they are associated with probe without issue.  When probing Auto
Decoders, the driver's primary responsibility is to ensure the fabric is
sane - as if validating runtime programmed regions and decoders.
491
492If Linux cannot validate auto-decoder configuration, the memory will not
493be surfaced as a DAX device - and therefore not be exposed to the page
494allocator - effectively stranding it.
495
496Interleave
497----------
498
The Linux CXL driver supports `Cross-Link First` interleave. This dictates
how interleave is programmed at each decoder step, as the driver validates
the relationships between a decoder and its parent.
502
For example, in a `Cross-Link First` interleave setup with 16 endpoints
attached to 4 host bridges, Linux expects the following ways/granularity
across the root, host bridge, and endpoints respectively.
506
507.. flat-table:: 4x4 cross-link first interleave settings
508
509  * - decoder
510    - ways
511    - granularity
512
513  * - root
514    - 4
515    - 256
516
517  * - host bridge
518    - 4
519    - 1024
520
521  * - endpoint
522    - 16
523    - 256
524
At the root, a given access will be routed to the
:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, the
access will be routed to the :code:`((HPA / 1024) % 4)th` target endpoint.
Each endpoint translates based on the entire 16 device interleave set.
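
The routing arithmetic above can be sketched in Python.  This is an
illustration of the two formulas only, using the values from the 4x4 table;
the names :code:`target_index` and :code:`route` are hypothetical, and the
HPA is used directly as in the formulas above (any window base is assumed
to be aligned to the full interleave stripe).

.. code-block:: python

   def target_index(hpa, granularity, ways):
       """Index of the downstream target that receives ``hpa``."""
       return (hpa // granularity) % ways


   # Settings from the 4x4 table above: (granularity, ways).
   ROOT = (256, 4)
   HOST_BRIDGE = (1024, 4)


   def route(hpa):
       """Return (host bridge index, endpoint index within that host bridge)."""
       return target_index(hpa, *ROOT), target_index(hpa, *HOST_BRIDGE)

Stepping the HPA from 0 in 256-byte increments visits all 16 (host bridge,
endpoint) pairs before repeating at offset 4096, which is consistent with
each endpoint decoder being programmed with 16 ways.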
529
530Unbalanced interleave sets are not supported - decoders at a similar point
531in the hierarchy (e.g. all host bridge decoders) must have the same ways and
532granularity configuration.
533
534At Root
535~~~~~~~
Root decoder interleave is defined by the CFMWS field of the :doc:`CEDT
<../platform/acpi/cedt>`.  The CEDT may actually define multiple CFMWS
configurations to describe the same physical capacity, with the intent to allow
users to decide at runtime whether to online memory as interleaved or
non-interleaved. ::
541
542             Subtable Type : 01 [CXL Fixed Memory Window Structure]
543       Window base address : 0000000100000000
544               Window size : 0000000100000000
545  Interleave Members (2^n) : 00
546     Interleave Arithmetic : 00
547              First Target : 00000007
548
549             Subtable Type : 01 [CXL Fixed Memory Window Structure]
550       Window base address : 0000000200000000
551               Window size : 0000000100000000
552  Interleave Members (2^n) : 00
553     Interleave Arithmetic : 00
554              First Target : 00000006
555
556             Subtable Type : 01 [CXL Fixed Memory Window Structure]
557       Window base address : 0000000300000000
558               Window size : 0000000200000000
559  Interleave Members (2^n) : 01
560     Interleave Arithmetic : 00
561              First Target : 00000007
562               Next Target : 00000006
563
In this example, the CFMWS defines two discrete non-interleaved 4GB regions,
one for each host bridge, and one interleaved 8GB region that targets both.
This results in 3 root decoders presenting in the root. ::
567
568  # ls /sys/bus/cxl/devices/root0/decoder*
569    decoder0.0  decoder0.1  decoder0.2
570
571  # cat /sys/bus/cxl/devices/decoder0.0/target_list start size
572    7
573    0x100000000
574    0x100000000
575
576  # cat /sys/bus/cxl/devices/decoder0.1/target_list start size
577    6
578    0x200000000
579    0x100000000
580
581  # cat /sys/bus/cxl/devices/decoder0.2/target_list start size
582    7,6
583    0x300000000
584    0x200000000
585
586These decoders are not runtime programmable.  They are used to generate a
587`Memory Region` to bring this memory online with runtime programmed settings
588at the `Switch` and `Endpoint` decoders.
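
The mapping from the CFMWS entries above to the root decoder attributes can
be sketched as follows.  This is a hedged illustration, not the
:code:`cxl_acpi` parsing code: :code:`FixedMemoryWindow` is a hypothetical
type, and the interleave field is treated as a plain power-of-two exponent,
as the dump labels it.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class FixedMemoryWindow:
       base: int
       size: int
       interleave_exp: int  # the "Interleave Members (2^n)" value
       targets: tuple       # host bridge UIDs, first target first

       @property
       def ways(self):
           return 1 << self.interleave_exp

       def root_decoder_view(self):
           """Values comparable to target_list, start and size in sysfs."""
           if len(self.targets) != self.ways:
               raise ValueError("target count does not match interleave ways")
           return (",".join(str(t) for t in self.targets),
                   hex(self.base), hex(self.size))


   # The three windows from the CEDT dump above.
   WINDOWS = [
       FixedMemoryWindow(0x100000000, 0x100000000, 0, (7,)),
       FixedMemoryWindow(0x200000000, 0x100000000, 0, (6,)),
       FixedMemoryWindow(0x300000000, 0x200000000, 1, (7, 6)),
   ]

Calling :code:`root_decoder_view()` on the third window yields the same
three values shown for :code:`decoder0.2` above.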
589
590At Host Bridge or Switch
591~~~~~~~~~~~~~~~~~~~~~~~~
592`Host Bridge` and `Switch` decoders are programmable via the following fields:
593
594- :code:`start` - the HPA region associated with the memory region
595- :code:`size` - the size of the region
596- :code:`target_list` - the list of downstream ports
597- :code:`interleave_ways` - the number downstream ports to interleave across
598- :code:`interleave_granularity` - the granularity to interleave at.
599
600Linux expects the :code:`interleave_granularity` of switch decoders to be
601derived from their upstream port connections. In `Cross-Link First` interleave
602configurations, the :code:`interleave_granularity` of a decoder is equal to
603:code:`parent_interleave_granularity * parent_interleave_ways`.
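
A minimal sketch of this rule, assuming a chain of routing decoders from the
root downward.  The function names are hypothetical and the code only
restates the formula above; it is not the driver's validation logic.

.. code-block:: python

   from math import prod


   def cross_link_first_granularities(root_granularity, ways_per_level):
       """Granularity expected at each routing level, root first.

       ``ways_per_level`` lists interleave_ways for each routing decoder
       level, starting at the root decoder.
       """
       granularities = [root_granularity]
       for parent_ways in ways_per_level[:-1]:
           # child granularity = parent granularity * parent ways
           granularities.append(granularities[-1] * parent_ways)
       return granularities


   def interleave_set_ways(ways_per_level):
       """Total number of endpoints reached through all routing levels."""
       return prod(ways_per_level)

For a root decoder with 4 ways at 256 bytes feeding host bridges with 4 ways
each, :code:`cross_link_first_granularities(256, [4, 4])` gives 256 at the
root and 1024 at the host bridges, and :code:`interleave_set_ways([4, 4])`
gives the 16 ways seen by each endpoint in the table above.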
604
605At Endpoint
606~~~~~~~~~~~
`Endpoint Decoders` are programmed similarly to Host Bridge and Switch
decoders, with the exception that the ways and granularity are defined by the
interleave set (e.g. the interleave settings defined by the associated
`Memory Region`).
610
611- :code:`start` - the HPA region associated with the memory region
612- :code:`size` - the size of the region
- :code:`interleave_ways` - the number of endpoints in the interleave set
614- :code:`interleave_granularity` - the granularity to interleave at.
615
616These settings are used by endpoint decoders to *Translate* memory requests
617from HPA to DPA.  This is why they must be aware of the entire interleave set.
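
The translation can be illustrated with a short sketch.  It assumes
power-of-two modulo interleave, with offsets measured from the start of the
region's HPA range and from the start of the endpoint's DPA allocation.  The
function names are hypothetical; this shows the arithmetic only and is not
taken from the driver.

.. code-block:: python

   def endpoint_position(hpa_offset, granularity, ways):
       """Position in the interleave set of the endpoint servicing this offset."""
       return (hpa_offset // granularity) % ways


   def hpa_to_dpa_offset(hpa_offset, granularity, ways):
       """Device-local offset: the HPA offset with the interleave selection removed."""
       stripe = granularity * ways          # bytes covered by one pass over the set
       chunk = hpa_offset // stripe         # which granule on this endpoint
       return chunk * granularity + hpa_offset % granularity

With 16 ways at 256 bytes, HPA offsets 0 and 256 both map to DPA offset 0 but
on positions 0 and 1, while HPA offset 4096 returns to position 0 at DPA
offset 256.  Both results depend on the full ways and granularity of the
interleave set.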
618
619Linux does not support unbalanced interleave configurations.  As a result, all
620endpoints in an interleave set must have the same ways and granularity.
621
622Example Configurations
623======================
624.. toctree::
625   :maxdepth: 1
626
627   example-configurations/single-device.rst
628   example-configurations/hb-interleave.rst
629   example-configurations/intra-hb-interleave.rst
630   example-configurations/multi-interleave.rst
631