.. SPDX-License-Identifier: GPL-2.0

============================
PCI Peer-to-Peer DMA Support
============================

The PCI bus has pretty decent support for performing DMA transfers
between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then, depending on the ACS settings, the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.

However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or be routed internally to the SOC.

The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains, and the kernel defaults to blocking such routing. An allow
list identifies known-good hardware, in which case P2P between any
two PCIe devices will be permitted.
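The routing rules above can be illustrated with a small userspace model. This
is only a sketch under stated assumptions, not the kernel's implementation:
``struct toy_node`` and the ``toy_*`` functions are hypothetical names, the
hierarchy is a plain tree whose root node stands for the host bridge, and ACS
settings are ignored entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node in a PCIe hierarchy. */
struct toy_node {
	const struct toy_node *parent;	/* NULL for the host bridge */
};

/* Number of hops from @n up to the root of its hierarchy. */
static int toy_depth(const struct toy_node *n)
{
	int depth = 0;

	while (n->parent) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/*
 * Walk both devices up to their closest common upstream node and return
 * the number of hops between them. *via_host_bridge is set when that
 * common node is the host bridge, i.e. the traffic leaves the
 * well-defined part of the hierarchy. Returns -1 when the two devices
 * do not share a hierarchy at all.
 */
static int toy_p2p_distance(const struct toy_node *a,
			    const struct toy_node *b,
			    bool *via_host_bridge)
{
	int da = toy_depth(a), db = toy_depth(b), hops = 0;

	/* Bring the deeper device up to the depth of the other one. */
	while (da > db) {
		a = a->parent;
		da--;
		hops++;
	}
	while (db > da) {
		b = b->parent;
		db--;
		hops++;
	}

	/* Climb in lockstep until the paths meet. */
	while (a != b) {
		if (!a->parent)
			return -1;	/* two different roots */
		a = a->parent;
		b = b->parent;
		hops += 2;
	}

	*via_host_bridge = (a->parent == NULL);
	return hops;
}

/* Model of the policy: block host bridge routing unless allow-listed. */
static bool toy_p2p_allowed(const struct toy_node *a,
			    const struct toy_node *b,
			    bool host_bridge_allow_listed)
{
	bool via_host_bridge;

	if (toy_p2p_distance(a, b, &via_host_bridge) < 0)
		return false;
	return !via_host_bridge || host_bridge_allow_listed;
}
```

In this model two endpoints behind the same switch are always allowed, while
two endpoints under different root ports are allowed only when the caller says
the host bridge is on the allow list.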

Since P2P inherently involves transactions between two devices it requires two
drivers to be co-operating inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mappings removed, all CPU accesses prevented, and all page
table mappings undone before the providing driver completes remove().

This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.
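The refcount variant of this rule can be sketched as a small single-threaded
userspace model. All names here (``struct toy_provider``, ``toy_provider_get()``
and so on) are hypothetical and exist only for illustration; a real driver
would use the kernel's reference-counting primitives and would sleep until the
last reference is dropped rather than poll.

```c
#include <stdbool.h>

/* Hypothetical model of a provider's MMIO region and its users. */
struct toy_provider {
	int users;	/* outstanding usage references */
	bool dying;	/* set once removal has started */
};

/* A consumer takes a reference before it starts using the MMIO. */
static bool toy_provider_get(struct toy_provider *p)
{
	if (p->dying)
		return false;	/* no new users once removal has begun */
	p->users++;
	return true;
}

/* A consumer drops its reference when it has stopped using the MMIO. */
static void toy_provider_put(struct toy_provider *p)
{
	p->users--;
}

/*
 * Start (or retry) removal. Returns true only when every reference has
 * been dropped, which is the point at which remove() may complete.
 */
static bool toy_provider_try_remove(struct toy_provider *p)
{
	p->dying = true;
	return p->users == 0;
}
```

The point of the model is the ordering: new users are refused as soon as
removal starts, and removal cannot finish while any earlier user remains.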

At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.

Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of the
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.


Driver Writer's Guide
=====================

In a given P2P implementation there may be three or more different
types of kernel drivers in play:

* Provider - A driver which provides or publishes P2P resources like
  memory or doorbell registers to other drivers.
* Client - A driver which makes use of a resource by setting up a
  DMA transaction to or from it.
* Orchestrator - A driver which orchestrates the flow of data between
  clients and providers.

In many cases there could be overlap between these three types (e.g.,
it may be typical for a driver to be both a provider and a client).

For example, in the NVMe Target Copy Offload implementation:

* The NVMe PCI driver is a client, provider and orchestrator all at once
  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
  resource (provider), it accepts P2P memory pages as buffers in requests
  to be used directly (client) and it can also make use of the CMB for
  submission queue entries (orchestrator).
* The RDMA driver is a client in this arrangement so that an RNIC
  can DMA directly to the memory exposed by the NVMe device.
* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
  to the P2P memory (CMB) and then to the NVMe device (and vice versa).

This is currently the only arrangement supported by the kernel but
one could imagine slight tweaks to this that would allow for the same
functionality. For example, if a specific RNIC added a BAR with some
memory behind it, its driver could add support as a P2P provider and
then the NVMe Target could use the RNIC's memory instead of the CMB
in cases where the NVMe cards in use do not have CMB support.


Provider Drivers
----------------

A provider simply needs to register a BAR (or a portion of a BAR)
as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
This will register struct pages for all the specified memory.

After that it may optionally publish all of its resources as
P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
any orchestrator drivers to find and use the memory. When marked in
this way, the resource must be regular memory with no side effects.

For the time being this is fairly rudimentary in that all resources
are typically going to be P2P memory. Future work will likely expand
this to include other types of resources like doorbells.


Client Drivers
--------------

A client driver only has to use the mapping API :c:func:`dma_map_sg()`
and :c:func:`dma_unmap_sg()` functions as usual, and the implementation
will do the right thing for the P2P capable memory.


Orchestrator Drivers
--------------------

The first task an orchestrator driver must perform is to compile a list of
all client devices that will be involved in a given transaction. For
example, the NVMe Target driver creates a list including the namespace
block device and the RNIC in use. If the orchestrator has access to
a specific P2P provider to use it may check compatibility using
:c:func:`pci_p2pdma_distance()`; otherwise it may find a memory provider
that's compatible with all clients using :c:func:`pci_p2pmem_find()`.
If more than one provider is supported, the one nearest to all the clients will
be chosen first. If more than one provider is an equal distance away, the
one returned will be chosen at random (it is not arbitrary but
truly random). :c:func:`pci_p2pmem_find()` returns the PCI device to use for
the provider with a reference taken and therefore when it's no longer needed
it should be returned with pci_dev_put().
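The selection rule described above (nearest provider wins, ties are broken at
random) can be sketched as a standalone userspace function. This is an
illustrative model only: ``toy_pick_provider()`` is a hypothetical name, the
distances are assumed to have been computed already, and ``rand()`` with
reservoir sampling stands in for whatever random source and tie-breaking
scheme the kernel actually uses.

```c
#include <stdlib.h>

/*
 * Pick the index of the closest usable provider.
 *
 * @dist[i] is the total distance from provider i to all clients, or a
 * negative value when provider i cannot reach every client.
 * Returns -1 when no provider is usable.
 */
static int toy_pick_provider(const int *dist, int n)
{
	int best = -1, ties = 0;

	for (int i = 0; i < n; i++) {
		if (dist[i] < 0)
			continue;	/* incompatible with some client */

		if (best < 0 || dist[i] < dist[best]) {
			/* Strictly closer: start a new group of candidates. */
			best = i;
			ties = 1;
		} else if (dist[i] == dist[best]) {
			/*
			 * Equally close: keep the newcomer with probability
			 * 1/ties so each tied provider is about equally likely.
			 */
			ties++;
			if (rand() % ties == 0)
				best = i;
		}
	}
	return best;
}
```

A caller of the real API would additionally have to drop the device reference
with pci_dev_put() once the chosen provider is no longer needed; the model has
no equivalent because it only returns an index.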

Once a provider is selected, the orchestrator can then use
:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
allocating scatter-gather lists with P2P memory.

Struct Page Caveats
-------------------

While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related functions will not return them unless
FOLL_PCI_P2PDMA is set.

The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.
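As a rough illustration of "accessor helpers instead of memcpy", the following
userspace sketch copies data out of an MMIO-like region one fixed-width access
at a time. ``toy_readl()`` and ``toy_copy_from_mmio()`` are hypothetical
stand-ins written for this example; kernel code would use the real
readX()/writeX() family on an ``__iomem`` pointer, which this plain C model
cannot express.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit MMIO read accessor. */
static uint32_t toy_readl(const volatile uint32_t *addr)
{
	/* volatile forces exactly one 32-bit load per call */
	return *addr;
}

/*
 * Copy @words 32-bit words out of an MMIO-like region. Every word is
 * fetched through the accessor, so the access width and count are
 * explicit instead of being left to a generic memcpy().
 */
static void toy_copy_from_mmio(uint32_t *dst,
			       const volatile uint32_t *src,
			       size_t words)
{
	for (size_t i = 0; i < words; i++)
		dst[i] = toy_readl(&src[i]);
}
```

The model only shows the shape of such code; it does not reproduce the
ordering or architecture-specific guarantees that the kernel helpers provide.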


Usage With DMABUF
=================

DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.

Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.

In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.

Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.

No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2p_provider.
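The invalidation ordering described above can be modelled in a few lines of
userspace C. Everything in this sketch is hypothetical (``struct toy_exporter``,
``struct toy_importer`` and the ``toy_*`` functions are not kernel APIs); it
only demonstrates the invariant that every importer is unmapped before the
provider goes away, assuming a fixed small number of importers and no
concurrency.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_IMPORTERS 4

/* Hypothetical importer holding a DMA mapping of the exporter's MMIO. */
struct toy_importer {
	bool mapped;
};

/* Hypothetical exporter tracking who has mapped its MMIO. */
struct toy_exporter {
	struct toy_importer *importers[TOY_MAX_IMPORTERS];
	size_t nr_importers;
	bool provider_alive;
};

/* Importer side of the invalidation: unmap before returning. */
static void toy_move_notify(struct toy_importer *imp)
{
	imp->mapped = false;
}

/* Map the MMIO for @imp; refused once the provider has been destroyed. */
static bool toy_exporter_attach(struct toy_exporter *exp,
				struct toy_importer *imp)
{
	if (!exp->provider_alive || exp->nr_importers == TOY_MAX_IMPORTERS)
		return false;
	exp->importers[exp->nr_importers++] = imp;
	imp->mapped = true;
	return true;
}

/* Exporter remove(): invalidate every importer, then drop the provider. */
static void toy_exporter_remove(struct toy_exporter *exp)
{
	for (size_t i = 0; i < exp->nr_importers; i++)
		toy_move_notify(exp->importers[i]);
	exp->nr_importers = 0;
	exp->provider_alive = false;
}
```

After ``toy_exporter_remove()`` returns, no importer in the model is still
mapped and no new mapping can be created, which mirrors the rule stated above.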


P2P DMA Support Library
=======================

.. kernel-doc:: drivers/pci/p2pdma.c
   :export: