.. SPDX-License-Identifier: GPL-2.0

===========================================
Shared Virtual Addressing (SVA) with ENQCMD
===========================================

Background
==========

Shared Virtual Addressing (SVA) allows the processor and device to use the
same virtual addresses, avoiding the need for software to translate virtual
addresses to physical addresses. SVA is what PCIe calls Shared Virtual
Memory (SVM).

In addition to the convenience of letting the device use application virtual
addresses, SVA does not require pinning pages for DMA.
PCIe Address Translation Services (ATS) along with Page Request Interface
(PRI) allow devices to function much the same way as the CPU does when
handling application page faults. For more information please refer to the
PCIe specification Chapter 10: ATS Specification.

Use of SVA requires IOMMU support in the platform. The IOMMU is also
required to support the PCIe features ATS and PRI. ATS allows devices
to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use the PRI in order to request the virtual address to be paged into the
CPU page tables. The device must use ATS again in order to fetch the
translation before use.
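
The ATS and PRI flow described above can be modeled in a few lines of
user-space C. This is only a toy sketch: ``struct toy_pgtable``,
``ats_translate()``, ``pri_page_request()`` and ``device_translate()`` are
hypothetical names invented for illustration, and the real exchange happens
in hardware between the device and the IOMMU, not in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NPAGES     8    /* size of the toy address space, in pages */
#define TOY_PAGE_SHIFT 12   /* 4 KiB pages */

/* Toy stand-in for the CPU page tables of one process. */
struct toy_pgtable {
    bool present[TOY_NPAGES];
    uint64_t pfn[TOY_NPAGES];
    unsigned int page_requests;   /* number of PRI requests serviced */
};

/* ATS translation request: succeeds only if the page is present. */
static bool ats_translate(const struct toy_pgtable *pt, uint64_t va,
                          uint64_t *pa)
{
    uint64_t vpn = va >> TOY_PAGE_SHIFT;
    uint64_t offset = va & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (vpn >= TOY_NPAGES || !pt->present[vpn])
        return false;
    *pa = (pt->pfn[vpn] << TOY_PAGE_SHIFT) | offset;
    return true;
}

/* PRI page request: the OS pages the address in, as for a CPU fault. */
static bool pri_page_request(struct toy_pgtable *pt, uint64_t va)
{
    uint64_t vpn = va >> TOY_PAGE_SHIFT;

    if (vpn >= TOY_NPAGES)
        return false;              /* invalid address: request rejected */
    pt->present[vpn] = true;
    pt->pfn[vpn] = 0x100 + vpn;    /* arbitrary frame number for the demo */
    pt->page_requests++;
    return true;
}

/* Device-side flow: ATS, then PRI on a miss, then ATS again before use. */
static bool device_translate(struct toy_pgtable *pt, uint64_t va,
                             uint64_t *pa)
{
    assert(pt != NULL && pa != NULL);
    if (ats_translate(pt, va, pa))
        return true;
    if (!pri_page_request(pt, va))
        return false;
    return ats_translate(pt, va, pa);
}
```

In this model the first access to an unmapped page costs one page request,
and later accesses to the same page are served by the translation alone.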

Shared Hardware Workqueues
==========================

Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits
the use of Shared Work Queues (SWQ) by both applications and Virtual
Machines (VMs). This allows better hardware utilization than hard
partitioning of resources, which can result in underutilization. In order to
allow the hardware to distinguish the context for which work is being
executed by the SWQ interface, SIOV uses Process Address Space
ID (PASID), which is a 20-bit number defined by the PCIe SIG.

The PASID value is encoded in all transactions from the device. This allows
the IOMMU to track I/O on a per-PASID granularity in addition to using the
PCIe Requester ID (RID), which is the Bus/Device/Function.
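
As a rough illustration of the two identifiers, the sketch below packs a
conventional Bus/Device/Function triple into a 16-bit RID and pairs it with
a 20-bit PASID. ``struct dma_tag`` and the helper names are hypothetical;
the bit layout assumes the usual 8-bit bus, 5-bit device and 3-bit function
split.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MAX  ((1u << PASID_BITS) - 1)

/* Hypothetical tag carried by a DMA request in this model. */
struct dma_tag {
    uint16_t rid;     /* bus[15:8] | device[7:3] | function[2:0] */
    uint32_t pasid;   /* 20-bit PASID */
};

/* Pack Bus/Device/Function into a 16-bit RID. */
static uint16_t make_rid(unsigned int bus, unsigned int dev, unsigned int fn)
{
    assert(bus < 256 && dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

static struct dma_tag make_tag(uint16_t rid, uint32_t pasid)
{
    struct dma_tag tag;

    assert(pasid <= PASID_MAX);
    tag.rid = rid;
    tag.pasid = pasid;
    return tag;
}

/* Two requests belong to the same context only if RID and PASID match. */
static int same_context(struct dma_tag a, struct dma_tag b)
{
    return a.rid == b.rid && a.pasid == b.pasid;
}
```

Two requests from the same function but with different PASIDs compare as
different contexts, which is the per-PASID granularity described above.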


ENQCMD
======

ENQCMD is a new instruction on Intel platforms that atomically submits a
work descriptor to a device. The descriptor includes the operation to be
performed, virtual addresses of all parameters, virtual address of a completion
record, and the PASID (process address space ID) of the current process.

ENQCMD works with non-posted semantics and returns a status indicating
whether the command was accepted by the hardware. This tells the submitter
whether the submission needs to be retried, or whether other device-specific
mechanisms should be used to implement fairness or ensure forward progress.
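
A submitter typically wraps the instruction in a bounded retry loop. The
sketch below shows only that control flow, under stated assumptions: the
``submit_fn`` hook, ``submit_with_retry()`` and the ``fake_portal`` device
are hypothetical, and the hook returns 0 for "accepted" and nonzero for
"retry" to mimic the status ENQCMD reports. No real instruction is executed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical submit hook: 0 means accepted, nonzero means "retry".
 * On real hardware this would wrap ENQCMD on an mmap()'d portal.
 */
typedef int (*submit_fn)(void *portal, const void *desc);

/* Returns the number of attempts used, or -1 if never accepted. */
static int submit_with_retry(submit_fn submit, void *portal,
                             const void *desc, int max_tries)
{
    assert(submit != NULL && max_tries > 0);
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit(portal, desc) == 0)
            return attempt;
        /* A real caller might pause or back off here. */
    }
    return -1;
}

/* Fake device for the demo: rejects the first `busy` submissions. */
struct fake_portal {
    int busy;
    int accepted;
};

static int fake_submit(void *portal, const void *desc)
{
    struct fake_portal *p = portal;

    (void)desc;
    if (p->busy > 0) {
        p->busy--;
        return 1;    /* queue full: ask the caller to retry */
    }
    p->accepted++;
    return 0;
}
```

Bounding the number of attempts keeps a busy queue from stalling the
caller indefinitely; what to do after the bound is reached is a
device-specific policy choice.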

ENQCMD is the glue that ensures applications can directly submit commands
to the hardware and also permits hardware to be aware of application context
to perform I/O operations via use of PASID.

Process Address Space Tagging
=============================

A new thread-scoped MSR (IA32_PASID) provides the connection between
user processes and the rest of the hardware. When an application first
accesses an SVA-capable device, this MSR is initialized with a newly
allocated PASID. The driver for the device calls an IOMMU-specific API
that sets up the routing for DMA and page-requests.

For example, the Intel Data Streaming Accelerator (DSA) uses
iommu_sva_bind_device(), which will do the following:

- Allocate the PASID, and program the process page-table (%cr3 register) in the
  PASID context entries.
- Register for mmu_notifier() to track any page-table invalidations to keep
  the device TLB in sync. For example, when a page-table entry is invalidated,
  the IOMMU propagates the invalidation to the device TLB. This will force any
  future access by the device to this virtual address to participate in
  ATS. If the IOMMU responds that a page is not present, the device would
  request the page to be paged in via the PCIe PRI protocol before
  performing I/O.

This MSR is managed with the XSAVE feature set as "supervisor state" to
ensure the MSR is updated during context switch.

PASID Management
================

The kernel must allocate a PASID on behalf of each process which will use
ENQCMD and program it into the new MSR to communicate the process identity to
platform hardware.  ENQCMD uses the PASID stored in this MSR to tag requests
from this process.  When a user submits a work descriptor to a device using the
ENQCMD instruction, the PASID field in the descriptor is auto-filled with the
value from the IA32_PASID MSR. Requests for DMA from the device are also tagged
with the same PASID. The platform IOMMU uses the PASID in the transaction to
perform address translation. The IOMMU APIs set up the corresponding PASID
entry in the IOMMU with the process address space used by the CPU (e.g. the
%cr3 register on x86).

The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.
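
The value programmed into the MSR can be sketched as a PASID plus a valid
flag. The layout used here (PASID in bits 19:0, valid flag in bit 31) is an
assumption taken from the architectural description of IA32_PASID as far as
I know it, and the helper names are hypothetical; check the instruction set
reference before relying on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_MASK       0xFFFFFu       /* assumed: PASID in bits 19:0 */
#define PASID_VALID_BIT  (1ULL << 31)   /* assumed: valid flag in bit 31 */

/* Build the MSR value for a process-wide PASID. */
static uint64_t pasid_msr_encode(uint32_t pasid)
{
    assert(pasid <= PASID_MASK);
    return (uint64_t)pasid | PASID_VALID_BIT;
}

static bool pasid_msr_valid(uint64_t msr)
{
    return (msr & PASID_VALID_BIT) != 0;
}

static uint32_t pasid_msr_pasid(uint64_t msr)
{
    return (uint32_t)(msr & PASID_MASK);
}
```

Because all threads of a process share one PASID, every thread that has
the MSR loaded holds the same encoded value.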

PASID Life Cycle Management
===========================

PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created.

Only processes that access SVA-capable devices need to have a PASID
allocated. This allocation happens when a process opens/binds an SVA-capable
device but finds no PASID for this process. Subsequent binds of the same or
other devices share the same PASID.
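
The allocate-on-first-bind rule can be modeled as follows. This is a
user-space sketch, not kernel code: ``struct proc_mm``, ``proc_mm_init()``
and ``bind_device()`` are hypothetical, ``PASID_UNSET`` stands in for
IOMMU_PASID_INVALID, and the counter is a deliberately trivial allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PASID_UNSET UINT32_MAX   /* stands in for IOMMU_PASID_INVALID (-1) */

/* Hypothetical per-process state; one PASID per address space. */
struct proc_mm {
    uint32_t pasid;
};

static uint32_t next_pasid = 1;  /* trivial allocator for the demo */

/* Process creation: no PASID yet. */
static void proc_mm_init(struct proc_mm *mm)
{
    assert(mm != NULL);
    mm->pasid = PASID_UNSET;
}

/* Bind: allocate on the first call, reuse the PASID afterwards. */
static uint32_t bind_device(struct proc_mm *mm)
{
    assert(mm != NULL);
    if (mm->pasid == PASID_UNSET)
        mm->pasid = next_pasid++;
    return mm->pasid;
}
```

Binding twice from the same process returns the same PASID, while a
second process receives a different one.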

Although the PASID is allocated to the process by opening a device,
it is not active in any of the threads of that process. It is loaded into
the IA32_PASID MSR lazily when a thread tries to submit a work descriptor
to a device using ENQCMD.

That first access will trigger a #GP fault because the IA32_PASID MSR
has not been initialized with the PASID value assigned to the process
when the device was opened. The Linux #GP handler notes that a PASID has
been allocated for the process, and so initializes the IA32_PASID MSR
and returns so that the ENQCMD instruction is re-executed.
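
The lazy loading described above can be sketched as a small state machine.
Everything here is a hypothetical user-space model (``struct model_mm``,
``struct model_thread``, ``fixup_enqcmd_gp()``, ``model_enqcmd()``); it only
mirrors the decision the text attributes to the #GP handler and assumes the
valid flag lives in bit 31 of the MSR value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_PASID   UINT32_MAX     /* process has no PASID allocated */
#define MSR_VALID  (1ULL << 31)   /* assumed position of the valid flag */

struct model_mm {
    uint32_t pasid;               /* process-wide PASID, or NO_PASID */
};

struct model_thread {
    struct model_mm *mm;
    uint64_t pasid_msr;           /* per-thread copy of the MSR, 0 = cleared */
};

/* Model of the #GP fixup: true means "MSR loaded, re-execute ENQCMD". */
static bool fixup_enqcmd_gp(struct model_thread *t)
{
    assert(t != NULL && t->mm != NULL);
    if (t->mm->pasid == NO_PASID)
        return false;             /* no PASID allocated: genuine fault */
    if (t->pasid_msr & MSR_VALID)
        return false;             /* already loaded: #GP has another cause */
    t->pasid_msr = t->mm->pasid | MSR_VALID;
    return true;
}

/*
 * Model of a thread executing ENQCMD. Returns the number of faults taken
 * before the instruction could proceed, or -1 if the fault is fatal.
 */
static int model_enqcmd(struct model_thread *t)
{
    int faults = 0;

    while (!(t->pasid_msr & MSR_VALID)) {
        faults++;
        if (!fixup_enqcmd_gp(t))
            return -1;
    }
    return faults;
}
```

In this model each thread takes exactly one fault on its first submission
and none afterwards, and a thread of a process without a PASID gets a
genuine fault.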

On fork(2) or exec(2) the PASID is removed from the process, as it no
longer has the same address space that it had when the device was opened.

On clone(2) the new task shares the same address space, so it will be
able to use the PASID allocated to the process. The IA32_PASID MSR is not
preemptively initialized, because the PASID value might not be allocated
yet, the kernel does not know whether this thread is going to access the
device, and a cleared IA32_PASID MSR reduces context switch overhead through
the xstate init optimization. Since #GP faults have to be handled on any
threads that were created before the PASID was assigned to the mm of the
process, newly created threads might as well be treated in a consistent way.

Because freeing the PASID and clearing the IA32_PASID MSR in every thread
on unbind is complex, the PASID is freed lazily, only on mm exit.

If a process does a close(2) of the device file descriptor and munmap(2)
of the device MMIO portal, then the driver will unbind the device. The
PASID is still marked VALID in the IA32_PASID MSR for any threads in the
process that accessed the device. But this is harmless, as without the
MMIO portal they cannot submit new work to the device.

Relationships
=============

 * Each process has many threads, but only one PASID.
 * Devices have a limited number (tens to thousands) of hardware workqueues.
   The device driver manages allocating hardware workqueues.
 * A single mmap() maps a single hardware workqueue as a "portal" and
   each portal maps down to a single workqueue.
 * For each device with which a process interacts, there must be
   one or more mmap()'d portals.
 * Many threads within a process can share a single portal to access
   a single device.
 * Multiple processes can separately mmap() the same portal, in
   which case they still share one device hardware workqueue.
 * The single process-wide PASID is used by all threads to interact
   with all devices.  There is not, for instance, a PASID for each
   thread or each thread<->device pair.

FAQ
===

* What is SVA/SVM?

Shared Virtual Addressing (SVA) permits I/O hardware and the processor to
work in the same address space, i.e., to share it. Some call it Shared
Virtual Memory (SVM), but the Linux community wanted to avoid confusing it
with POSIX Shared Memory and Secure Virtual Machines, which were terms
already in circulation.

* What is a PASID?

A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet
(TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS.
PASID is included in all transactions between the platform and the device.

* How are shared workqueues different?

Traditionally, in order for userspace applications to interact with hardware,
a separate hardware instance is required per process. For example,
consider doorbells as a mechanism of informing hardware about work to process.
Each doorbell is required to be spaced 4k (or page-size) apart for process
isolation. This requires hardware to provision that space and reserve it in
MMIO. This doesn't scale as the number of threads becomes quite large. With
Shared Work Queues (SWQ), the hardware also manages the queue depth, and
consumers don't need to track it. If there is no space to accept
a command, the device will return an error indicating retry.

A user should check the Deferrable Memory Write (DMWr) capability on the
device and submit ENQCMD only when the device supports it. In the new DMWr
PCIe terminology, devices need to support DMWr completer capability. In
addition, all switch ports must support DMWr routing, and it must be enabled
by the PCIe subsystem, much like how PCIe atomic operations are managed for
instance.

SWQ allows hardware to provision just a single address in the device. When
used with ENQCMD to submit work, the device can distinguish the process
submitting the work since it will include the PASID assigned to that
process. This helps the device scale to a large number of processes.
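
A minimal model of the device side of a shared workqueue follows. It is a
sketch under stated assumptions: ``struct swq``, ``swq_enqueue()`` and
``swq_pending_for()`` are hypothetical, the depth of 4 is arbitrary, and a
return value of 1 stands for the "retry" indication mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SWQ_DEPTH 4   /* arbitrary depth for the demo */

struct swq_entry {
    uint32_t pasid;    /* identifies the submitting process */
    uint32_t opcode;   /* placeholder for the descriptor contents */
};

/* One shared queue behind a single portal address. */
struct swq {
    struct swq_entry slots[SWQ_DEPTH];
    unsigned int count;
};

/* Device side of a submission: 0 = accepted, 1 = full, caller retries. */
static int swq_enqueue(struct swq *q, uint32_t pasid, uint32_t opcode)
{
    assert(q != NULL);
    if (q->count == SWQ_DEPTH)
        return 1;
    q->slots[q->count].pasid = pasid;
    q->slots[q->count].opcode = opcode;
    q->count++;
    return 0;
}

/* Count queued entries that belong to one PASID. */
static unsigned int swq_pending_for(const struct swq *q, uint32_t pasid)
{
    unsigned int n = 0;

    assert(q != NULL);
    for (unsigned int i = 0; i < q->count; i++)
        if (q->slots[i].pasid == pasid)
            n++;
    return n;
}
```

Submitters never track the queue depth themselves; they only react to the
status returned by each submission, and the PASID stored with each entry
keeps work from different processes apart.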

* Is this the same as a user space device driver?

Communicating with the device via the shared workqueue is much simpler
than a full-blown user space driver. The kernel driver does all the
initialization of the hardware. User space only needs to worry about
submitting work and processing completions.

* Is this the same as SR-IOV?

Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, each must present an
almost fully functional interface to software, supporting the traditional
BARs, space for interrupts via MSI-X, and its own register layout.
Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.

Scalable I/O Virtualization builds on the PASID concept to create device
instances for virtualization. SIOV requires host software to assist in
creating virtual devices; each virtual device is represented by a PASID
along with the bus/device/function of the device.  This allows device
hardware to optimize device resource creation and lets resources grow
dynamically on demand. SR-IOV creation and management is very static in
nature. Consult references below for more details.

* Why not just create a virtual function for each app?

Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require
duplicated hardware for PCI config space and interrupts such as MSI-X.
Resources such as interrupts have to be hard partitioned between VFs at
creation time, and cannot scale dynamically on demand. The VFs are not
completely independent from the Physical Function (PF). Most VFs require
some communication and assistance from the PF driver. SIOV, in contrast,
creates a software-defined device where all the configuration and control
aspects are mediated via the slow path. The work submission and completion
happen without any mediation.

* Does this support virtualization?

ENQCMD can be used from within a guest VM. In these cases, the VMM helps
with setting up a translation table to translate from Guest PASID to Host
PASID. Please consult the ENQCMD instruction set reference for more
details.
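
Conceptually, the translation is a lookup keyed by the guest PASID. The
flat table below is purely illustrative: ``struct pasid_xlate`` and its
helpers are hypothetical, the table size is a toy value, and the actual
format of the hardware structure is defined in the instruction set
reference, not here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUEST_PASIDS 16   /* toy table size; real PASIDs are 20 bits wide */

/* Hypothetical flat guest-to-host PASID table managed by the VMM. */
struct pasid_xlate {
    bool     valid[GUEST_PASIDS];
    uint32_t host[GUEST_PASIDS];
};

/* VMM side: install a mapping for one guest PASID. */
static void xlate_set(struct pasid_xlate *t, uint32_t guest, uint32_t host)
{
    assert(t != NULL && guest < GUEST_PASIDS);
    t->valid[guest] = true;
    t->host[guest] = host;
}

/* Lookup done on behalf of a guest submission; false if unmapped. */
static bool xlate_lookup(const struct pasid_xlate *t, uint32_t guest,
                         uint32_t *host)
{
    assert(t != NULL && host != NULL);
    if (guest >= GUEST_PASIDS || !t->valid[guest])
        return false;
    *host = t->host[guest];
    return true;
}
```

A guest PASID without an installed mapping fails the lookup, so a guest
cannot tag work with a host PASID the VMM did not assign to it.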

* Does memory need to be pinned?

When devices support SVA along with platform hardware such as IOMMU
supporting such devices, there is no need to pin memory for DMA purposes.
Devices that support SVA also support other PCIe features that remove the
pinning requirement for memory.

Device TLB support - Device requests the IOMMU to lookup an address before
use via Address Translation Service (ATS) requests.  If the mapping exists
but there is no page allocated by the OS, the IOMMU hardware returns that no
mapping exists.

Device requests the virtual address to be mapped via Page Request
Interface (PRI). Once the OS has successfully completed the mapping, it
returns the response back to the device. The device then requests the
translation again and continues.

The IOMMU works with the OS in managing consistency of page-tables with the
device. When removing pages, it interacts with the device to remove any
device TLB entry that might have been cached before removing the mappings from
the OS.

References
==========

VT-D:
https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d

SIOV:
https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux

ENQCMD in ISE:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

DSA spec:
https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf