xref: /linux/drivers/gpu/drm/xe/xe_vm_doc.h (revision 36110669ddf832e6c9ceba4dd203749d5be31d31)
1 /* SPDX-License-Identifier: MIT */
2 /*
3  * Copyright © 2022 Intel Corporation
4  */
5 
6 #ifndef _XE_VM_DOC_H_
7 #define _XE_VM_DOC_H_
8 
9 /**
10  * DOC: XE VM (user address space)
11  *
12  * VM creation
13  * ===========
14  *
15  * Allocate a physical page for the root of the page table structure, create a
16  * default bind engine, and return a handle to the user.
17  *
18  * Scratch page
19  * ------------
20  *
21  * If the VM is created with the flag DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE, the
22  * entire page table structure defaults to pointing at a blank page allocated by
23  * the VM. Invalid memory accesses read / write this page rather than fault.
24  *
25  * VM bind (create GPU mapping for a BO or userptr)
26  * ================================================
27  *
28  * Creates GPU mappings for a BO or userptr within a VM. VM binds use the same
29  * in / out fence interface (struct drm_xe_sync) as execs which allows users to
30  * think of binds and execs as more or less the same operation.
31  *
32  * Operations
33  * ----------
34  *
35  * DRM_XE_VM_BIND_OP_MAP		- Create mapping for a BO
36  * DRM_XE_VM_BIND_OP_UNMAP		- Destroy mapping for a BO / userptr
37  * DRM_XE_VM_BIND_OP_MAP_USERPTR	- Create mapping for userptr
38  *
39  * Implementation details
40  * ~~~~~~~~~~~~~~~~~~~~~~
41  *
42  * All bind operations are implemented via a hybrid approach of using the CPU
43  * and GPU to modify page tables. If a new physical page is allocated in the
44  * page table structure we populate that page via the CPU and insert that new
45  * page into the existing page table structure via a GPU job. Also any existing
46  * pages in the page table structure that need to be modified also are updated
47  * via the GPU job. As the root physical page is prealloced on VM creation our
48  * GPU job will always have at least 1 update. The in / out fences are passed to
49  * this job so again this is conceptually the same as an exec.
50  *
51  * A very simple example of a few binds on an empty VM with 48 bits of address
52  * space and the resulting operations:
53  *
54  * .. code-block::
55  *
56  *	bind BO0 0x0-0x1000
57  *	alloc page level 3a, program PTE[0] to BO0 phys address (CPU)
58  *	alloc page level 2, program PDE[0] page level 3a phys address (CPU)
59  *	alloc page level 1, program PDE[0] page level 2 phys address (CPU)
60  *	update root PDE[0] to page level 1 phys address (GPU)
61  *
62  *	bind BO1 0x201000-0x202000
63  *	alloc page level 3b, program PTE[1] to BO1 phys address (CPU)
64  *	update page level 2 PDE[1] to page level 3b phys address (GPU)
65  *
66  *	bind BO2 0x1ff000-0x201000
67  *	update page level 3a PTE[511] to BO2 phys address (GPU)
68  *	update page level 3b PTE[0] to BO2 phys address + 0x1000 (GPU)
69  *
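 * As a rough illustration of where the entry indices in the example above come
 * from, the sketch below computes the index used at each level for a virtual
 * address. It assumes 4K pages and 512 entries (9 address bits) per level, with
 * levels numbered 0 (root) to 3 (leaf). pt_index() and the PT_* names are made
 * up for this example and are not driver functions.
 *
```c
#include <assert.h>
#include <stdint.h>

// Assumed layout (not stated in the text): 4K pages, 9 index bits per level,
// levels numbered 0 (root) to 3 (leaf), as in the 48-bit example.
#define PT_PAGE_SHIFT	12
#define PT_LEVEL_BITS	9
#define PT_LEAF_LEVEL	3

// Index of the entry that translates va within a table at the given level.
static unsigned int pt_index(uint64_t va, int level)
{
	assert(level >= 0 && level <= PT_LEAF_LEVEL);

	unsigned int shift = PT_PAGE_SHIFT +
			     PT_LEVEL_BITS * (PT_LEAF_LEVEL - level);

	return (unsigned int)((va >> shift) & ((1u << PT_LEVEL_BITS) - 1));
}
```
 *
 * Under these assumptions pt_index(0x201000, 3) and pt_index(0x201000, 2) are
 * both 1, matching PTE[1] and PDE[1] in the BO1 bind, and pt_index(0x1ff000, 3)
 * is 511, matching the first update of the BO2 bind.
 *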
70  * GPU bypass
71  * ~~~~~~~~~~
72  *
73  * In the above example the steps using the GPU can be converted to CPU if the
74  * bind can be done immediately (all in-fences satisfied, VM dma-resv kernel
75  * slot is idle).
76  *
77  * Address space
78  * -------------
79  *
80  * Depending on platform either 48 or 57 bits of address space is supported.
81  *
82  * Page sizes
83  * ----------
84  *
85  * The minimum page size is either 4k or 64k depending on platform and memory
86  * placement (sysmem vs. VRAM). We enforce that binds must be aligned to the
87  * minimum page size.
88  *
89  * Larger pages (2M or 1GB) can be used for BOs in VRAM when both the BO
90  * physical address and the VA are aligned to the larger page size. Larger
91  * pages for userptrs / BOs in sysmem should be possible but are not yet
92  * implemented.
93  *
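 * The alignment rules above can be pictured with the sketch below. It is a
 * simplified model, not driver code: pick_page_size() is a hypothetical helper,
 * the minimum page size is passed in by the caller, and only the start of the
 * range is considered when choosing a large page.
 *
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M	(2ULL << 20)
#define SZ_1G	(1ULL << 30)

// Returns the largest page size usable for a mapping of len bytes at va backed
// by physical address pa, or 0 if the bind is not aligned to min_size.
static uint64_t pick_page_size(uint64_t va, uint64_t pa, uint64_t len,
			       uint64_t min_size, bool in_vram)
{
	const uint64_t large[] = { SZ_1G, SZ_2M };

	// min_size must be a power of two (4K or 64K in the text above)
	assert(min_size && !(min_size & (min_size - 1)));

	if (len == 0 || ((va | len) & (min_size - 1)))
		return 0;	// binds must be aligned to the minimum page size

	// Large pages only for VRAM, and only if both VA and PA are aligned.
	for (int i = 0; in_vram && i < 2; i++)
		if (!((va | pa) & (large[i] - 1)) && len >= large[i])
			return large[i];

	return min_size;
}
```
 *
 * For example, a 2M bind at VA 0x200000 backed by VRAM at 0x40200000 yields
 * SZ_2M, while the same bind backed by sysmem falls back to min_size.
 *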
94  * Sync error handling mode
95  * ------------------------
96  *
97  * In both modes during the bind IOCTL the user input is validated. In sync
98  * error handling mode the newly bound BO is validated (potentially moved back
99  * to a region of memory where it can be used), page tables are updated by the
100  * CPU and the job to do the GPU binds is created in the IOCTL itself. This step
101  * can fail due to memory pressure. The user can recover by freeing memory and
102  * trying this operation again.
103  *
104  * Async error handling mode
105  * -------------------------
106  *
107  * In async error handling mode the steps of validating the BO, updating page
108  * tables, and generating a job are deferred to an async worker. As these steps
109  * can now fail after the IOCTL has reported success we need an error handling
110  * flow from which the user can recover.
111  *
112  * The solution is for a user to register a user address with the VM which the
113  * VM uses to report errors. The ufence wait interface can be used to wait on
114  * a VM going into an error state. Once an error is reported the VM's async
115  * worker is paused. While the VM's async worker is paused, sync
116  * DRM_XE_VM_BIND_OP_UNMAP operations are allowed (this can free memory). Once the
117  * user believes the error state is fixed, the async worker can be resumed via
118  * the XE_VM_BIND_OP_RESTART operation. When VM async bind work is restarted, the
119  * first operation processed is the operation that caused the original error.
120  *
121  * Bind queues / engines
122  * ---------------------
123  *
124  * Think of the case where we have two bind operations, A + B, submitted in
125  * that order. A has in fences while B has none. If using a single bind
126  * queue, B is now blocked on A's in fences even though it is ready to run. This
127  * example is a real use case for VK sparse binding. We work around this
128  * limitation by implementing bind engines.
129  *
130  * In the bind IOCTL the user can optionally pass in an engine ID which must map
131  * to an engine which is of the special class DRM_XE_ENGINE_CLASS_VM_BIND.
132  * Underneath, this is really a virtual engine that can run on any of the copy
133  * hardware engines. The job(s) created by each IOCTL are inserted into this
134  * engine's ring. In the example above if A and B have different bind engines B
135  * is free to pass A. If the engine ID field is omitted, the default bind queue
136  * for the VM is used.
137  *
138  * TODO: Explain race in issue 41 and how we solve it
139  *
140  * Array of bind operations
141  * ------------------------
142  *
143  * The uAPI allows multiple bind operations to be passed in via a user array,
144  * of struct drm_xe_vm_bind_op, in a single VM bind IOCTL. This interface
145  * matches the VK sparse binding API. The implementation is rather simple, parse
146  * the array into a list of operations, pass the in fences to the first operation,
147  * and pass the out fences to the last operation. The ordered nature of a bind
148  * engine makes this possible.
149  *
150  * Munmap semantics for unbinds
151  * ----------------------------
152  *
153  * Munmap allows things like:
154  *
155  * .. code-block::
156  *
157  *	0x0000-0x2000 and 0x3000-0x5000 have mappings
158  *	Munmap 0x1000-0x4000, results in mappings 0x0000-0x1000 and 0x4000-0x5000
159  *
160  * To support this semantic in the above example we decompose the above example
161  * into 4 operations:
162  *
163  * .. code-block::
164  *
165  *	unbind 0x0000-0x2000
166  *	unbind 0x3000-0x5000
167  *	rebind 0x0000-0x1000
168  *	rebind 0x4000-0x5000
169  *
170  * Why not just do a partial unbind of 0x1000-0x2000 and 0x3000-0x4000? This
171  * falls apart when using large pages at the edges and the unbind forces us to
172  * use a smaller page size. For simplicity we always issue a set of unbinds
173  * unmapping anything in the range and at most 2 rebinds on the edges.
174  *
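 * A minimal sketch of the decomposition described above, assuming the existing
 * mappings are given as a sorted, non-overlapping array of half-open ranges.
 * The types, MAX_UNBINDS, and plan_munmap() are hypothetical names used only for
 * illustration. Every mapping touching the unmapped range is unbound whole and
 * at most two edge pieces are rebound.
 *
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_UNBINDS	16	// arbitrary bound for this sketch

struct va_range { uint64_t start, end; };	// half-open: [start, end)

struct munmap_plan {
	struct va_range unbind[MAX_UNBINDS];
	size_t n_unbind;
	struct va_range rebind[2];		// at most one per edge
	size_t n_rebind;
};

// Returns 0 on success, -1 if more than MAX_UNBINDS mappings intersect um.
static int plan_munmap(const struct va_range *maps, size_t n_maps,
		       struct va_range um, struct munmap_plan *plan)
{
	assert(um.start < um.end);
	plan->n_unbind = plan->n_rebind = 0;

	for (size_t i = 0; i < n_maps; i++) {
		struct va_range m = maps[i];

		if (m.end <= um.start || m.start >= um.end)
			continue;	// no overlap, mapping is left alone
		if (plan->n_unbind == MAX_UNBINDS)
			return -1;
		plan->unbind[plan->n_unbind++] = m;	// unbind whole mapping

		if (m.start < um.start && plan->n_rebind < 2)	// left edge
			plan->rebind[plan->n_rebind++] =
				(struct va_range){ m.start, um.start };
		if (m.end > um.end && plan->n_rebind < 2)	// right edge
			plan->rebind[plan->n_rebind++] =
				(struct va_range){ um.end, m.end };
	}
	return 0;
}
```
 *
 * Fed the mappings 0x0000-0x2000 and 0x3000-0x5000 with an unmap of
 * 0x1000-0x4000, this produces two unbinds and the rebinds 0x0000-0x1000 and
 * 0x4000-0x5000, the same four operations listed above.
 *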
175  * Similar to an array of binds, in fences are passed to the first operation and
176  * out fences are signaled on the last operation.
177  *
178  * In this example there is a window of time where 0x0000-0x1000 and
179  * 0x4000-0x5000 are invalid but the user didn't ask for these addresses to be
180  * removed from the mapping. To work around this we treat any munmap style
181  * unbinds which require a rebind as kernel operations (BO eviction or userptr
182  * invalidation). The first operation waits on the VM's
183  * DMA_RESV_USAGE_PREEMPT_FENCE slots (waits for all pending jobs on VM to
184  * complete / triggers preempt fences) and the last operation is installed in
185  * the VM's DMA_RESV_USAGE_KERNEL slot (blocks future jobs / resume compute mode
186  * VM). The caveat is all dma-resv slots must be updated atomically with respect
187  * to execs and compute mode rebind worker. To accomplish this, hold the
188  * vm->lock in write mode from the first operation until the last.
189  *
190  * Deferred binds in fault mode
191  * ----------------------------
192  *
193  * If a VM is in fault mode (TODO: link to fault mode), new bind operations that
194  * create mappings are by default deferred to the page fault handler (first
195  * use). This behavior can be overridden by setting the flag
196  * DRM_XE_VM_BIND_FLAG_IMMEDIATE, which requests that the mapping be created
197  * immediately.
198  *
199  * User pointer
200  * ============
201  *
202  * User pointers are user allocated memory (malloc'd, mmap'd, etc.) for which the
203  * user wants to create a GPU mapping. Typically in other DRM drivers a dummy BO
204  * was created and then a binding was created. We bypass creating a dummy BO in
205  * XE and simply create a binding directly from the userptr.
206  *
207  * Invalidation
208  * ------------
209  *
210  * Since this is core kernel managed memory the kernel can move this memory
211  * whenever it wants. We register an invalidation MMU notifier to alert XE when
212  * a user pointer is about to move. The invalidation notifier needs to block
213  * until all pending users (jobs or compute mode engines) of the userptr are
214  * idle to ensure no faults. This is done by waiting on all VM dma-resv slots.
215  *
216  * Rebinds
217  * -------
218  *
219  * Either the next exec (non-compute) or rebind worker (compute mode) will
220  * rebind the userptr. The invalidation MMU notifier kicks the rebind worker
221  * after the VM dma-resv wait if the VM is in compute mode.
222  *
223  * Compute mode
224  * ============
225  *
226  * A VM in compute mode enables long running workloads and ultra low latency
227  * submission (ULLS). ULLS is implemented via a continuously running batch +
228  * semaphores. This enables the user to insert jump to new batch commands
229  * into the continuously running batch. In both cases these batches exceed the
230  * time a dma fence is allowed to exist for before signaling, as such dma fences
231  * are not used when a VM is in compute mode. User fences (TODO: link user fence
232  * doc) are used instead to signal an operation's completion.
233  *
234  * Preempt fences
235  * --------------
236  *
237  * If the kernel decides to move memory around (either userptr invalidate, BO
238  * eviction, or munmap style unbind which results in a rebind) and a batch is
239  * running on an engine, that batch can fault or cause a memory corruption as
240  * page tables for the moved memory are no longer valid. To work around this we
241  * introduce the concept of preempt fences. When sw signaling is enabled on a
242  * preempt fence it tells the submission backend to kick that engine off the
243  * hardware and the preempt fence signals when the engine is off the hardware.
244  * Once all preempt fences are signaled for a VM the kernel can safely move the
245  * memory and kick the rebind worker which resumes all the engines execution.
246  *
247  * A preempt fence, for every engine using the VM, is installed into the VM's
248  * dma-resv DMA_RESV_USAGE_PREEMPT_FENCE slot. The same preempt fence, for every
249  * engine using the VM, is also installed into the same dma-resv slot of every
250  * external BO mapped in the VM.
251  *
252  * Rebind worker
253  * -------------
254  *
255  * The rebind worker is very similar to an exec. It is responsible for rebinding
256  * evicted BOs or userptrs, waiting on those operations, installing new preempt
257  * fences, and finally resuming execution of engines in the VM.
258  *
259  * Flow
260  * ~~~~
261  *
262  * .. code-block::
263  *
264  *	<----------------------------------------------------------------------|
265  *	Check if VM is closed, if so bail out                                  |
266  *	Lock VM global lock in read mode                                       |
267  *	Pin userptrs (also finds userptr invalidated since last rebind worker) |
268  *	Lock VM dma-resv and external BOs dma-resv                             |
269  *	Validate BOs that have been evicted                                    |
270  *	Wait on and allocate new preempt fences for every engine using the VM  |
271  *	Rebind invalidated userptrs + evicted BOs                              |
272  *	Wait on last rebind fence                                              |
273  *	Wait VM's DMA_RESV_USAGE_KERNEL dma-resv slot                          |
274  *	Install preempt fences and issue resume for every engine using the VM  |
275  *	Check if any userptrs invalidated since pin                            |
276  *		Squash resume for all engines                                  |
277  *		Unlock all                                                     |
278  *		Wait all VM's dma-resv slots                                   |
279  *		Retry ----------------------------------------------------------
280  *	Release all engines waiting to resume
281  *	Unlock all
282  *
283  * Timeslicing
284  * -----------
285  *
286  * In order to prevent an engine from continuously being kicked off the hardware
287  * and making no forward progress, an engine has a period of time it is allowed
288  * to run after resume before it can be kicked off again. This effectively
289  * gives each engine a timeslice.
290  *
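 * One way to picture the timeslice, assuming a preemption request is simply
 * delayed until the guaranteed run time has elapsed. The struct, its fields, and
 * preempt_delay_ns() are hypothetical; the text above does not say how the
 * period is tracked.
 *
```c
#include <assert.h>
#include <stdint.h>

// Hypothetical per-engine state; times are nanoseconds on one monotonic clock.
struct engine_slice {
	uint64_t resumed_at_ns;	// when the engine was last resumed
	uint64_t min_run_ns;	// guaranteed run time after a resume
};

// Time the caller must still wait before kicking the engine off the hardware.
// 0 means the timeslice is used up and the preemption may proceed.
static uint64_t preempt_delay_ns(const struct engine_slice *e, uint64_t now_ns)
{
	assert(now_ns >= e->resumed_at_ns);

	uint64_t ran = now_ns - e->resumed_at_ns;

	return ran >= e->min_run_ns ? 0 : e->min_run_ns - ran;
}
```
 *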
291  * Handling multiple GTs
292  * =====================
293  *
294  * If a GT has slower access to some regions and the page table structure is in
295  * the slow region, the performance on that GT could adversely be affected. To
296  * work around this we allow a VM's page tables to be shadowed in multiple GTs.
297  * When a VM is created, a default bind engine and page table structure are
298  * created on each GT.
299  *
300  * Binds can optionally pass in a mask of GTs where a mapping should be created,
301  * if this mask is zero then default to all the GTs where the VM has page
302  * tables.
303  *
304  * The implementation for this breaks down into a bunch of for_each_gt loops in
305  * various places plus exporting a composite fence for multi-GT binds to the
306  * user.
307  *
308  * Fault mode (unified shared memory)
309  * ==================================
310  *
311  * A VM in fault mode can be enabled on devices that support page faults. If
312  * page faults are enabled, using dma fences can potentially induce a deadlock:
313  * A pending page fault can hold up the GPU work which holds up the dma fence
314  * signaling, and memory allocation is usually required to resolve a page
315  * fault, but memory allocation is not allowed to gate dma fence signaling. As
316  * such, dma fences are not allowed when a VM is in fault mode. Because dma
317  * fences are not allowed, only long running workloads and ULLS are enabled on a
318  * faulting VM.
319  *
320  * Deferred VM binds
321  * -----------------
322  *
323  * By default, on a faulting VM binds just allocate the VMA and the actual
324  * updating of the page tables is deferred to the page fault handler. This
325  * behavior can be overridden by setting the flag DRM_XE_VM_BIND_FLAG_IMMEDIATE in
326  * the VM bind which will then do the bind immediately.
327  *
328  * Page fault handler
329  * ------------------
330  *
331  * Page faults are received in the G2H worker under the CT lock which is in the
332  * path of dma fences (no memory allocations are allowed, faults require memory
333  * allocations) thus we cannot process faults under the CT lock. Another issue
334  * is faults issue TLB invalidations which require G2H credits and we cannot
335  * allocate G2H credits in the G2H handlers without deadlocking. Lastly, we do
336  * not want the CT lock to be an outer lock of the VM global lock (VM global
337  * lock is required for fault processing).
338  *
339  * To work around the above issue with processing faults in the G2H worker, we
340  * sink faults to a buffer which is large enough to sink all possible faults on
341  * the GT (1 per hardware engine) and kick a worker to process the faults. Since
342  * the page faults G2H are already received in a worker, kicking another worker
343  * adds more latency to a critical performance path. We add a fast path in the
344  * G2H irq handler which looks at the first G2H and if it is a page fault we
345  * sink the fault to the buffer and kick the worker to process the fault. TLB
346  * invalidation responses are also in the critical path so these can also be
347  * processed in this fast path.
348  *
349  * Multiple buffers and workers are used and hashed over based on the ASID so
350  * faults from different VMs can be processed in parallel.
351  *
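 * The text only says the buffers are hashed over based on the ASID. A minimal
 * sketch, assuming the hash is a plain modulo over a fixed number of queues
 * (NUM_FAULT_QUEUES and fault_queue_for_asid() are made-up names):
 *
```c
#include <assert.h>
#include <stdint.h>

#define NUM_FAULT_QUEUES	4	// assumed count, for illustration only

static_assert(NUM_FAULT_QUEUES > 0, "need at least one fault queue");

// Faults from one VM (one ASID) always land in the same queue, so they are
// handled by one worker; different ASIDs can spread across the queues.
static unsigned int fault_queue_for_asid(uint32_t asid)
{
	return asid % NUM_FAULT_QUEUES;
}
```
 *
 * Two VMs whose ASIDs map to different queues can then have their faults
 * serviced by different workers at the same time.
 *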
352  * The page fault handler itself is rather simple, flow is below.
353  *
354  * .. code-block::
355  *
356  *	Lookup VM from ASID in page fault G2H
357  *	Lock VM global lock in read mode
358  *	Lookup VMA from address in page fault G2H
359  *	Check if VMA is valid, if not bail
360  *	Check if VMA's BO has backing store, if not allocate
361  *	<----------------------------------------------------------------------|
362  *	If userptr, pin pages                                                  |
363  *	Lock VM & BO dma-resv locks                                            |
364  *	If atomic fault, migrate to VRAM, else validate BO location            |
365  *	Issue rebind                                                           |
366  *	Wait on rebind to complete                                             |
367  *	Check if userptr invalidated since pin                                 |
368  *		Drop VM & BO dma-resv locks                                    |
369  *		Retry ----------------------------------------------------------
370  *	Unlock all
371  *	Issue blocking TLB invalidation
372  *	Send page fault response to GuC
373  *
374  * Access counters
375  * ---------------
376  *
377  * Access counters can be configured to trigger a G2H indicating the device is
378  * accessing VMAs in system memory frequently as a hint to migrate those VMAs to
379  * VRAM.
380  *
381  * Same as the page fault handler, access counter G2H cannot be processed in the
382  * G2H worker under the CT lock. Again we use a buffer to sink access counter
383  * G2H. Unlike page faults there is no upper bound so if the buffer is full we
384  * simply drop the G2H. Access counters are a best effort optimization and it is
385  * safe to drop these unlike page faults.
386  *
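 * The drop-when-full behaviour can be sketched as a small ring buffer. This is
 * an assumption-laden model rather than the driver's data structure: the names
 * and the queue length are invented, a message is reduced to one 64-bit value,
 * and the locking between the G2H path and the worker is left out.
 *
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACC_QUEUE_LEN	8	// assumed size; one slot is always kept empty

static_assert(ACC_QUEUE_LEN > 1, "queue needs at least one usable slot");

struct acc_queue {
	uint64_t msg[ACC_QUEUE_LEN];	// stand-in for a decoded G2H message
	size_t head;			// next slot to write (G2H path)
	size_t tail;			// next slot to read (worker)
	uint64_t dropped;		// messages discarded while full
};

// Called from the G2H path: never blocks, drops the message when full.
static bool acc_queue_push(struct acc_queue *q, uint64_t m)
{
	size_t next = (q->head + 1) % ACC_QUEUE_LEN;

	if (next == q->tail) {
		q->dropped++;
		return false;
	}
	q->msg[q->head] = m;
	q->head = next;
	return true;
}

// Called from the worker: returns false when there is nothing to process.
static bool acc_queue_pop(struct acc_queue *q, uint64_t *m)
{
	if (q->tail == q->head)
		return false;
	*m = q->msg[q->tail];
	q->tail = (q->tail + 1) % ACC_QUEUE_LEN;
	return true;
}
```
 *
 * A page fault queue could not use this scheme as is, since dropping a fault
 * would leave the faulting engine waiting for a response.
 *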
387  * The access counter handler itself is rather simple, flow is below.
388  *
389  * .. code-block::
390  *
391  *	Lookup VM from ASID in access counter G2H
392  *	Lock VM global lock in read mode
393  *	Lookup VMA from address in access counter G2H
394  *	If userptr, bail nothing to do
395  *	Lock VM & BO dma-resv locks
396  *	Issue migration to VRAM
397  *	Unlock all
398  *
399  * Notice no rebind is issued in the access counter handler as the rebind will
400  * be issued on next page fault.
401  *
402  * Caveats with eviction / user pointer invalidation
403  * -------------------------------------------------
404  *
405  * In the case of eviction and user pointer invalidation on a faulting VM, there
406  * is no need to issue a rebind rather we just need to blow away the page tables
407  * for the VMAs and the page fault handler will rebind the VMAs when they fault.
408  * The caveat is to update / read the page table structure the VM global lock is
409  * needed. In both the case of eviction and user pointer invalidation locks are
410  * held which make acquiring the VM global lock impossible. To work around this
411  * every VMA maintains a list of leaf page table entries which should be written
412  * to zero to blow away the VMA's page tables. After writing zero to these
413  * entries a blocking TLB invalidate is issued. At this point it is safe for the
414  * kernel to move the VMA's memory around. This is a necessary lockless
415  * algorithm and is safe as leafs cannot be changed while either an eviction or
416  * userptr invalidation is occurring.
417  *
418  * Locking
419  * =======
420  *
421  * VM locking protects all of the core data paths (bind operations, execs,
422  * evictions, and compute mode rebind worker) in XE.
423  *
424  * Locks
425  * -----
426  *
427  * VM global lock (vm->lock) - rw semaphore lock. Outer most lock which protects
428  * the list of userptrs mapped in the VM, the list of engines using this VM, and
429  * the array of external BOs mapped in the VM. Adding or removing any of the
430  * aforementioned state from the VM requires this lock in write mode. The VM
431  * bind path also acquires this lock in write mode while the exec / compute mode
432  * rebind worker acquires this lock in read mode.
433  *
434  * VM dma-resv lock (vm->ttm.base.resv->lock) - WW lock. Protects VM dma-resv
435  * slots which is shared with any private BO in the VM. Expected to be acquired
436  * during VM binds, execs, and compute mode rebind worker. This lock is also
437  * held when private BOs are being evicted.
438  *
439  * external BO dma-resv lock (bo->ttm.base.resv->lock) - WW lock. Protects
440  * external BO dma-resv slots. Expected to be acquired during VM binds (in
441  * addition to the VM dma-resv lock). All external BO dma-resv locks within a
442  * VM are expected to be acquired (in addition to the VM dma-resv lock) during
443  * execs and the compute mode rebind worker. This lock is also held when an
444  * external BO is being evicted.
445  *
446  * Putting it all together
447  * -----------------------
448  *
449  * 1. An exec and bind operation with the same VM can't be executing at the same
450  * time (vm->lock).
451  *
452  * 2. A compute mode rebind worker and bind operation with the same VM can't be
453  * executing at the same time (vm->lock).
454  *
455  * 3. We can't add / remove userptrs or external BOs to a VM while an exec with
456  * the same VM is executing (vm->lock).
457  *
458  * 4. We can't add / remove userptrs, external BOs, or engines to a VM while a
459  * compute mode rebind worker with the same VM is executing (vm->lock).
460  *
461  * 5. Evictions within a VM can't happen while an exec with the same VM is
462  * executing (dma-resv locks).
463  *
464  * 6. Evictions within a VM can't happen while a compute mode rebind worker
465  * with the same VM is executing (dma-resv locks).
466  *
467  * dma-resv usage
468  * ==============
469  *
470  * As previously stated to enforce the ordering of kernel ops (eviction, userptr
471  * invalidation, munmap style unbinds which result in a rebind), rebinds during
472  * execs, execs, and resumes in the rebind worker we use both the VM's and
473  * external BOs' dma-resv slots. Let's try to make this as clear as possible.
474  *
475  * Slot installation
476  * -----------------
477  *
478  * 1. Jobs from kernel ops install themselves into the DMA_RESV_USAGE_KERNEL
479  * slot of either an external BO or VM (depends on if kernel op is operating on
480  * an external or private BO)
481  *
482  * 2. In non-compute mode, jobs from execs install themselves into the
483  * DMA_RESV_USAGE_BOOKKEEP slot of the VM
484  *
485  * 3. In non-compute mode, jobs from execs install themselves into the
486  * DMA_RESV_USAGE_WRITE slot of all external BOs in the VM
487  *
488  * 4. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
489  * of the VM
490  *
491  * 5. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
492  * of the external BO (if the bind is to an external BO; in addition to #4)
493  *
494  * 6. Every engine using a compute mode VM has a preempt fence installed into
495  * the DMA_RESV_USAGE_PREEMPT_FENCE slot of the VM
496  *
497  * 7. Every engine using a compute mode VM has a preempt fence installed into
498  * the DMA_RESV_USAGE_PREEMPT_FENCE slot of all the external BOs in the VM
499  *
500  * Slot waiting
501  * ------------
502  *
503  * 1. The execution of all jobs from kernel ops shall wait on all slots
504  * (DMA_RESV_USAGE_PREEMPT_FENCE) of either an external BO or VM (depends on if
505  * kernel op is operating on external or private BO)
506  *
507  * 2. In non-compute mode, the execution of all jobs from rebinds in execs shall
508  * wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO or VM
509  * (depends on if the rebind is operating on an external or private BO)
510  *
511  * 3. In non-compute mode, the execution of all jobs from execs shall wait on the
512  * last rebind job
513  *
514  * 4. In compute mode, the execution of all jobs from rebinds in the rebind
515  * worker shall wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO
516  * or VM (depends on if rebind is operating on external or private BO)
517  *
518  * 5. In compute mode, resumes in rebind worker shall wait on last rebind fence
519  *
520  * 6. In compute mode, resumes in rebind worker shall wait on the
521  * DMA_RESV_USAGE_KERNEL slot of the VM
522  *
523  * Putting it all together
524  * -----------------------
525  *
526  * 1. New jobs from kernel ops are blocked behind any existing jobs from
527  * non-compute mode execs
528  *
529  * 2. New jobs from non-compute mode execs are blocked behind any existing jobs
530  * from kernel ops and rebinds
531  *
532  * 3. New jobs from kernel ops are blocked behind all preempt fences signaling in
533  * compute mode
534  *
535  * 4. Compute mode engine resumes are blocked behind any existing jobs from
536  * kernel ops and rebinds
537  *
538  * Future work
539  * ===========
540  *
541  * Support large pages for sysmem and userptr.
542  *
543  * Update page faults to handle BOs at page level granularity (e.g. part of BO
544  * could be in system memory while another part could be in VRAM).
545  *
546  * Page fault handler can likely be optimized a bit more (e.g. rebinds always
547  * wait on the dma-resv kernel slots of VM or BO, technically we only have to
548  * wait on the BO moving. If using a job to do the rebind, we could not block in
549  * the page fault handler rather attach a callback to fence of the rebind job to
550  * signal page fault complete. Our handling of short circuiting for atomic faults
551  * for bound VMAs could be better. etc...). We can tune all of this once we have
552  * benchmarks / performance numbers from workloads up and running.
553  */
554 
555 #endif
556