/* SPDX-License-Identifier: MIT */
/*
 * Copyright © 2022 Intel Corporation
 */

#ifndef _XE_VM_DOC_H_
#define _XE_VM_DOC_H_

/**
 * DOC: XE VM (user address space)
 *
 * VM creation
 * ===========
 *
 * Allocate a physical page for the root of the page table structure, create a
 * default bind engine, and return a handle to the user.
 *
 * Scratch page
 * ------------
 *
 * If the VM is created with the flag DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE, the
 * entire page table structure defaults to pointing at a blank page allocated
 * by the VM. Invalid memory accesses then just read / write this page rather
 * than fault.
 *
 * VM bind (create GPU mapping for a BO or userptr)
 * ================================================
 *
 * Creates GPU mappings for a BO or userptr within a VM. VM binds use the same
 * in / out fence interface (struct drm_xe_sync) as execs, which allows users to
 * think of binds and execs as more or less the same operation.
 *
 * Operations
 * ----------
 *
 * DRM_XE_VM_BIND_OP_MAP - Create mapping for a BO
 * DRM_XE_VM_BIND_OP_UNMAP - Destroy mapping for a BO / userptr
 * DRM_XE_VM_BIND_OP_MAP_USERPTR - Create mapping for userptr
 *
 * Implementation details
 * ~~~~~~~~~~~~~~~~~~~~~~
 *
 * All bind operations are implemented via a hybrid approach of using the CPU
 * and GPU to modify page tables. If a new physical page is allocated in the
 * page table structure, we populate that page via the CPU and insert that new
 * page into the existing page table structure via a GPU job. Any existing
 * pages in the page table structure that need to be modified are also updated
 * via the GPU job. As the root physical page is preallocated on VM creation,
 * our GPU job will always have at least 1 update. The in / out fences are
 * passed to this job, so again this is conceptually the same as an exec.
 *
 * Very simple example of a few binds on an empty VM with 48 bits of address
 * space and the resulting operations:
 *
 * .. code-block::
 *
 *	bind BO0 0x0-0x1000
 *	alloc page level 3a, program PTE[0] to BO0 phys address (CPU)
 *	alloc page level 2, program PDE[0] page level 3a phys address (CPU)
 *	alloc page level 1, program PDE[0] page level 2 phys address (CPU)
 *	update root PDE[0] to page level 1 phys address (GPU)
 *
 *	bind BO1 0x201000-0x202000
 *	alloc page level 3b, program PTE[1] to BO1 phys address (CPU)
 *	update page level 2 PDE[1] to page level 3b phys address (GPU)
 *
 *	bind BO2 0x1ff000-0x201000
 *	update page level 3a PTE[511] to BO2 phys address (GPU)
 *	update page level 3b PTE[0] to BO2 phys address + 0x1000 (GPU)
 *
 * GPU bypass
 * ~~~~~~~~~~
 *
 * In the above example the steps using the GPU can be converted to CPU if the
 * bind can be done immediately (all in-fences satisfied, VM dma-resv kernel
 * slot is idle).
 *
 * Address space
 * -------------
 *
 * Depending on platform either 48 or 57 bits of address space is supported.
 *
 * Page sizes
 * ----------
 *
 * The minimum page size is either 4k or 64k depending on platform and memory
 * placement (sysmem vs. VRAM). We enforce that binds must be aligned to the
 * minimum page size.
 *
 * Larger pages (2M or 1GB) can be used for BOs in VRAM when both the BO
 * physical address and the VA are aligned to the larger page size. Larger
 * pages for userptrs / BOs in sysmem should be possible but are not yet
 * implemented.
 *
 * Sync error handling mode
 * ------------------------
 *
 * In both modes during the bind IOCTL the user input is validated.
 * In sync error handling mode the newly bound BO is validated (potentially
 * moved back to a region of memory where it can be used), page tables are
 * updated by the CPU, and the job to do the GPU binds is created in the IOCTL
 * itself. This step can fail due to memory pressure. The user can recover by
 * freeing memory and trying this operation again.
 *
 * Async error handling mode
 * -------------------------
 *
 * In async error handling mode the steps of validating the BO, updating page
 * tables, and generating a job are deferred to an async worker. As these steps
 * can now fail after the IOCTL has reported success, we need an error handling
 * flow from which the user can recover.
 *
 * The solution is for a user to register a user address with the VM which the
 * VM uses to report errors to. The ufence wait interface can be used to wait on
 * a VM going into an error state. Once an error is reported the VM's async
 * worker is paused. While the VM's async worker is paused, sync
 * DRM_XE_VM_BIND_OP_UNMAP operations are allowed (this can free memory). Once
 * the user believes the error state is fixed, the async worker can be resumed
 * via the XE_VM_BIND_OP_RESTART operation. When VM async bind work is
 * restarted, the first operation processed is the operation that caused the
 * original error.
 *
 * Bind queues / engines
 * ---------------------
 *
 * Think of the case where we have two bind operations A + B which are submitted
 * in that order. A has in fences while B has none. If using a single bind
 * queue, B is now blocked on A's in fences even though it is ready to run. This
 * example is a real use case for VK sparse binding. We work around this
 * limitation by implementing bind engines.
 *
 * In the bind IOCTL the user can optionally pass in an engine ID which must map
 * to an engine which is of the special class DRM_XE_ENGINE_CLASS_VM_BIND.
 * Underneath, this is really a virtual engine that can run on any of the copy
 * hardware engines. The job(s) created by each IOCTL are inserted into this
 * engine's ring. In the example above, if A and B have different bind engines,
 * B is free to pass A. If the engine ID field is omitted, the default bind
 * queue for the VM is used.
 *
 * TODO: Explain race in issue 41 and how we solve it
 *
 * Array of bind operations
 * ------------------------
 *
 * The uAPI allows multiple bind operations to be passed in via a user array,
 * of struct drm_xe_vm_bind_op, in a single VM bind IOCTL. This interface
 * matches the VK sparse binding API. The implementation is rather simple: parse
 * the array into a list of operations, pass the in fences to the first
 * operation, and pass the out fences to the last operation. The ordered nature
 * of a bind engine makes this possible.
 *
 * Munmap semantics for unbinds
 * ----------------------------
 *
 * Munmap allows things like:
 *
 * .. code-block::
 *
 *	0x0000-0x2000 and 0x3000-0x5000 have mappings
 *	Munmap 0x1000-0x4000, results in mappings 0x0000-0x1000 and 0x4000-0x5000
 *
 * To support this semantic we decompose the above example into 4 operations:
 *
 * .. code-block::
 *
 *	unbind 0x0000-0x2000
 *	unbind 0x3000-0x5000
 *	rebind 0x0000-0x1000
 *	rebind 0x4000-0x5000
 *
 * Why not just do a partial unbind of 0x1000-0x2000 and 0x3000-0x4000? This
 * falls apart when using large pages at the edges and the unbind forces us to
 * use a smaller page size. For simplicity we always issue a set of unbinds
 * unmapping anything in the range and at most 2 rebinds on the edges.
 *
 * Similar to an array of binds, in fences are passed to the first operation and
 * out fences are signaled on the last operation.
 *
 * In this example there is a window of time where 0x0000-0x1000 and
 * 0x4000-0x5000 are invalid but the user didn't ask for these addresses to be
 * removed from the mapping. To work around this we treat any munmap style
 * unbinds which require a rebind as kernel operations (similar to BO eviction
 * or userptr invalidation). The first operation waits on the VM's
 * DMA_RESV_USAGE_PREEMPT_FENCE slots (waits for all pending jobs on VM to
 * complete / triggers preempt fences) and the last operation is installed in
 * the VM's DMA_RESV_USAGE_KERNEL slot (blocks future jobs / resume compute mode
 * VM). The caveat is all dma-resv slots must be updated atomically with respect
 * to execs and the compute mode rebind worker. To accomplish this, hold the
 * vm->lock in write mode from the first operation until the last.
 *
 * Deferred binds in fault mode
 * ----------------------------
 *
 * If a VM is in fault mode (TODO: link to fault mode), new bind operations that
 * create mappings are by default deferred to the page fault handler (first
 * use). This behavior can be overridden by setting the flag
 * DRM_XE_VM_BIND_FLAG_IMMEDIATE, which indicates that the mapping should be
 * created immediately.
 *
 * User pointer
 * ============
 *
 * User pointers are user allocated memory (malloc'd, mmap'd, etc.) for which
 * the user wants to create a GPU mapping. Typically in other DRM drivers a
 * dummy BO was created and then a binding was created. We bypass creating a
 * dummy BO in XE and simply create a binding directly from the userptr.
 *
 * Invalidation
 * ------------
 *
 * Since this is core kernel managed memory, the kernel can move this memory
 * whenever it wants. We register an invalidation MMU notifier to alert XE when
 * a user pointer is about to move.
 * The invalidation notifier needs to block until all pending users (jobs or
 * compute mode engines) of the userptr are idle to ensure no faults. This is
 * done by waiting on all of the VM's dma-resv slots.
 *
 * Rebinds
 * -------
 *
 * Either the next exec (non-compute) or rebind worker (compute mode) will
 * rebind the userptr. The invalidation MMU notifier kicks the rebind worker
 * after the VM dma-resv wait if the VM is in compute mode.
 *
 * Compute mode
 * ============
 *
 * A VM in compute mode enables long running workloads and ultra low latency
 * submission (ULLS). ULLS is implemented via a continuously running batch +
 * semaphores. This enables the user to insert jump-to-new-batch commands into
 * the continuously running batch. In both cases these batches exceed the time
 * a dma fence is allowed to exist for before signaling, as such dma fences are
 * not used when a VM is in compute mode. User fences (TODO: link user fence
 * doc) are used instead to signal an operation's completion.
 *
 * Preempt fences
 * --------------
 *
 * If the kernel decides to move memory around (either userptr invalidate, BO
 * eviction, or munmap style unbind which results in a rebind) and a batch is
 * running on an engine, that batch can fault or cause memory corruption as
 * page tables for the moved memory are no longer valid. To work around this we
 * introduce the concept of preempt fences. When sw signaling is enabled on a
 * preempt fence it tells the submission backend to kick that engine off the
 * hardware and the preempt fence signals when the engine is off the hardware.
 * Once all preempt fences are signaled for a VM the kernel can safely move the
 * memory and kick the rebind worker which resumes all the engines' execution.
 *
 * A preempt fence, for every engine using the VM, is installed in the VM's
 * dma-resv DMA_RESV_USAGE_PREEMPT_FENCE slot.
 * The same preempt fence, for every engine using the VM, is also installed
 * into the same dma-resv slot of every external BO mapped in the VM.
 *
 * Rebind worker
 * -------------
 *
 * The rebind worker is very similar to an exec. It is responsible for rebinding
 * evicted BOs or userptrs, waiting on those operations, installing new preempt
 * fences, and finally resuming execution of engines in the VM.
 *
 * Flow
 * ~~~~
 *
 * .. code-block::
 *
 *	<----------------------------------------------------------------------|
 *	Check if VM is closed, if so bail out                                  |
 *	Lock VM global lock in read mode                                       |
 *	Pin userptrs (also finds userptr invalidated since last rebind worker) |
 *	Lock VM dma-resv and external BOs dma-resv                             |
 *	Validate BOs that have been evicted                                    |
 *	Wait on and allocate new preempt fences for every engine using the VM  |
 *	Rebind invalidated userptrs + evicted BOs                              |
 *	Wait on last rebind fence                                              |
 *	Wait VM's DMA_RESV_USAGE_KERNEL dma-resv slot                          |
 *	Install preempt fences and issue resume for every engine using the VM  |
 *	Check if any userptrs invalidated since pin                            |
 *	    Squash resume for all engines                                      |
 *	    Unlock all                                                         |
 *	    Wait all VM's dma-resv slots                                       |
 *	    Retry --------------------------------------------------------------
 *	Release all engines waiting to resume
 *	Unlock all
 *
 * Timeslicing
 * -----------
 *
 * In order to prevent an engine from continuously being kicked off the hardware
 * and making no forward progress, an engine has a period of time it is allowed
 * to run after resume before it can be kicked off again. This effectively gives
 * each engine a timeslice.
 *
 * Handling multiple GTs
 * =====================
 *
 * If a GT has slower access to some regions and the page table structure is in
 * the slow region, the performance on that GT could adversely be affected.
 * To work around this we allow a VM's page tables to be shadowed in multiple
 * GTs. When a VM is created, a default bind engine and page table structure
 * are created on each GT.
 *
 * Binds can optionally pass in a mask of GTs where a mapping should be created,
 * if this mask is zero then default to all the GTs where the VM has page
 * tables.
 *
 * The implementation for this breaks down into a bunch of for_each_gt loops in
 * various places plus exporting a composite fence for multi-GT binds to the
 * user.
 *
 * Fault mode (unified shared memory)
 * ==================================
 *
 * A VM in fault mode can be enabled on devices that support page faults. If
 * page faults are enabled, using dma fences can potentially induce a deadlock:
 * A pending page fault can hold up the GPU work which holds up the dma fence
 * signaling, and memory allocation is usually required to resolve a page
 * fault, but memory allocation is not allowed to gate dma fence signaling. As
 * such, dma fences are not allowed when a VM is in fault mode. Because dma
 * fences are not allowed, long running workloads and ULLS are enabled on a
 * faulting VM.
 *
 * Deferred VM binds
 * -----------------
 *
 * By default, on a faulting VM binds just allocate the VMA and the actual
 * updating of the page tables is deferred to the page fault handler. This
 * behavior can be overridden by setting the flag DRM_XE_VM_BIND_FLAG_IMMEDIATE
 * in the VM bind which will then do the bind immediately.
 *
 * Page fault handler
 * ------------------
 *
 * Page faults are received in the G2H worker under the CT lock which is in the
 * path of dma fences (no memory allocations are allowed, faults require memory
 * allocations) thus we cannot process faults under the CT lock.
 * Another issue is that faults issue TLB invalidations which require G2H
 * credits and we cannot allocate G2H credits in the G2H handlers without
 * deadlocking. Lastly, we do not want the CT lock to be an outer lock of the VM
 * global lock (the VM global lock is required for fault processing).
 *
 * To work around the above issues with processing faults in the G2H worker, we
 * sink faults to a buffer which is large enough to sink all possible faults on
 * the GT (1 per hardware engine) and kick a worker to process the faults. Since
 * the page faults G2H are already received in a worker, kicking another worker
 * adds more latency to a critical performance path. We add a fast path in the
 * G2H irq handler which looks at the first G2H and if it is a page fault we
 * sink the fault to the buffer and kick the worker to process the fault. TLB
 * invalidation responses are also in the critical path so these can also be
 * processed in this fast path.
 *
 * Multiple buffers and workers are used and hashed over based on the ASID so
 * faults from different VMs can be processed in parallel.
 *
 * The page fault handler itself is rather simple, flow is below.
 *
 * .. code-block::
 *
 *	Lookup VM from ASID in page fault G2H
 *	Lock VM global lock in read mode
 *	Lookup VMA from address in page fault G2H
 *	Check if VMA is valid, if not bail
 *	Check if VMA's BO has backing store, if not allocate
 *	<----------------------------------------------------------------------|
 *	If userptr, pin pages                                                  |
 *	Lock VM & BO dma-resv locks                                            |
 *	If atomic fault, migrate to VRAM, else validate BO location            |
 *	Issue rebind                                                           |
 *	Wait on rebind to complete                                             |
 *	Check if userptr invalidated since pin                                 |
 *	    Drop VM & BO dma-resv locks                                        |
 *	    Retry --------------------------------------------------------------
 *	Unlock all
 *	Issue blocking TLB invalidation
 *	Send page fault response to GuC
 *
 * Access counters
 * ---------------
 *
 * Access counters can be configured to trigger a G2H indicating the device is
 * accessing VMAs in system memory frequently as a hint to migrate those VMAs to
 * VRAM.
 *
 * Same as the page fault handler, access counter G2H cannot be processed in the
 * G2H worker under the CT lock. Again we use a buffer to sink access counter
 * G2H. Unlike page faults there is no upper bound so if the buffer is full we
 * simply drop the G2H. Access counters are a best-effort optimization and it is
 * safe to drop these, unlike page faults.
 *
 * The access counter handler itself is rather simple, flow is below.
 *
 * .. code-block::
 *
 *	Lookup VM from ASID in access counter G2H
 *	Lock VM global lock in read mode
 *	Lookup VMA from address in access counter G2H
 *	If userptr, bail nothing to do
 *	Lock VM & BO dma-resv locks
 *	Issue migration to VRAM
 *	Unlock all
 *
 * Notice no rebind is issued in the access counter handler as the rebind will
 * be issued on next page fault.
 *
 * Caveats with eviction / user pointer invalidation
 * -------------------------------------------------
 *
 * In the case of eviction and user pointer invalidation on a faulting VM, there
 * is no need to issue a rebind, rather we just need to blow away the page
 * tables for the VMAs and the page fault handler will rebind the VMAs when they
 * fault. The caveat is that to update / read the page table structure the VM
 * global lock is needed. In both the case of eviction and user pointer
 * invalidation locks are held which make acquiring the VM global lock
 * impossible. To work around this every VMA maintains a list of leaf page table
 * entries which should be written to zero to blow away the VMA's page tables.
 * After writing zero to these entries a blocking TLB invalidate is issued. At
 * this point it is safe for the kernel to move the VMA's memory around. This is
 * a necessary lockless algorithm and is safe as leafs cannot be changed while
 * either an eviction or userptr invalidation is occurring.
 *
 * Locking
 * =======
 *
 * VM locking protects all of the core data paths (bind operations, execs,
 * evictions, and compute mode rebind worker) in XE.
 *
 * Locks
 * -----
 *
 * VM global lock (vm->lock) - rw semaphore lock. Outer most lock which protects
 * the list of userptrs mapped in the VM, the list of engines using this VM, and
 * the array of external BOs mapped in the VM. Adding or removing any of the
 * aforementioned state from the VM requires acquiring this lock in write mode.
 * The VM bind path also acquires this lock in write mode while the exec /
 * compute mode rebind worker acquire this lock in read mode.
 *
 * VM dma-resv lock (vm->ttm.base.resv->lock) - WW lock. Protects VM dma-resv
 * slots which are shared with any private BO in the VM. Expected to be acquired
 * during VM binds, execs, and compute mode rebind worker.
 * This lock is also held when private BOs are being evicted.
 *
 * external BO dma-resv lock (bo->ttm.base.resv->lock) - WW lock. Protects
 * external BO dma-resv slots. Expected to be acquired during VM binds (in
 * addition to the VM dma-resv lock). All external BO dma-resv locks within a VM
 * are expected to be acquired (in addition to the VM dma-resv lock) during
 * execs and the compute mode rebind worker. This lock is also held when an
 * external BO is being evicted.
 *
 * Putting it all together
 * -----------------------
 *
 * 1. An exec and bind operation with the same VM can't be executing at the same
 * time (vm->lock).
 *
 * 2. A compute mode rebind worker and bind operation with the same VM can't be
 * executing at the same time (vm->lock).
 *
 * 3. We can't add / remove userptrs or external BOs to a VM while an exec with
 * the same VM is executing (vm->lock).
 *
 * 4. We can't add / remove userptrs, external BOs, or engines to a VM while a
 * compute mode rebind worker with the same VM is executing (vm->lock).
 *
 * 5. Evictions within a VM can't happen while an exec with the same VM is
 * executing (dma-resv locks).
 *
 * 6. Evictions within a VM can't happen while a compute mode rebind worker
 * with the same VM is executing (dma-resv locks).
 *
 * dma-resv usage
 * ==============
 *
 * As previously stated, to enforce the ordering of kernel ops (eviction,
 * userptr invalidation, munmap style unbinds which result in a rebind), rebinds
 * during execs, execs, and resumes in the rebind worker we use both the VM's
 * and external BOs' dma-resv slots. Let's try to make this as clear as
 * possible.
 *
 * Slot installation
 * -----------------
 *
 * 1. Jobs from kernel ops install themselves into the DMA_RESV_USAGE_KERNEL
 * slot of either an external BO or VM (depends on if kernel op is operating on
 * an external or private BO)
 *
 * 2. In non-compute mode, jobs from execs install themselves into the
 * DMA_RESV_USAGE_BOOKKEEP slot of the VM
 *
 * 3. In non-compute mode, jobs from execs install themselves into the
 * DMA_RESV_USAGE_WRITE slot of all external BOs in the VM
 *
 * 4. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
 * of the VM
 *
 * 5. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
 * of the external BO (if the bind is to an external BO, this is in addition to
 * #4)
 *
 * 6. Every engine using a compute mode VM has a preempt fence installed into
 * the DMA_RESV_USAGE_PREEMPT_FENCE slot of the VM
 *
 * 7. Every engine using a compute mode VM has a preempt fence installed into
 * the DMA_RESV_USAGE_PREEMPT_FENCE slot of all the external BOs in the VM
 *
 * Slot waiting
 * ------------
 *
 * 1. The execution of all jobs from kernel ops shall wait on all slots
 * (DMA_RESV_USAGE_PREEMPT_FENCE) of either an external BO or VM (depends on if
 * kernel op is operating on external or private BO)
 *
 * 2. In non-compute mode, the execution of all jobs from rebinds in execs shall
 * wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO or VM
 * (depends on if the rebind is operating on an external or private BO)
 *
 * 3. In non-compute mode, the execution of all jobs from execs shall wait on
 * the last rebind job
 *
 * 4. In compute mode, the execution of all jobs from rebinds in the rebind
 * worker shall wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO
 * or VM (depends on if rebind is operating on external or private BO)
 *
 * 5. In compute mode, resumes in rebind worker shall wait on last rebind fence
 *
 * 6. In compute mode, resumes in rebind worker shall wait on the
 * DMA_RESV_USAGE_KERNEL slot of the VM
 *
 * Putting it all together
 * -----------------------
 *
 * 1. New jobs from kernel ops are blocked behind any existing jobs from
 * non-compute mode execs
 *
 * 2. New jobs from non-compute mode execs are blocked behind any existing jobs
 * from kernel ops and rebinds
 *
 * 3. New jobs from kernel ops are blocked behind all preempt fences signaling
 * in compute mode
 *
 * 4. Compute mode engine resumes are blocked behind any existing jobs from
 * kernel ops and rebinds
 *
 * Future work
 * ===========
 *
 * Support large pages for sysmem and userptr.
 *
 * Update page faults to handle BOs at page level granularity (e.g. part of a
 * BO could be in system memory while another part could be in VRAM).
 *
 * The page fault handler can likely be optimized a bit more (e.g. rebinds
 * always wait on the dma-resv kernel slots of VM or BO, technically we only
 * have to wait on the BO moving. If using a job to do the rebind, we could
 * avoid blocking in the page fault handler and instead attach a callback to the
 * fence of the rebind job to signal page fault completion. Our handling of
 * short circuiting for atomic faults for bound VMAs could be better, etc.). We
 * can tune all of this once we have benchmarks / performance numbers from
 * workloads up and running.
 */

#endif