.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM.
Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing the refcount of
each page by a special value: GUP_PIN_COUNTING_BIAS.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below.::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1.::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
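
The flag handling described in this section can be modeled in a few lines of
userspace C. This is only a sketch, not kernel code: the TOY_* bit values and
the toy_*() helpers are hypothetical names invented for illustration, and
rejecting a caller-supplied FOLL_GET or FOLL_PIN in the pin wrapper is an
assumption about how "check for problems" might look.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values, for illustration only; not the kernel's values. */
#define TOY_FOLL_GET		0x1u
#define TOY_FOLL_PIN		0x2u
#define TOY_FOLL_LONGTERM	0x4u

/*
 * Model of a pin_user_pages*() wrapper: FOLL_PIN is OR'd in with whatever
 * flags the caller provides. Returns 0 on success, -1 on a bad request.
 */
static inline int toy_pin_flags(unsigned int caller_flags, void **pages,
				unsigned int *out_flags)
{
	/* A non-null pages array is required for pinning. */
	if (!pages)
		return -1;

	/*
	 * Assumption: FOLL_GET and FOLL_PIN are mutually exclusive, and
	 * FOLL_PIN must not appear at the call site, so reject both here.
	 */
	if (caller_flags & (TOY_FOLL_GET | TOY_FOLL_PIN))
		return -1;

	*out_flags = caller_flags | TOY_FOLL_PIN;
	return 0;
}

/*
 * Model of a get_user_pages*() wrapper: FOLL_GET is set for the caller
 * only when a non-null pages array was passed in.
 */
static inline unsigned int toy_get_flags(unsigned int caller_flags,
					 void **pages)
{
	if (pages)
		caller_flags |= TOY_FOLL_GET;
	return caller_flags;
}
```

In this model, a caller that passes TOY_FOLL_LONGTERM to the pin wrapper ends
up with both the pin and longterm bits set, which matches the "FOLL_PIN is a
prerequisite to FOLL_LONGTERM" rule.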

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need). The
  unpinning functions also don't do anything to a zero page. This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_page() and related, must be used.
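
The biased-refcount scheme in the list above can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not the kernel
implementation: struct toy_page and the toy_*() helpers are hypothetical, only
the refcount field is modeled, and the "maybe pinned" test is assumed to be a
simple comparison against the bias value of 1024 quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: same value as quoted in the text (1024, i.e. 10 bits). */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct toy_page {
	int refcount;
};

/* Ordinary reference: +1, as with FOLL_GET. */
static inline void toy_get_page(struct toy_page *p)
{
	p->refcount += 1;
}

static inline void toy_put_page(struct toy_page *p)
{
	assert(p->refcount >= 1);
	p->refcount -= 1;
}

/* DMA pin: add the whole bias to the same counter, as with FOLL_PIN. */
static inline void toy_pin_page(struct toy_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

static inline void toy_unpin_page(struct toy_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/*
 * "Maybe" pinned: true whenever the counter has reached the bias. A page
 * with many ordinary references also reaches it (a false positive), but a
 * page that really is pinned can never fall below it (no false negatives).
 */
static inline bool toy_maybe_dma_pinned(const struct toy_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

With this model, one toy_pin_page() call makes toy_maybe_dma_pinned() return
true, and so do 1024 toy_get_page() calls on an otherwise unpinned page, which
is the fuzzy behavior described above.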

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as, for example, explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_acquired
    /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on.
  For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general. Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019