.. SPDX-License-Identifier: GPL-2.0

=============
False Sharing
=============

What is False Sharing
=====================
False sharing is related to the cache mechanism that maintains the
coherence of one cache line stored in multiple CPUs' caches; an
academic definition of it is in [1]_. Consider a struct with a
refcount and a string::

    struct foo {
        refcount_t refcount;
        ...
        char name[16];
    } ____cacheline_internodealigned_in_smp;

Members 'refcount' (A) and 'name' (B) _share_ one cache line like below::

                +-----------+                     +-----------+
                |   CPU 0   |                     |   CPU 1   |
                +-----------+                     +-----------+
               /                                        |
              /                                         |
             V                                          V
         +----------------------+             +----------------------+
         | A      B             | Cache 0     | A       B            | Cache 1
         +----------------------+             +----------------------+
                             |                  |
  ---------------------------+------------------+-----------------------------
                             |                  |
                           +----------------------+
                           |                      |
                           +----------------------+
              Main Memory  | A       B            |
                           +----------------------+

'refcount' is modified frequently, but 'name' is set once at object
creation time and is never modified. When many CPUs access 'foo' at
the same time, with 'refcount' being bumped frequently by only one CPU
and 'name' being read by the other CPUs, all those reading CPUs have to
reload the whole cache line over and over due to the 'sharing', even
though 'name' is never changed.

There are many real-world cases of performance regressions caused by
false sharing. One of these is the rw_semaphore 'mmap_lock' inside
struct mm_struct, whose cache line layout change triggered a
regression that Linus analyzed in [2]_.

There are two key factors for harmful false sharing:

* A global datum accessed (shared) by many CPUs
* In the concurrent accesses to the data, there is at least one write
  operation: write/write or write/read cases.
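
The layout of 'struct foo' above can be checked from userspace with a
minimal sketch like the following. It is an illustration only, not
kernel code: 'foo_shared', 'foo_split', 'same_cache_line()' and
'CACHE_LINE_SIZE' are hypothetical names, 'refcount_t' is replaced by a
plain 'int', a 64-byte cache line is assumed, and C11 '_Alignas' stands
in for the kernel's alignment macros.

```c
#include <assert.h>	/* static_assert */
#include <stdbool.h>
#include <stddef.h>	/* offsetof, size_t */

/* Assumption: 64-byte cache lines; the real size is CPU specific. */
#define CACHE_LINE_SIZE 64

/* Userspace stand-in for 'struct foo': both members sit in line 0. */
struct foo_shared {
	_Alignas(CACHE_LINE_SIZE) int refcount;	/* written frequently */
	char name[16];				/* set once, then only read */
};

/* Hypothetical re-layout: 'name' starts on its own cache line. */
struct foo_split {
	_Alignas(CACHE_LINE_SIZE) int refcount;
	_Alignas(CACHE_LINE_SIZE) char name[16];
};

/*
 * Return true if two byte offsets fall into the same cache line.
 * Only meaningful when the object itself is cache line aligned.
 */
static inline bool same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}

/* Compile-time check that the re-layout did what was intended. */
static_assert(offsetof(struct foo_split, name) == CACHE_LINE_SIZE,
	      "'name' should start on the second cache line");
```

With these definitions, same_cache_line() applied to the offsets of
'refcount' and 'name' is true for 'foo_shared' and false for
'foo_split'. The price is memory: under the 64-byte assumption the
object grows from 64 to 128 bytes.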

The sharing could be from totally unrelated kernel components, or
different code paths of the same kernel component.


False Sharing Pitfalls
======================
In the past, when a platform had only one or a few CPUs, hot data
members could be purposely put in the same cache line to make them
cache hot and save cache lines/TLB entries, like a lock and the data
protected by it. But for recent large systems with hundreds of CPUs,
this may not work when the lock is heavily contended, as the lock owner
CPU could write to the data while other CPUs are busy spinning on the
lock.

Looking at past cases, there are several frequently occurring patterns
for false sharing:

* A lock (spinlock/mutex/semaphore) and the data protected by it are
  purposely put in one cache line.
* Global data are put together in one cache line. Some kernel
  subsystems have many global parameters of small size (4 bytes),
  which can easily be grouped together and put into one cache line.
* Data members of a big data structure randomly sit together
  without being noticed (a cache line is usually 64 bytes or more),
  like in the 'mem_cgroup' struct.

The 'Possible Mitigations' section below provides real-world examples.

False sharing can easily go unnoticed unless it is deliberately
checked for, so it is valuable to run specific tools on
performance-critical workloads to detect false sharing that affects
performance and optimize accordingly.


How to detect and analyze False Sharing
========================================
perf record/report/stat are widely used for performance tuning, and
once hotspots are detected, tools like 'perf-c2c' and 'pahole' can
be further used to detect and pinpoint the possible false sharing
data structures. 'addr2line' is also good at decoding instruction
pointers when there are multiple layers of inline functions.
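
To try these tools on a known case first, a small userspace reproducer
can help. The following is a minimal sketch under stated assumptions,
not a benchmark: all names ('packed_counters', 'padded_counter',
'run_pair()', 'CACHE_LINE_SIZE') are hypothetical, a 64-byte cache line
is assumed, and POSIX threads are used. Two threads each increment only
their own counter, so there is no logical sharing; in the packed layout
the two counters still live in one cache line.

```c
#include <assert.h>	/* static_assert */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; the real size is CPU specific. */
#define CACHE_LINE_SIZE 64

/* Two independent counters that end up in the same cache line. */
struct packed_counters {
	volatile uint64_t count[2];
};

/* One counter per cache line; use an array of two of these. */
struct padded_counter {
	_Alignas(CACHE_LINE_SIZE) volatile uint64_t count;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
	      "each padded counter should occupy exactly one cache line");

struct worker_arg {
	volatile uint64_t *counter;	/* written by this worker only */
	uint64_t iterations;
};

/* Increment the worker's private counter 'iterations' times. */
static void *worker(void *p)
{
	struct worker_arg *arg = p;

	for (uint64_t i = 0; i < arg->iterations; i++)
		(*arg->counter)++;
	return NULL;
}

/*
 * Run two workers concurrently on counters 'a' and 'b' and return the
 * sum of both counters, or 0 if a thread could not be created.
 */
static uint64_t run_pair(volatile uint64_t *a, volatile uint64_t *b,
			 uint64_t iterations)
{
	struct worker_arg args[2] = { { a, iterations }, { b, iterations } };
	pthread_t tid[2];

	*a = 0;
	*b = 0;
	if (pthread_create(&tid[0], NULL, worker, &args[0]))
		return 0;
	if (pthread_create(&tid[1], NULL, worker, &args[1])) {
		pthread_join(tid[0], NULL);
		return 0;
	}
	pthread_join(tid[0], NULL);
	pthread_join(tid[1], NULL);
	return *a + *b;
}
```

Both layouts produce the same counter values; only the cache line
placement differs. Built with something like 'cc -O2 -pthread' and run
with a large iteration count, the packed variant would be expected to
be the one that stands out in the tools described in this section,
though actual results depend on the machine and on thread placement.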

perf-c2c can capture the cache lines with the most false sharing hits,
the decoded functions (with file and line number) accessing each cache
line, and the in-line offset of the data. Simple commands are::

  $ perf c2c record -ag sleep 3
  $ perf c2c report --call-graph none -k vmlinux

When running the above while testing will-it-scale's tlb_flush1 case,
perf reports something like::

  Total records                     :    1658231
  Locked Load/Store Operations      :      89439
  Load Operations                   :     623219
  Load Local HITM                   :      92117
  Load Remote HITM                  :        139

  #----------------------------------------------------------------------
      4        0     2374        0        0        0  0xff1100088366d880
  #----------------------------------------------------------------------
    0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
    0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
    0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
    0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1

A nice introduction to perf-c2c is in [3]_.

'pahole' decodes data structure layouts at cache line granularity.
Users can match the offset in the perf-c2c output with pahole's
decoding to locate the exact data members. For global data, users
can search for the data address in System.map.


Possible Mitigations
====================
False sharing does not always need to be mitigated. False sharing
mitigations should balance performance gains with complexity and
space consumption.
Sometimes, lower performance is OK, and it's
unnecessary to hyper-optimize every rarely used data structure or
a cold data path.

Cases of false sharing hurting performance are seen more frequently
as core counts increase. Because of these detrimental effects, many
patches have been proposed across a variety of subsystems (like
networking and memory management) and merged. Some common mitigations
(with examples) are:

* Separate hot global data into its own dedicated cache line, even if
  it is just a 'short' type. The downside is more consumption of
  memory, cache lines and TLB entries.

  - Commit 91b6d3256356 ("net: cache align tcp_memory_allocated, tcp_sockets_allocated")

* Reorganize the data structure, separating the interfering members
  into different cache lines. One downside is that it may introduce
  new false sharing among other members.

  - Commit 802f1d522d5f ("mm: page_counter: re-layout structure to reduce false sharing")

* Replace 'write' with 'read' when possible, especially in loops.
  For some global variables, use compare(read)-then-write instead
  of an unconditional write. For example, use::

      if (!test_bit(XXX))
          set_bit(XXX);

  instead of directly "set_bit(XXX);", and similarly for atomic_t data::

      if (atomic_read(XXX) == AAA)
          atomic_set(XXX, BBB);

  - Commit 7b1002f7cfe5 ("bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing")
  - Commit 292648ac5cf1 ("mm: gup: allow FOLL_PIN to scale in SMP")

* Turn hot global data into 'per-cpu data + global data' when possible,
  or reasonably increase the threshold for syncing per-cpu data to
  global data, to reduce or postpone the 'write' to that global data.

  - Commit 520f897a3554 ("ext4: use percpu_counters for extent_status cache hits/misses")
  - Commit 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")

Of course, all mitigations should be carefully verified to ensure they
do not cause side effects. To avoid introducing false sharing when
coding, it's better to:

* Be aware of cache line boundaries
* Group mostly read-only fields together
* Group things that are written at the same time together
* Separate frequently read and frequently written fields on
  different cache lines.

It is also good practice to add a comment stating the false sharing
consideration.

Note that sometimes, even after a severe false sharing case is detected
and solved, the performance may still show no obvious improvement, as
the hotspot switches to a new place.


Miscellaneous
=============
One open issue is that the kernel has an optional data structure
randomization mechanism, which also randomizes the situation of cache
line sharing among data members.


.. [1] https://en.wikipedia.org/wiki/False_sharing
.. [2] https://lore.kernel.org/lkml/CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com/
.. [3] https://joemario.github.io/blog/2016/09/01/c2c-blog/