==========================
Memory Resource Controller
==========================

.. caution::
   This document is hopelessly outdated and needs a complete rewrite.
   It still contains useful information, so we are keeping it here, but
   make sure to check the current code if you need a deeper
   understanding.

.. note::
   The Memory Resource Controller has generically been referred to as the
   memory controller in this document. Do not confuse the memory controller
   used here with the memory controller that is used in hardware.

.. hint::
   When we mention a cgroup (cgroupfs's directory) with memory controller,
   we call it "memory cgroup". When you look at git-log and source code,
   you'll see that patch titles and function names tend to use "memcg".
   In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12]_ mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm(development version of 2010/April)

Features:

 - accounting anonymous pages, file caches, swap caches usage and limiting them.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) account at moving a task is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See :ref:`section 2.7
 <cgroup-v1-memory-kernel-extension>`)

Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task(thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for event_fd()
                                     This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of memory usage hits limits
 memory.memsw.failcnt                show the number of memory+Swap hits limits
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
                                     This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.oom_control                  set/show oom controls.
 memory.numa_stat                    show the number of memory usage per numa
                                     node
 memory.kmem.limit_in_bytes          Deprecated knob to set and read the kernel
                                     memory hard limit. Kernel hard limit is not
                                     supported since 5.16. Writing any value to
                                     the file will not have any effect, the same
                                     as if the nokmem kernel parameter was
                                     specified.
                                     Kernel memory is still charged and reported
                                     by memory.kmem.usage_in_bytes.
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of kernel memory usage
                                     hits limits
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
                                     hits limits
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]_. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]_
in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions
of the RSS controller. At OLS, at the resource management BoF, everyone
suggested that we handle both page cache and RSS together. Another request was
raised to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11]_.

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

.. code-block::
   :caption: Figure 1: Hierarchy of Accounting

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              + --------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------->  page_cgroup|
           |               |           |               |
           +---------------+           +---------------+



Figure 1 shows the important aspects of the controller

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.
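
The charge path described above can be pictured with a small user-space
model. This is only an illustrative sketch, not kernel code: the
``PageCounter`` class, its ``try_charge()`` method and the ``reclaim``
callback are hypothetical names invented for this example, and the real
charge path handles many details (hierarchy, batching, OOM handling) that
are left out here.

.. code-block:: python

   from typing import Callable, Optional


   class PageCounter:
       """Toy stand-in for a per-cgroup counter: usage and limit in pages."""

       def __init__(self, limit: int) -> None:
           self.limit = limit    # maximum number of pages allowed
           self.usage = 0        # pages currently charged
           self.failcnt = 0      # how many times a charge hit the limit

       def try_charge(self, nr_pages: int,
                      reclaim: Optional[Callable[[int], int]] = None) -> bool:
           """Charge nr_pages; if that would exceed the limit, try reclaim once."""
           if self.usage + nr_pages > self.limit:
               self.failcnt += 1
               if reclaim is not None:
                   needed = self.usage + nr_pages - self.limit
                   # reclaim() returns how many pages it managed to free
                   freed = reclaim(needed)
                   self.usage -= min(freed, self.usage)
           if self.usage + nr_pages > self.limit:
               # Still over the limit: this is where an OOM routine would run.
               return False
           self.usage += nr_pages
           return True

       def uncharge(self, nr_pages: int) -> None:
           """Drop a charge, for example when a page is freed."""
           self.usage -= min(nr_pages, self.usage)

In this model a charge that would push usage past the limit succeeds only
if the reclaim callback frees enough pages first.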

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (xarray). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from xarray. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after it is added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
Since the page's memcg is recorded in swap regardless of whether memsw is
enabled, the page will be accounted after swapin.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out-of-control from the VM
view.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>`: when moving a
task to another cgroup, its pages may be recharged to the new cgroup, if
move_charge_at_immigrate has been chosen.

2.4 Swap Extension
--------------------------------------

Swap usage is always recorded for each cgroup. Swap Extension allows you to
read and limit it.

When CONFIG_SWAP is enabled, the following files are added.

 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.

2.4.1 why 'memory+swap' rather than swap
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
moving the account from memory to swap; there is no change in the usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.

2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, the global LRU can still swap out
memory from it for sanity of the system's memory management state. You can't
forbid it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

.. note::
   Reclaim does not work for the root cgroup, since we cannot set any
   limits on the root cgroup.

.. note::
   When panic_on_oom is set to "2", the whole system will panic.

When oom event notifier is registered, event will be delivered.
(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section)

2.6 Locking
-----------

Lock order is as follows::

  folio_lock
    mm->page_table_lock or split pte_lock
      folio_memcg_lock (memcg->move_lock)
        mapping->i_pages lock
          lruvec->lru_lock.

Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; the folio LRU flag is cleared before
isolating a page from its LRU under lruvec->lru_lock.

.. _cgroup-v1-memory-kernel-extension:

2.7 Kernel Memory Extension
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different than user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.
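
One way to see whether kernel memory accounting was disabled at boot is to
look for cgroup.memory=nokmem on the kernel command line. The following is a
minimal sketch, assuming the command line can be read from /proc/cmdline and
that cgroup.memory= takes a comma-separated option list; the helper names
``kmem_accounting_disabled()`` and ``read_cmdline()`` are made up for this
example.

.. code-block:: python

   def kmem_accounting_disabled(cmdline: str) -> bool:
       """Return True if a cgroup.memory= argument lists the nokmem option."""
       for arg in cmdline.split():
           key, sep, value = arg.partition("=")
           if key == "cgroup.memory" and sep:
               # Assumption: options are separated by commas.
               if "nokmem" in value.split(","):
                   return True
       return False


   def read_cmdline(path: str = "/proc/cmdline") -> str:
       """Read the boot command line of the running kernel."""
       with open(path) as f:
           return f.read()

On a running system, ``kmem_accounting_disabled(read_cmdline())`` would
report whether the option was given at boot.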

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting into
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created every time the cache is touched for the first
  time from inside the memcg. The creation is done lazily, so some objects can
  still be skipped while the cache is being created. All objects in a slab page
  should belong to the same memcg. This only fails to hold when a task is
  migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
  some sockets protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since the
    box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of
    QoS.

    .. warning::
       In the current implementation, memory reclaim will NOT be triggered for
       a cgroup when it hits K while staying below U, which makes this setup
       impractical.

U != 0, K >= U:
    Since kmem charges will also be fed to the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.

3. User Interface
=================

To use the user interface:

1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options
2. Prepare the cgroups (see :ref:`Why are cgroups needed?
   <cgroups-why-needed>` for the background information)::

      # mount -t tmpfs none /sys/fs/cgroup
      # mkdir /sys/fs/cgroup/memory
      # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3. Make the new group and move bash into it::

      # mkdir /sys/fs/cgroup/memory/0
      # echo $$ > /sys/fs/cgroup/memory/0/tasks

4. Since now we're in the 0 cgroup, we can alter the memory limit::

      # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

   The limit can now be queried::

      # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
      4194304

.. note::
   We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
   mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
   Gibibytes.)

.. note::
   We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

.. note::
   We cannot set limits on the root cgroup any more.


We can check the usage::

  # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  1216512

A successful write to memory.limit_in_bytes does not guarantee a successful
setting of this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel::

  # echo 1 > memory.limit_in_bytes
  # cat memory.limit_in_bytes
  4096

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.

The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance test is also important. To see pure memory controller's overhead,
testing on tmpfs will give you good numbers of small overheads.
Example: do kernel make on tmpfs.

Page-fault scalability is also important. When measuring a parallel
page fault test, a multi-process test may be better than a multi-thread
test because the latter has noise from shared objects/status.

But the above two are testing extreme situations.
Running ordinary workloads under the memory controller is always helpful.

.. _cgroup-v1-memory-test-troubleshoot:

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).

To find out what is happening, it helps to disable the OOM killer as per
:ref:`"10. OOM Control" <cgroup-v1-memory-oom-control>` (below) and observe
the result.

.. _cgroup-v1-memory-test-task-migration:

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move charges of a task along with task migration.
See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>`

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1
<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2
<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge
associated with it, even though all tasks have migrated away from it. (This is
because we charge against pages, not against tasks.)

The stats are moved to the parent; the charge itself does not change except
that it is uncharged from the child.

Charges recorded in swap information are not updated when a cgroup is removed.
The recorded information is discarded and a cgroup which uses the swap
(swapcache) will be charged as its new owner.

5. Misc. interfaces
===================

5.1 force_empty
---------------

The memory.force_empty interface is provided to make the cgroup's memory usage
empty.
When anything is written to it::

        # echo 0 > memory.force_empty

the cgroup will be reclaimed and as many pages reclaimed as possible.

The typical use case for this interface is before calling rmdir().
Although rmdir() offlines the memcg, the memcg may still stay around due to
charged file caches. Some out-of-use page caches may stay charged until
memory pressure happens. If you want to avoid that, force_empty will be useful.

5.2 stat file
-------------

The memory.stat file includes the following statistics:

* per-memory cgroup local status

  =============== ===============================================================
  cache           # of bytes of page cache memory.
  rss             # of bytes of anonymous and swap cache memory (includes
                  transparent hugepages).
  rss_huge        # of bytes of anonymous transparent hugepages.
  mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
  pgpgin          # of charging events to the memory cgroup. The charging
                  event happens each time a page is accounted as either mapped
                  anon page(RSS) or cache page(Page Cache) to the cgroup.
  pgpgout         # of uncharging events to the memory cgroup. The uncharging
                  event happens each time a page is unaccounted from the
                  cgroup.
  swap            # of bytes of swap usage
  swapcached      # of bytes of swap cached in memory
  dirty           # of bytes that are waiting to get written back to the disk.
  writeback       # of bytes of file/anon cache that are queued for syncing to
                  disk.
  inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                  LRU list.
  active_anon     # of bytes of anonymous and swap cache memory on active
                  LRU list.
  inactive_file   # of bytes of file-backed memory and MADV_FREE anonymous
                  memory (LazyFree pages) on inactive LRU list.
  active_file     # of bytes of file-backed memory on active LRU list.
  unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
  =============== ===============================================================

* status considering hierarchy (see memory.use_hierarchy settings):

  ========================= ===================================================
  hierarchical_memory_limit # of bytes of memory limit with regard to the
                            hierarchy under which the memory cgroup is.
  hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to the
                            hierarchy under which the memory cgroup is.
  total_<counter>           # hierarchical version of <counter>, which in
                            addition to the cgroup's own value includes the
                            sum of all hierarchical children's values of
                            <counter>, e.g. total_cache
  ========================= ===================================================

* additional vm parameters (depends on CONFIG_DEBUG_VM):

  ========================= ========================================
  recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
  recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
  recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
  recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
  ========================= ========================================

.. hint::
   recent_rotated means recent frequency of LRU rotation.
   recent_scanned means recent # of scans to LRU.
   These are shown to aid debugging; please see the code for their exact
   meanings.

.. note::
   Only anonymous and swap cache memory is listed as part of 'rss' stat.
   This should not be confused with the true 'resident set size' or the
   amount of physical memory used by the cgroup.

   'rss + mapped_file' will give you the resident set size of the cgroup.

   (Note: file and shmem may be shared among other cgroups. In that case,
   mapped_file is accounted only when the memory cgroup is owner of page
   cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike global reclaim, limit reclaim enforces that a
swappiness of 0 really prevents any swapping, even if swap storage is
available. This might invoke the memcg OOM killer if there are no file
pages to reclaim.
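
As a rough way to see whether limit reclaim with a swappiness of 0 still has
file pages to work with, the file-backed LRU counters from memory.stat (see
5.2) can be summed. The sketch below only parses text in the memory.stat
format; the function names and the sample values are illustrative assumptions
and are not part of the kernel interface.

```python
def parse_memory_stat(text):
    """Parse "name value" lines, as found in memory.stat, into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def file_lru_bytes(stats):
    """Bytes of file-backed memory on the inactive and active LRU lists."""
    return stats.get("inactive_file", 0) + stats.get("active_file", 0)


# Made-up sample in the memory.stat format. On a real system the text would
# be read from the memory.stat file of the cgroup of interest.
sample = """cache 1048576
rss 2097152
mapped_file 524288
inactive_file 786432
active_file 262144
"""

stats = parse_memory_stat(sample)
print(file_lru_bytes(stats))
```

A result of zero suggests that reclaim has no file pages left to try. A
nonzero result is only a hint, since not every page on these lists can
necessarily be reclaimed at a given moment.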

5.4 failcnt
-----------

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

        # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and swap)
usage; it's a fuzzy value for efficient access. (Of course, when necessary,
it's synchronized.)
If you want to know the memory usage more exactly, you should use the
RSS+CACHE(+SWAP) value in memory.stat (see 5.2).

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis.
This is
useful for providing visibility into the numa locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem.
Consider, for example, the following cgroup filesystem
hierarchy::

               root
             /  |   \
            /   |    \
           a    b     c
                      | \
                      |  \
                      d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up to the root (i.e., c and root).
If one of the ancestors goes over its limit, the reclaim algorithm reclaims
from the tasks in the ancestor and the children of the ancestor.

6.1 Hierarchical accounting and reclaim
---------------------------------------

Hierarchical accounting is enabled by default. Disabling hierarchical
accounting is deprecated. An attempt to do so will result in a failure
and a warning printed to dmesg.

For compatibility reasons, writing 1 to memory.use_hierarchy will always pass::

        # echo 1 > memory.use_hierarchy

7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but the kernel does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)::

        # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

        # echo 1G > memory.soft_limit_in_bytes

.. note::
   Soft limits take effect over a long period of time, since they involve
   reclaiming memory for balancing between memory cgroups.

.. note::
   It is recommended to always set the soft limit below the hard limit,
   otherwise the hard limit will take precedence.

.. _cgroup-v1-memory-move-charges:

8. Move charges at task migration (DEPRECATED!)
===============================================

THIS IS DEPRECATED!

It's expensive and unreliable! It's better practice to launch workload
tasks directly from inside their target cgroup. Use dedicated workload
cgroups to allow fine-grained policy adjustments without having to
move physical pages between control domains.

Users can move charges associated with a task along with task migration, that
is, uncharge the task's pages from the old cgroup and charge them to the new
cgroup. This feature is not supported in !CONFIG_MMU environments because of
the lack of page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

        # echo (some positive value) > memory.move_charge_at_immigrate

.. note::
   Each bit of move_charge_at_immigrate has its own meaning about what type
   of charges should be moved. See :ref:`section 8.2
   <cgroup-v1-memory-movable-charges>` for details.

.. note::
   Charges are moved only when you move mm->owner, in other words,
   a leader of a thread group.

.. note::
   If we cannot find enough space for the task in the destination cgroup, we
   try to make space by reclaiming memory. Task migration may fail if we
   cannot make enough space.

.. note::
   It can take several seconds if you move many charges.

And if you want to disable it again::

        # echo 0 > memory.move_charge_at_immigrate

.. _cgroup-v1-memory-movable-charges:

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved.
But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't done page fault, i.e. they might   |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
|   | The mapcount of the page is ignored (the page can be moved independent   |
|   | of the mapcount). You must enable Swap Extension (see 2.4) to            |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

8.3 TODO
--------

- All charge-moving operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds
====================

The memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows an application to register multiple memory
and memsw thresholds and to get notified when usage crosses them.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through the eventfd when memory usage crosses
the threshold in either direction.

This is applicable to both root and non-root cgroups.

.. _cgroup-v1-memory-oom-control:

10. OOM Control
===============

The memory.oom_control file is for OOM notification and other controls.

The memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows an application to register multiple OOM
notification deliveries and to get notified when OOM happens.

To register a notifier, an application must:

- create an eventfd using eventfd(2)
- open the memory.oom_control file
- write a string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control

The application will be notified through the eventfd when OOM happens.
OOM notification doesn't work for the root cgroup.

You can disable the OOM killer by writing "1" to the memory.oom_control file,
as::

        # echo 1 > memory.oom_control

If the OOM killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

To let them run again, you have to relax the memory cgroup's OOM status by

* enlarging the limit or reducing usage.

To reduce usage,

* kill some tasks.
* move some tasks to another group with account migration.
* remove some files (on tmpfs?)

Then, stopped tasks will work again.

On reading, the current status of OOM is shown.

- oom_kill_disable: 0 or 1
  (if 1, the OOM killer is disabled)
- under_oom: 0 or 1
  (if 1, the memory cgroup is under OOM, tasks may be stopped.)
- oom_kill: integer counter
  The number of processes belonging to this cgroup killed by any
  kind of OOM killer.

11. Memory Pressure
===================

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level.
Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (e.g.
shut down unimportant services early).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM) or the in-kernel OOM killer is even on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take immediate action.

By default, events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, you have three cgroups: A->B->C. Now
you set up an event listener on cgroups A, B and C, and suppose group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it.
This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B will receive
a notification only if there are no event listeners for group C.

There are three optional modes that specify different propagation behavior:

- "default": this is the default behavior specified above. This mode is the
  same as omitting the optional mode parameter, preserved for backwards
  compatibility.

- "hierarchy": events always propagate up to the root, similar to the default
  behavior, except that propagation continues regardless of whether there are
  event listeners at each level, with the "hierarchy" mode. In the above
  example, groups A, B, and C will receive notification of memory pressure.

- "local": events are pass-through, i.e. they only receive notifications when
  memory pressure is experienced in the memcg for which the notification is
  registered. In the above example, group C will receive notification if
  registered for "local" notification and the group experiences memory
  pressure. However, group B will never receive notification, regardless of
  whether there is an event listener for group C or not, if group B is
  registered for local notification.
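
The propagation rules above can be illustrated with a small model. This is a
simplified sketch of the behavior as described in this section, not kernel
code; the function ``notified_groups`` and its arguments are hypothetical
names chosen for the illustration.

```python
def notified_groups(path, listeners, origin):
    """Return the groups whose listeners are signalled, nearest group first.

    path      -- cgroup names ordered from root to leaf, e.g. ["A", "B", "C"]
    listeners -- dict mapping a group name to the list of modes registered
                 there: "default", "hierarchy" or "local"
    origin    -- the group that experiences memory pressure
    """
    notified = []
    handled = False  # True once a group closer to the origin was signalled
    # Walk from the pressured group up towards the root.
    for group in reversed(path[: path.index(origin) + 1]):
        signalled_here = False
        for mode in listeners.get(group, []):
            if mode == "local" and group != origin:
                continue  # local listeners ignore pressure from descendants
            if mode == "default" and handled:
                continue  # default listeners stop once the event is handled
            signalled_here = True
        if signalled_here:
            notified.append(group)
            handled = True
    return notified


path = ["A", "B", "C"]
all_default = {"A": ["default"], "B": ["default"], "C": ["default"]}
all_hierarchy = {"A": ["hierarchy"], "B": ["hierarchy"], "C": ["hierarchy"]}

print(notified_groups(path, all_default, "C"))
print(notified_groups(path, {"A": ["default"], "B": ["default"]}, "C"))
print(notified_groups(path, all_hierarchy, "C"))
```

With default listeners everywhere, only C is reported. Without a listener on
C, the event is handled at B. With "hierarchy" listeners, C, B and A are all
reported.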

The level and event notification mode ("hierarchy" or "local", if necessary) are
specified by a comma-delimited string, e.g. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs. Notification
with the default, non pass-through behavior does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string of the form
  "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory pressure is
at the specified level (or higher). Read/write operations on
memory.pressure_level are not implemented.
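
The registration steps above can be sketched in Python as follows. This is a
minimal illustration rather than a reference implementation: it assumes
Python 3.10 or later (for os.eventfd) and a mounted cgroup v1 memory
hierarchy, and the helper names format_event_control() and
register_pressure_listener() are made up for this example.

```python
import os

LEVELS = ("low", "medium", "critical")
MODES = ("default", "hierarchy", "local")


def format_event_control(event_fd, pressure_fd, level, mode=None):
    """Build "<event_fd> <fd of memory.pressure_level> <level[,mode]>"."""
    if level not in LEVELS:
        raise ValueError("unknown level: %r" % (level,))
    if mode is not None and mode not in MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    args = level if mode is None else "%s,%s" % (level, mode)
    return "%d %d %s" % (event_fd, pressure_fd, args)


def register_pressure_listener(cgroup_dir, level, mode=None):
    """Register an eventfd for memory pressure events of one cgroup.

    Returns (event_fd, pressure_fd); the caller closes both when done.
    """
    # Step 1: create an eventfd.
    event_fd = os.eventfd(0)
    # Step 2: open memory.pressure_level.
    pressure_fd = os.open(
        os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY)
    # Step 3: write the registration string to cgroup.event_control.
    control_fd = os.open(
        os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
    try:
        line = format_event_control(event_fd, pressure_fd, level, mode)
        os.write(control_fd, line.encode("ascii"))
    except BaseException:
        os.close(pressure_fd)
        os.close(event_fd)
        raise
    finally:
        os.close(control_fd)
    return event_fd, pressure_fd
```

A caller would typically run
register_pressure_listener("/sys/fs/cgroup/memory/foo", "low", "hierarchy")
and then block in os.eventfd_read() on the returned eventfd until a
notification arrives.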

Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   new cgroup experience critical pressure::

        # cd /sys/fs/cgroup/memory/
        # mkdir foo
        # cd foo
        # cgroup_event_listener memory.pressure_level low,hierarchy &
        # echo 8000000 > memory.limit_in_bytes
        # echo 8000000 > memory.memsw.limit_in_bytes
        # echo $$ > tasks
        # dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO
========

1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
3. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.

References
==========

.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
.. [2] Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
.. [3] Emelianov, Pavel. Resource controllers based on process cgroups,
   https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2),
   https://lore.kernel.org/r/461A3010.90403@sw.ru
.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3),
   https://lore.kernel.org/r/465D9739.8070209@openvz.org

6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   https://lore.kernel.org/r/464C95D4.7070806@linux.vnet.ibm.com
9. Singh, Balbir. RSS controller v2 AIM9 results,
   https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
    https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop

.. [11] Singh, Balbir. Memory controller introduction (v6),
   https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
.. [12] Corbet, Jonathan. Controlling memory use in cgroups,
   http://lwn.net/Articles/243795/