
.. _cgroup-v2:

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
  1-1. Terminology
  1-2. What is cgroup?
  2-1. Mounting
  2-2. Organizing Processes and Threads
    2-2-1. Processes
    2-2-2. Threads
  2-3. [Un]populated Notification
  2-4. Controlling Controllers
    2-4-1. Enabling and Disabling
    2-4-2. Top-down Constraint
    2-4-3. No Internal Process Constraint
  2-5. Delegation
    2-5-1. Model of Delegation
    2-5-2. Delegation Containment
  2-6. Guidelines
    2-6-1. Organize Once and Control
    2-6-2. Avoid Name Collisions
  3-1. Weights
  3-2. Limits
  3-3. Protections
  3-4. Allocations
  4-1. Format
  4-2. Conventions
  4-3. Core Interface Files
  5-1. CPU
    5-1-1. CPU Interface Files
  5-2. Memory
    5-2-1. Memory Interface Files
    5-2-2. Usage Guidelines
    5-2-3. Memory Ownership
  5-3. IO
    5-3-1. IO Interface Files
    5-3-2. Writeback
    5-3-3. IO Latency
      5-3-3-1. How IO Latency Throttling Works
      5-3-3-2. IO Latency Interface Files
    5-3-4. IO Priority
  5-4. PID
    5-4-1. PID Interface Files
  5-5. Cpuset
    5-5-1. Cpuset Interface Files
  5-6. Device
  5-7. RDMA
    5-7-1. RDMA Interface Files
  5-8. DMEM
  5-9. HugeTLB
    5-9-1. HugeTLB Interface Files
  5-10. Misc
    5-10-1. Miscellaneous cgroup Interface Files
    5-10-2. Migration and Ownership
  5-11. Others
    5-11-1. perf_event
  5-N. Non-normative information
    5-N-1. CPU controller root cgroup process behaviour
    5-N-2. IO controller root cgroup process behaviour
  6-1. Basics
  6-2. The Root and Views
  6-3. Migration and setns(2)
  6-4. Interaction with Other Namespaces
  P-1. Filesystem Support for Writeback
  R-1. Multiple Hierarchies
  R-2. Thread Granularity
  R-3. Competition Between Inner Nodes and Threads
  R-4. Other Interface Issues
  R-5. Controller Issues and Remedies
    R-5-1. Memory
Terminology
-----------

"cgroup" stands for "control group" and is never capitalized. The
singular form is used to designate the whole feature and also as a
qualifier as in "cgroup controllers". When explicitly referring to
multiple individual control groups, the plural form "cgroups" is used.

What is cgroup?
---------------
cgroup is a mechanism to organize processes hierarchically and
distribute system resources along the hierarchy in a controlled and
configurable manner.

cgroup is largely composed of two parts - the core and controllers.

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup. All threads of a process belong to the
same cgroup. On creation, all processes are put in the cgroup that
the parent process belongs to at the time.

Resource distribution is
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups consisting of the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further, so the
restrictions set closer to the root in the hierarchy can not be
overridden from further away.
Mounting
--------

The cgroup v2 hierarchy can be mounted with::

  # mount -t cgroup2 none $MOUNT_POINT

Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies. This allows mixing the v2 hierarchy with
legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
linger on their old hierarchies for a while, and, due
to inter-controller dependencies, other controllers may need to be
moved together.

The cgroup v2 interface may, in the future, stop supporting
disabling controllers in v1 and make them always available in v2.
cgroup v2 currently supports the following mount options.

  nsdelegate
    Consider cgroup namespaces as delegation boundaries. This option
    is ignored on non-init namespace mounts. Please refer to the
    Delegation section for details.

  memory_localevents
    Only populate memory.events with data for the current cgroup,
    and not any subtrees. This is legacy behavior; the default
    behavior without this option is to include subtree counts. This
    option is ignored on non-init namespace mounts.

  memory_recursiveprot
    Recursively apply memory.min and memory.low protection to
    entire subtrees, without requiring explicit downward propagation
    into leaf cgroups. This is the deemed-useful default
    behavior but is a mount-option to avoid regressing setups
    relying on the original semantics.

  memory_hugetlb_accounting
    Count HugeTLB memory usage towards the cgroup's overall
    memory usage for the memory controller (for the purpose of
    statistics reporting and memory protection). This is a new
    behavior that could regress existing setups, so it must be
    explicitly opted in with this mount option.
    A few caveats to keep in mind:

    * There is no HugeTLB pool management involved in the memory
      controller. The pre-allocated pool does not belong to anyone.
      HugeTLB memory is not accounted for at allocation time by the
      memory controller. It is only charged to a cgroup when it is
      actually used (e.g. at page fault time). Host memory
      overcommit management has to consider this when setting
      hard limits. In general, HugeTLB pool management should be
      done via other mechanisms (such as the HugeTLB controller).

    * Failure to charge a HugeTLB folio to the memory controller
      results in SIGBUS. This could happen even if the HugeTLB pool
      still has pages available.

    * Charging HugeTLB memory towards the memory controller affects
      memory protection and reclaim dynamics. Any userspace tuning
      (of low or min limits, for example) needs to take this into
      account. HugeTLB pages utilized before this option was enabled
      will not be tracked by the memory controller (even if cgroup
      v2 is remounted later on).
  pids_localevents
    The option restores v1-like behavior of pids.events:max, that is, only
    local (inside cgroup proper) fork failures are counted.
Organizing Processes and Threads
--------------------------------

Processes
~~~~~~~~~

A child cgroup can be created by creating a sub-directory::

  # mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree
structure. Each cgroup has a read-writable interface file
"cgroup.procs". When read, it lists the PIDs of all processes which
belong to the cgroup one-per-line. The PIDs are not ordered and the
same PID may show up more than once if the process got moved to
another cgroup and then back, or the PID got recycled while reading.

A zombie process does not appear in "cgroup.procs" and thus can't be
moved to another cgroup.

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
cgroup is in use in the system, this file may contain multiple lines,
one for each hierarchy. The entry for cgroup v2 is always in the
format "0::$PATH"::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested

If the cgroup of a process is removed while the process is still
alive, " (deleted)" is appended to the path::

  # cat /proc/842/cgroup
  ...
  0::/test-cgroup/test-cgroup-nested (deleted)
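The "0::$PATH" entry shown above can be parsed programmatically. Below is an illustrative sketch (the helper name is made up, not kernel-provided tooling) that extracts the v2 path and the deletion marker:

```python
def cgroup2_path(proc_cgroup_text):
    """Extract the cgroup v2 path from /proc/PID/cgroup content.

    The v2 entry always uses hierarchy ID 0 and an empty controller
    list, i.e. "0::$PATH". A trailing " (deleted)" marks a removed
    cgroup; it is stripped here and reported separately.
    """
    for line in proc_cgroup_text.splitlines():
        hier_id, controllers, path = line.split(":", 2)
        if hier_id == "0" and controllers == "":
            deleted = path.endswith(" (deleted)")
            if deleted:
                path = path[: -len(" (deleted)")]
            return path, deleted
    return None, False
```

On a mixed v1/v2 system the same file carries one line per hierarchy, so the loop simply skips the v1 entries.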
Threads
~~~~~~~

cgroup v2 supports thread granularity for a subset of controllers.
Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup. The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy. The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain and serves as the resource
domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have unpopulated child cgroups which aren't threaded.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file. A domain cgroup can be turned into a threaded
cgroup by writing "threaded" to the "cgroup.type" file, subject to
the following conditions.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children.

Topology-wise, a cgroup can be in an invalid state. Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains. C can't be used until it is turned into a
threaded cgroup. The "cgroup.type" file will report "domain (invalid)" in
these cases.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.

"cgroup.threads" contains the TIDs of all
threads in the cgroup. Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs". "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain; its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all the threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree. When
a threaded controller is enabled inside a threaded subtree, it only
accounts for and controls resource consumptions associated with the
threads in the cgroup and its descendants. All consumptions which
aren't tied to a specific thread belong to the threaded domain cgroup.

Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.

Currently, the following controllers are threaded and can be enabled
in a threaded cgroup::

  - cpu
  - cpuset
  - perf_event
  - pids
[Un]populated Notification
--------------------------

Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
events are triggered when the value changes. This can be used, for
example, to start a clean-up operation after all processes of a given
sub-hierarchy have exited. The populated state updates and
notifications are recursive. Consider the following sub-hierarchy
where the numbers in the parentheses represent the numbers of processes
in each cgroup::

  A(4) - B(0) - C(1)
              \ D(0)

A, B and C's "populated" fields would be 1 while D's 0. After the one
process in C exits, B and C's "populated" fields would flip to "0" and
file modified events will be generated on the "cgroup.events" files of
both cgroups.
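A notification consumer polling "cgroup.events" has to parse the flat-keyed format described later in this document. A minimal parsing sketch (function names are illustrative, not part of any kernel API):

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup file ("KEY VAL" per line) into a dict."""
    result = {}
    for line in text.splitlines():
        key, value = line.split()
        result[key] = int(value)
    return result

def is_populated(events_text):
    """True if the "populated" field of a cgroup.events snapshot is 1."""
    return parse_flat_keyed(events_text).get("populated") == 1
```

The same parser works for any flat-keyed file such as memory.events or cgroup.stat.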
Controlling Controllers
-----------------------

Enabling and Disabling
~~~~~~~~~~~~~~~~~~~~~~

A controller is available in a cgroup when it is supported by the kernel (i.e.,
compiled in, not disabled and not attached to a v1 hierarchy) and listed in the
"cgroup.controllers" file. The interface files of enabled controllers
are exposed in the cgroup's directory, allowing the distribution of the target
resource to be observed and controlled::

  # cat cgroup.controllers
  cpu io memory

Controllers can be enabled and disabled by writing to the
"cgroup.subtree_control" file::

  # echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, either they
all succeed or all fail.

Enabling a controller in a cgroup indicates that the distribution of
its target resource across its immediate children is controlled.
Consider the following sub-hierarchy. The enabled controllers are
listed in parentheses::

  A(cpu,memory) - B(memory) - C()
                            \ D()

As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B. As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource,
enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.
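A management agent composing such writes could build the control string with a helper like the following (a hypothetical sketch; a real agent would then write the result into the cgroup's "cgroup.subtree_control" file):

```python
def subtree_control_write(enable=(), disable=()):
    """Build a cgroup.subtree_control write string.

    Controllers prefixed with '+' are enabled and those prefixed with
    '-' are disabled; a single write may carry several operations,
    which either all succeed or all fail.
    """
    return " ".join(["+" + c for c in enable] + ["-" + c for c in disable])
```

For example, `subtree_control_write(enable=["cpu", "memory"], disable=["io"])` yields the string used in the echo example above.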
Top-down Constraint
~~~~~~~~~~~~~~~~~~~

Resources are distributed top-down and a cgroup can further distribute
a resource only if the resource has been distributed to it from the
parent. This means that all non-root "cgroup.subtree_control" files
can only contain controllers which are enabled in the parent's
"cgroup.subtree_control" file.

No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Non-root cgroups can distribute domain resources to their children
only when they don't have any processes of their own. In other words,
only domain cgroups which don't contain any processes can have domain
controllers enabled in their "cgroup.subtree_control" files.

The root cgroup is exempt from this restriction as it contains
processes and anonymous resource consumption which can't be associated
with any other cgroups and requires special treatment from most
controllers. How resource consumption in the root cgroup is governed
is up to each controller (for more information on this topic please
refer to the Non-normative information section in the Controllers
chapter).

Note that the restriction doesn't get in the way if there is no
enabled controller in the cgroup's "cgroup.subtree_control". This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its processes to the
children before enabling controllers in its "cgroup.subtree_control"
file.
Delegation
----------

Model of Delegation
~~~~~~~~~~~~~~~~~~~

A cgroup can be delegated in two ways. First, to a less privileged
user by granting write access of the directory and its "cgroup.procs",
"cgroup.threads" and "cgroup.subtree_control" files to the user.
Second, if the "nsdelegate" mount option is set, automatically to a
cgroup namespace on namespace creation.

Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to those files; for the second, only
those files listed in "/sys/kernel/cgroup/delegate" (including
"cgroup.procs") are writable from inside the namespace.

The effects of both delegation types are the same. Once
delegated, the user can build sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent. The limits and other settings
of all resource controllers are hierarchical and regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.

Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.
For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy it can't pull
in from or push out to outside the sub-hierarchy.

For an example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs". U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.

For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration. If either
is not reachable, the migration is rejected with -ENOENT.
Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process. This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged. A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up. Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.

Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its children cgroups occupy the same
directory and it is possible to create children cgroups which collide
with interface files. All cgroup core interface files are prefixed
with "cgroup." and each controller's interface files are prefixed with
the controller name and a dot. Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.

cgroup controllers implement several resource distribution schemes
depending on the resource type and expected use cases. This section
describes major schemes in use along with their expected behaviors.
Weights
-------

A parent's resource is distributed by adding up the weights of all
active children and giving each the fraction matching the ratio of its
weight against the sum. As only children which can make use of the
resource at the moment participate in the distribution, this is
work-conserving. Due to the dynamic nature, this model is usually
used for stateless resources.

All weights are in the range [1, 10000] with the default at 100. This
allows symmetric multiplicative biases in both directions at fine
enough granularity while staying in the intuitive range.

As long as the weight is in range, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"cpu.weight" proportionally distributes CPU cycles to active children
and is an example of this type.
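The proportional-share arithmetic described above can be sketched directly. Assuming all siblings are active and competing, each one receives its weight divided by the sum of the weights:

```python
def weight_shares(weights):
    """Map sibling cgroup names to their fraction of the contended
    resource, given their weights (range [1, 10000], default 100)."""
    for w in weights.values():
        assert 1 <= w <= 10000, "weight out of range"
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Note that because the model is work-conserving, an idle sibling drops out of the sum and the remaining children's effective shares grow accordingly.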
.. _cgroupv2-limits-distributor:

Limits
------

A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.

Limits are in the range [0, max] and default to "max", which is noop.

As limits can be over-committed, all configuration combinations are
valid and there is no reason to reject configuration changes or
process migrations.

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
698 -----------
703 soft boundaries. Protections can also be over-committed in which case
707 Protections are in the range [0, max] and defaults to 0, which is
710 As protections can be over-committed, all configuration combinations
714 "memory.low" implements best-effort memory protection and is an
Allocations
-----------

A cgroup is exclusively allocated a certain amount of a finite
resource. Allocations can't be over-committed - the sum of the
allocations of children can not exceed the amount of resource
available to the parent.

Allocations are in the range [0, max] and default to 0, which is no
resource.

As allocations can't be over-committed, some configuration
combinations are invalid and should be rejected. Also, if the
resource is mandatory for execution of processes, process migrations
may be rejected.

"cpu.rt.max" hard-allocates realtime slices and is an example of this
type.
Format
------

All interface files should be in one of the following formats whenever
possible::

  New-line separated values
  (when only one value can be written at once)

	VAL0\n
	VAL1\n
	...

  Space separated values
  (when read-only or multiple values can be written at once)

	VAL0 VAL1 ...\n

  Flat keyed

	KEY0 VAL0\n
	KEY1 VAL1\n
	...

  Nested keyed

	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
	...

For a writable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time. For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.
Conventions
-----------

- Settings for a single feature should be contained in a single file.

- The root cgroup should be exempt from resource control and thus
  shouldn't have resource control interface files.

- The default time unit is microseconds. If a different unit is ever
  used, an explicit unit suffix must be present.

- A parts-per quantity should use a percentage decimal with at least
  two digit fractional part - e.g. 13.40.

- If a controller implements weight based resource distribution, its
  interface file should be named "weight" and have the range [1,
  10000] with 100 as the default. The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the interface files should be named "min" and "max"
  respectively. If a controller implements best effort resource
  guarantee and/or limit, the interface files should be named "low"
  and "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has a configurable default value and keyed specific
  overrides, the default entry should be keyed with "default" and
  appear as the first entry in the file.

  For example, a setting which is keyed by major:minor device numbers
  with integer values may look like the following::

    # cat cgroup-example-interface-file
    default 150
    8:0 300

  The default value can be updated by::

    # echo 125 > cgroup-example-interface-file

  or::

    # echo "default 125" > cgroup-example-interface-file

  An override can be set by::

    # echo "8:16 170" > cgroup-example-interface-file

  and cleared by::

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
Core Interface Files
--------------------

All cgroup core files are prefixed with "cgroup."

cgroup.type
  A read-write single value file which exists on non-root
  cgroups.

  When read, it indicates the current type of the cgroup, which
  can be one of the following values.

  - "domain" : A normal valid domain cgroup.

  - "domain threaded" : A threaded domain cgroup which is
    serving as the root of a threaded subtree.

  - "domain invalid" : A cgroup which is in an invalid state.
    It can't be populated or have controllers enabled. It may
    be allowed to become a threaded cgroup.

  - "threaded" : A threaded cgroup which is a member of a
    threaded subtree.

  A cgroup can be turned into a threaded cgroup by writing
  "threaded" to this file.

cgroup.procs
  A read-write new-line separated values file which exists on
  all cgroups.

  When read, it lists the PIDs of all processes which belong to
  the cgroup one-per-line. The PIDs are not ordered and the
  same PID may show up more than once if the process got moved
  to another cgroup and then back, or the PID got recycled while
  reading.

  A PID can be written to migrate the process associated with
  the PID to the cgroup. The writer should match all of the
  following conditions.

  - It must have write access to the "cgroup.procs" file.

  - It must have write access to the "cgroup.procs" file of the
    common ancestor of the source and destination cgroups.

  When delegating a sub-hierarchy, write access to this file
  should be granted along with the containing directory.

  In a threaded cgroup, reading this file fails with EOPNOTSUPP
  as all the processes belong to the thread root. Writing is
  supported and moves every thread of the process to the cgroup.
cgroup.threads
  A read-write new-line separated values file which exists on
  all cgroups.

  When read, it lists the TIDs of all threads which belong to
  the cgroup one-per-line. The TIDs are not ordered and the
  same TID may show up more than once if the thread got moved
  to another cgroup and then back, or the TID got recycled while
  reading.

  A TID can be written to migrate the thread associated with
  the TID into the cgroup. The writer should match all of the
  following conditions.

  - It must have write access to the "cgroup.threads" file.

  - The cgroup that the thread is currently in must be in the
    same resource domain as the destination cgroup.

  - It must have write access to the "cgroup.procs" file of the
    common ancestor of the source and destination cgroups.

  When delegating a sub-hierarchy, write access to this file
  should be granted along with the containing directory.
cgroup.controllers
  A read-only space separated values file which exists on all
  cgroups.

  It shows space separated list of all controllers available to
  the cgroup. The controllers are not ordered.

cgroup.subtree_control
  A read-write space separated values file which exists on all
  cgroups. Starts out empty.

  When read, it shows space separated list of the controllers
  which are enabled to control resource distribution from the
  cgroup to its children.

  Space separated list of controllers prefixed with '+' or '-'
  can be written to enable or disable controllers. A controller
  name prefixed with '+' enables the controller and '-'
  disables. If a controller appears more than once on the list,
  the last one is effective. When multiple enable and disable
  operations are specified, either all succeed or all fail.

cgroup.events
  A read-only flat-keyed file which exists on non-root cgroups.
  The following entries are defined. Unless specified
  otherwise, a value change in this file generates a file
  modified event.

  populated
    1 if the cgroup or its descendants contains any live
    processes; otherwise, 0.
  frozen
    1 if the cgroup is frozen; otherwise, 0.
cgroup.max.descendants
  A read-write single value file. The default is "max".

  Maximum allowed number of descendant cgroups. If the actual
  number of descendants is equal or larger,
  an attempt to create a new cgroup in the hierarchy will fail.

cgroup.max.depth
  A read-write single value file. The default is "max".

  Maximum allowed descent depth below the current cgroup. If
  the actual descent depth is equal or larger, an attempt to
  create a new child cgroup will fail.

cgroup.stat
  A read-only flat-keyed file with the following entries:

  nr_descendants
    Total number of visible descendant cgroups.

  nr_dying_descendants
    Total number of dying descendant cgroups. A cgroup becomes
    dying after being deleted by a user. The cgroup will remain
    in a dying state for some undefined time (which can depend
    on system load) before being completely destroyed.

  nr_subsys_<cgroup_subsys>
    Total number of live cgroup subsystems (e.g. memory
    cgroup) at and beneath the current cgroup.

  nr_dying_subsys_<cgroup_subsys>
    Total number of dying cgroup subsystems (e.g. memory
    cgroup) at and beneath the current cgroup.
cgroup.freeze
  A read-write single value file which exists on non-root cgroups.
  Allowed values are "0" and "1". The default is "0".

  Writing "1" to the file causes freezing of the cgroup and all
  descendant cgroups. This means that all belonging processes will
  be stopped and will not run until the cgroup is explicitly
  unfrozen. Freezing of the cgroup may take some time; when this action
  is completed, the "frozen" value in the cgroup.events control file
  will be updated to "1" and the corresponding notification will be
  issued.

  A cgroup can be frozen either by its own settings, or by settings
  of any ancestor cgroups. If any of the ancestor cgroups is frozen,
  the cgroup will remain frozen.

  Processes in the frozen cgroup can be killed by a fatal signal.
  They also can enter and leave a frozen cgroup: either by an explicit
  move by a user, or if freezing of the cgroup races with fork().
  If a process is moved to a frozen cgroup, it stops. If a process is
  moved out of a frozen cgroup, it becomes running.

  Frozen status of a cgroup doesn't affect any cgroup tree operations:
  it's possible to delete a frozen (and empty) cgroup, as well as
  create new sub-cgroups.

cgroup.kill
  A write-only single value file which exists in non-root cgroups.
  The only allowed value is "1".

  Writing "1" to the file causes the cgroup and all descendant cgroups
  to be killed. This means that all processes located in the affected cgroup
  tree will be killed via SIGKILL.

  Killing a cgroup tree will deal with concurrent forks appropriately
  and is protected against migrations.

  In a threaded cgroup, writing this file fails with EOPNOTSUPP as
  killing cgroups is a process directed operation, i.e. it affects
  the whole thread-group.
cgroup.pressure
  A read-write single value file whose allowed values are "0" and "1".
  The default is "1".

  Writing "0" to the file will disable the cgroup PSI accounting.
  Writing "1" to the file will re-enable the cgroup PSI accounting.

  This control attribute is not hierarchical, so disabling or enabling PSI
  accounting in a cgroup does not affect PSI accounting in descendants
  and doesn't need to pass enablement via ancestors from root.

  The reason this control attribute exists is that PSI accounts stalls for
  each cgroup separately and aggregates it at each level of the hierarchy.
  This may cause non-negligible overhead for some workloads when under
  deep level of the hierarchy, in which case this control attribute can
  be used to disable PSI accounting in the non-leaf cgroups.

irq.pressure
  A read-write nested-keyed file.

  Shows pressure stall information for IRQ/SOFTIRQ. See
  :ref:`Documentation/accounting/psi.rst <psi>` for details.
.. _cgroup-v2-cpu:

CPU
---

The "cpu" controller regulates distribution of CPU cycles.
In all the above models, cycles distribution is defined only on a temporal
base and it does not account for the frequency at which tasks are executed.

The cpu controller can only
be enabled when all RT processes are in the root cgroup. Be aware that system
management software may already have placed RT processes into non-root cgroups
during the system boot process, and these processes may need to be moved to the
root cgroup before the cpu controller can be enabled.

The CPU controller's interface files affect the following types of
processes:

* Processes under the fair-class scheduler

* Processes under a BPF scheduler

For details on when a process is under the fair-class scheduler or a BPF scheduler,
check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`.

In the rest of this section, these two kinds of schedulers
will be referred to. All time durations are in microseconds.
CPU Interface Files
~~~~~~~~~~~~~~~~~~~

cpu.stat
  A read-only flat-keyed file.
  This file exists whether the controller is enabled or not.

  It always reports the following three stats, which account for all the
  processes in the cgroup:

  - usage_usec
  - user_usec
  - system_usec

  and the following five when the controller is enabled, which account for
  only the processes under the fair-class scheduler:

  - nr_periods
  - nr_throttled
  - throttled_usec
  - nr_bursts
  - burst_usec
cpu.weight
  A read-write single value file which exists on non-root
  cgroups. The default is "100".

  For non idle groups (cpu.idle = 0), the weight is in the
  range [1, 10000].

  If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
  then the weight will show as a 0.

  This file affects only processes under the fair-class scheduler and a BPF
  scheduler.

cpu.weight.nice
  A read-write single value file which exists on non-root
  cgroups. The default is "0".

  The nice value is in the range [-20, 19].

  This interface file is an alternative interface for "cpu.weight"
  and allows reading and setting weight using the same values used by
  nice(2). Because the range is smaller and granularity is coarser
  for the nice values, the read value is the closest approximation of
  the current weight.

  This file affects only processes under the fair-class scheduler and a BPF
  scheduler.
cpu.max
  A read-write two value file which exists on non-root cgroups.
  The default is "max 100000".

  The maximum bandwidth limit. It's in the following format::

    $MAX $PERIOD

  which indicates that the group may consume up to $MAX in each
  $PERIOD duration. "max" for $MAX indicates no limit. If only
  one number is written, $MAX is updated.

  This file affects only processes under the fair-class scheduler.
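The "$MAX $PERIOD" format translates into an effective CPU fraction. A small parsing sketch (the helper name is illustrative):

```python
def cpu_max_fraction(cpu_max_text):
    """Parse a cpu.max value ("$MAX $PERIOD") and return the CPU
    fraction the cgroup may consume, or None when $MAX is "max"."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None  # no bandwidth limit configured
    return int(quota) / int(period)
```

For example, "50000 100000" allows half a CPU's worth of cycles per period; values above 1.0 are possible on multi-CPU systems.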
cpu.max.burst
  A read-write single value file which exists on non-root
  cgroups. The default is "0".

  The burst in the range [0, $MAX].

  This file affects only processes under the fair-class scheduler.
cpu.pressure
  A read-write nested-keyed file.

  Shows pressure stall information for CPU. See
  :ref:`Documentation/accounting/psi.rst <psi>` for details.

  This file accounts for all the processes in the cgroup.

cpu.uclamp.min
  A read-write single value file which exists on non-root cgroups.
  The default is "0", i.e. no utilization boosting.

  The requested minimum utilization (protection) as a percentage
  rational number, e.g. 12.34 for 12.34%.

  This file affects all the processes in the cgroup.

cpu.uclamp.max
  A read-write single value file which exists on non-root cgroups.
  The default is "max", i.e. no utilization capping.

  The requested maximum utilization (limit) as a percentage rational
  number, e.g. 98.76 for 98.76%.

  This file affects all the processes in the cgroup.

cpu.idle
  A read-write single value file which exists on non-root cgroups.
  The default is 0.

  This is the cgroup analog of the per-task SCHED_IDLE sched policy.
  Setting this value to 1 will make the scheduling policy of the
  cgroup SCHED_IDLE.

  This file affects only processes under the fair-class scheduler.
Memory
------

The "memory" controller regulates distribution of memory. Memory is
stateful and implements both limit and protection models. Due to the
intertwining between memory usage and reclaim pressure and the
stateful nature of memory, the distribution model is relatively
complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent. Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.
Memory Interface Files
~~~~~~~~~~~~~~~~~~~~~~

All memory amounts are in bytes. If a value which is not aligned to
PAGE_SIZE is written, the value may be rounded up to the closest
PAGE_SIZE multiple when read back.

memory.current
  A read-only single value file which exists on non-root
  cgroups.

  The total amount of memory currently being used by the cgroup
  and its descendants.
memory.min
  A read-write single value file which exists on non-root
  cgroups. The default is "0".

  Hard memory protection. If the memory usage of a cgroup
  is within its effective min boundary, the cgroup's memory
  won't be reclaimed under any conditions. If there is no
  unprotected reclaimable memory available, the OOM killer
  is invoked instead.

  Effective min boundary is limited by memory.min values of
  all ancestor cgroups. If there is memory.min overcommitment
  (child cgroup or cgroups are requiring more protected memory
  than the parent will allow), then each child cgroup will get
  the part of the parent's protection proportional to its
  actual memory usage below memory.min.

  Putting more memory than generally available under this
  protection is discouraged and may lead to constant OOMs.

  If a memory cgroup is not populated with processes,
  its memory.min is ignored.
memory.low
  A read-write single value file which exists on non-root
  cgroups. The default is "0".

  Best-effort memory protection. If the memory usage of a
  cgroup is within its effective low boundary, the cgroup's
  memory won't be reclaimed unless there is no reclaimable
  memory available in unprotected cgroups.

  Effective low boundary is limited by memory.low values of
  all ancestor cgroups. If there is memory.low overcommitment
  (child cgroup or cgroups are requiring more protected memory
  than the parent will allow), then each child cgroup will get
  the part of the parent's protection proportional to its
  actual memory usage below memory.low.

  Putting more memory than generally available under this
  protection is discouraged.
memory.high
  A read-write single value file which exists on non-root
  cgroups. The default is "max".

  Memory usage throttle limit. If a cgroup's usage goes
  over the high boundary, the processes of the cgroup are
  throttled and put under heavy reclaim pressure. The high
  limit should be used in scenarios where an external process
  monitors the limited cgroup to alleviate heavy reclaim
  pressure.

  If memory.high is opened with O_NONBLOCK then the synchronous
  reclaim is bypassed. This is useful for admin processes that
  need to dynamically adjust the job's memory limits without
  expending their own CPU resources on memory reclamation. The
  job will trigger the reclaim itself the next time it charges
  memory. Please note that with O_NONBLOCK the usage of the
  target memory cgroup may take an indefinite amount of time to
  drop below the limit, e.g. when the workload keeps
  busy-hitting its memory to slow down reclaim.
memory.max
  A read-write single value file which exists on non-root
  cgroups. The default is "max".

  Memory usage hard limit. This is the main mechanism to limit
  memory usage of a cgroup. If a cgroup's memory usage reaches
  this limit and can't be reduced, the OOM killer is invoked in
  the cgroup.

  In default configuration regular 0-order allocations always
  succeed unless the OOM killer chooses the current task as a
  victim. Some kinds of allocations don't invoke the OOM killer;
  the caller may retry them, return into userspace
  as -ENOMEM or silently ignore in cases like disk readahead.

  If memory.max is opened with O_NONBLOCK, then the synchronous
  reclaim and oom-kill are bypassed. This is useful for admin
  processes that need to dynamically adjust the job's memory limits
  without expending their own CPU resources on memory reclamation.
  The job will trigger the reclaim and/or oom-kill on its next
  charge request. Please note that with O_NONBLOCK the usage of the
  target memory cgroup may take an indefinite amount of time to
  drop below the limit, e.g. when the workload keeps
  busy-hitting its memory to slow down reclaim.
memory.reclaim
  A write-only nested-keyed file which exists for all cgroups.

  This is a simple interface to trigger memory reclaim in the
  target cgroup. Example::

    echo "1G" > memory.reclaim

  Please note that the kernel can over or under reclaim from
  the target cgroup. If fewer bytes are reclaimed than the
  specified amount, -EAGAIN is returned.

  Please note that the proactive reclaim (triggered by this
  interface) is not meant to indicate memory pressure on the
  memory cgroup. Therefore socket memory balancing triggered by
  the memory reclaim normally is not exercised in this case,
  i.e. the networking layer will not adapt based on
  reclaim induced by memory.reclaim.

  The valid range for the "swappiness" argument is [0, 200] plus
  the special value "max"; setting
  swappiness=max exclusively reclaims anonymous memory.
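Writes to memory.reclaim (and the other byte-valued memory files) take byte counts, and the kernel's parser also accepts K/M/G size suffixes as in the "1G" example above. A userspace helper for composing such values might look like this (an illustrative sketch, not kernel tooling):

```python
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def to_bytes(size):
    """Convert a human-readable size such as "1G" into a byte count
    suitable for writing to files like memory.reclaim or memory.max."""
    size = size.strip()
    suffix = size[-1].upper()
    if suffix in _SUFFIXES:
        return int(size[:-1]) * _SUFFIXES[suffix]
    return int(size)
```

Writing the pre-converted byte value avoids depending on the kernel's suffix parsing when the same number is also used for userspace bookkeeping.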
memory.peak
  A read-write single value file which exists on non-root cgroups.

  The max memory usage recorded for the cgroup and its descendants since
  either the creation of the cgroup or the most recent reset for that FD.

  A write of any non-empty string to this file resets it to the
  current memory usage for subsequent reads through the same
  file descriptor.
memory.oom.group
  A read-write single value file which exists on non-root
  cgroups. The default value is "0".

  Determines whether the cgroup should be treated as an
  indivisible workload by the OOM killer. If set, all tasks
  belonging to the cgroup or to its descendants
  (if the memory cgroup is not a leaf cgroup) are killed
  together or not at all. This can be used to avoid partial
  kills to guarantee workload integrity.

  Tasks with the OOM protection (oom_score_adj set to -1000)
  are treated as an exception and are never killed.

  If the OOM killer is invoked in a cgroup, it's not going
  to kill any tasks outside of this cgroup, regardless of the
  memory.oom.group values of ancestor cgroups.
memory.events
  A read-only flat-keyed file which exists on non-root cgroups.
  The following entries are defined. Unless specified
  otherwise, a value change in this file generates a file
  modified event.

  Note that all fields in this file are hierarchical and the
  file modified event can be generated due to an event down the
  hierarchy. For the local events at the cgroup level see
  memory.events.local.

  low
    The number of times the cgroup is reclaimed due to
    high memory pressure even though its usage is under
    the low boundary. This usually indicates that the low
    boundary is over-committed.

  high
    The number of times processes of the cgroup are
    throttled and routed to perform direct memory reclaim
    because the high memory boundary was exceeded. For a
    cgroup whose memory usage is capped by the high limit
    rather than global memory pressure, this event's
    occurrences are expected.

  max
    The number of times the cgroup's memory usage was
    about to go over the max boundary.

  oom
    The number of times the cgroup's memory usage
    reached the limit and allocation was about to fail.
    This event is not raised if the OOM killer is not
    considered as an option, e.g. for failed high-order
    allocations or if the caller asked to not retry attempts.
memory.events.local
  Similar to memory.events but the fields in the file are local
  to the cgroup, i.e. not hierarchical. The file modified event
  generated on this file reflects only the local events.
memory.stat
  A read-only flat-keyed file which exists on non-root cgroups.

  This breaks down the cgroup's memory footprint into different
  types of memory, type-specific details, and other information
  on the state and past events of the memory management system.

  All memory amounts are in bytes.

  The entries are ordered to be human readable, and new entries
  can show up in the middle. Don't rely on items remaining in a
  fixed position; use the keys to look up specific values!

  If the entry has no per-node counter (or does not show in the
  memory.numa_stat), we use 'npn' (non-per-node) as the tag
  to indicate that it will not show in the memory.numa_stat.
  anon
    Amount of memory used in anonymous mappings such as
    brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that some
    kernel configurations might account complete larger
    allocations (e.g., THP) even if only some, but not all the
    memory of such an allocation is mapped anymore.

  file
    Amount of memory used to cache filesystem data,
    including tmpfs and shared memory.

  kernel (npn)
    Amount of total kernel memory, including
    (kernel_stack, pagetables, percpu, vmalloc, slab) in
    addition to other kernel memory use cases.

  kernel_stack
    Amount of memory allocated to kernel stacks.

  pagetables
    Amount of memory allocated for page tables.

  sec_pagetables
    Amount of memory allocated for secondary page tables,
    such as KVM mmu and IOMMU page tables.

  percpu (npn)
    Amount of memory used for storing per-cpu kernel
    data structures.

  sock (npn)
    Amount of memory used in network transmission buffers.

  vmalloc (npn)
    Amount of memory used for vmap backed memory.

  shmem
    Amount of cached filesystem data that is swap-backed,
    such as tmpfs, shm segments, and shared anonymous mmap()s.

  zswap
    Amount of memory consumed by the zswap compression backend.

  zswapped
    Amount of application memory swapped out to zswap.

  file_mapped
    Amount of cached filesystem data mapped with mmap(). Note
    that some kernel configurations might account complete
    larger allocations (e.g., THP) even if only some, but
    not all the memory of such an allocation is mapped.

  swapcached
    Amount of swap cached in memory. The swapcache is accounted
    against both memory and swap usage.

  anon_thp
    Amount of memory used in anonymous mappings backed by
    transparent hugepages.

  inactive_anon, active_anon, inactive_file, active_file, unevictable
    Amount of memory, swap-backed and filesystem-backed,
    on the internal memory management lists used by the
    page reclaim algorithm.

    As these represent internal list state (e.g. shmem pages are on anon
    memory management lists), inactive_foo + active_foo may not be equal to
    the value for the foo counter, since the foo counter is type-based, not
    list-based.

  slab_reclaimable
    Part of "slab" that might be reclaimed, such as
    dentries and inodes.

  slab_unreclaimable
    Part of "slab" that cannot be reclaimed on memory
    pressure.

  slab (npn)
    Amount of memory used for storing in-kernel data
    structures.

  pswpin (npn)
    Number of pages swapped into memory.

  pswpout (npn)
    Number of pages swapped out of memory.

  pgscan (npn)
    Amount of scanned pages (in an inactive LRU list).

  pgscan_kswapd (npn)
    Amount of scanned pages by kswapd (in an inactive LRU list).

  pgscan_direct (npn)
    Amount of scanned pages directly (in an inactive LRU list).

  pgscan_khugepaged (npn)
    Amount of scanned pages by khugepaged (in an inactive LRU list).

  pgscan_proactive (npn)
    Amount of scanned pages proactively (in an inactive LRU list).

  pgrefill (npn)
    Amount of scanned pages (in an active LRU list).

  pglazyfree (npn)
    Amount of pages postponed to be freed under memory pressure.

  swpin_zero
    Number of pages swapped into memory and filled with zero, where I/O
    was skipped because the page content was detected to be zero
    during swapout.

  swpout_zero
    Number of zero-filled pages swapped out with I/O skipped due to the
    content being detected as zero.

  zswpin
    Number of pages moved in to memory from zswap.

  zswpout
    Number of pages moved out of memory to zswap.

  thp_swpout (npn)
    Number of transparent hugepages which are swapped out in one piece
    without splitting.

  hugetlb
    Amount of memory used by hugetlb pages. This metric only shows
    up if hugetlb usage is accounted for in memory.current (i.e.
    cgroup is mounted with the memory_hugetlb_accounting option).
memory.numa_stat
  A read-only nested-keyed file which exists on non-root cgroups.

  This breaks down the cgroup's memory footprint into different
  types of memory, type-specific details, and other information
  per node on the state of the memory management system.

  All memory amounts are in bytes.

  The output format of memory.numa_stat is::

    type N0=<bytes in node 0> N1=<bytes in node 1> ...

  The entries are ordered to be human readable, and new entries
  can show up in the middle. Don't rely on items remaining in a
  fixed position; use the keys to look up specific values!

  The entries can refer to the memory.stat.
memory.swap.current
  A read-only single value file which exists on non-root
  cgroups.

  The total amount of swap currently being used by the cgroup
  and its descendants.

memory.swap.high
  A read-write single value file which exists on non-root
  cgroups. The default is "max".

  Swap usage throttle limit. If a cgroup's swap usage exceeds
  this limit, all its further allocations will be throttled to
  allow userspace to implement custom out-of-memory procedures.

  This limit marks a point of no return for the cgroup. It is NOT
  designed to manage the amount of swapping a workload does
  during regular operation. Compare to memory.swap.max, which
  prohibits swapping past a set amount, but lets the cgroup
  continue unimpeded as long as other memory can be reclaimed.

  Healthy workloads are not expected to reach this limit.
memory.swap.peak
  A read-write single value file which exists on non-root cgroups.

  The max swap usage recorded for the cgroup and its descendants since
  the creation of the cgroup or the most recent reset for that FD.

  A write of any non-empty string to this file resets it to the
  current memory usage for subsequent reads through the same
  file descriptor.

memory.swap.max
  A read-write single value file which exists on non-root
  cgroups. The default is "max".

  Swap usage hard limit. If a cgroup's swap usage reaches this
  limit, anonymous memory of the cgroup will not be swapped out.
memory.swap.events
  A read-only flat-keyed file which exists on non-root cgroups.
  The following entries are defined. Unless specified
  otherwise, a value change in this file generates a file
  modified event.

  max
    The number of times the cgroup's swap usage was about
    to go over the max boundary and swap allocation failed.

  fail
    The number of times swap allocation failed either
    because of running out of swap system-wide or the max
    limit.

  When reduced under the current usage, the existing swap
  entries are reclaimed gradually and the swap usage may stay
  higher than the limit for an extended period of time. This
  reduces the impact on the workload and memory management.
memory.zswap.current
  A read-only single value file which exists on non-root
  cgroups.

  The total amount of memory consumed by the zswap compression
  backend.

memory.zswap.max
  A read-write single value file which exists on non-root
  cgroups. The default is "max".

  Zswap usage hard limit. If a cgroup's zswap pool reaches this
  limit, it will refuse to take any more stores before existing
  entries fault back in or are written out to disk.

memory.zswap.writeback
  A read-write single value file. The default value is "1".
  Note that this setting is hierarchical, i.e. the writeback would be
  implicitly disabled for child cgroups if the upper hierarchy
  does so.

  Note that this is subtly different from setting memory.swap.max to
  0, as it still allows for pages to be written to the zswap pool.
  This setting has no effect if zswap is disabled, and swapping
  is allowed unless memory.swap.max is set to 0.
memory.pressure
  A read-only nested-keyed file.

  Shows pressure stall information for memory. See
  :ref:`Documentation/accounting/psi.rst <psi>` for details.
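PSI files share a fixed line format: "some" and "full" lines carrying avg10/avg60/avg300 averages (percentages) and a cumulative "total" in microseconds. A parsing sketch (function name is illustrative):

```python
def parse_psi(text):
    """Parse a pressure file (e.g. memory.pressure) into a dict like
    {"some": {"avg10": ..., "avg60": ..., "avg300": ..., "total": ...}}."""
    result = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        entry = {}
        for field in fields:
            key, value = field.split("=")
            entry[key] = int(value) if key == "total" else float(value)
        result[kind] = entry
    return result
```

The same parser applies to cpu.pressure and io.pressure, which use the identical layout.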
1894 "memory.high" is the main mechanism to control memory usage.
1895 Over-committing on high limit (sum of high limits > available memory)
1896 and letting global memory pressure to distribute memory according to
1902 more memory or terminating the workload.
1904 Determining whether a cgroup has enough memory is not trivial as
1905 memory usage doesn't indicate whether the workload can benefit from
1906 more memory. For example, a workload which writes data received from
1907 network to a file can use all available memory but can also operate as
1908 performant with a small amount of memory. A measure of memory
1909 pressure - how much the workload is being impacted due to lack of
1910 memory - is necessary to determine whether a workload needs more
1911 memory; unfortunately, memory pressure monitoring mechanism isn't
Memory Ownership
~~~~~~~~~~~~~~~~

A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.

A memory area may be used by processes belonging to different cgroups.
To which cgroup the area will be charged is non-deterministic; however,
over time, the memory area is likely to end up in a cgroup which has
enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.
IO
--

The "io" controller regulates the distribution of IO resources. Note
that weight based distribution is available
only if cfq-iosched is in use and neither scheme is available for
blk-mq devices.

IO Interface Files
~~~~~~~~~~~~~~~~~~

io.stat
  A read-only nested-keyed file.

  Lines are keyed by $MAJ:$MIN device numbers and not ordered.

io.cost.qos
  A read-write nested-keyed file which exists only on the root
  cgroup.

  This file configures the Quality of Service of the IO cost
  model based controller. Among its parameters:

  enable
    Weight-based control enable

  The QoS parameters are useful for
  devices which show wide temporary behavior changes - e.g. an
  SSD which accepts writes at a high rate for a while and then
  slows down.

io.cost.model
  A read-write nested-keyed file which exists only on the root
  cgroup.

  This file configures the cost model of the IO cost model based
  controller. Among its parameters:

  model
    The cost model in use - "linear"

  The IO cost model isn't expected to be accurate in absolute
  sense and is scaled to the device behavior dynamically. Tools
  are available in the kernel tree to
  generate device-specific coefficients.
io.weight
  A read-write flat-keyed file which exists on non-root cgroups.
  The default is "default 100".

  The first line is the default weight applied to devices
  without specific override. The rest are overrides keyed by
  $MAJ:$MIN device numbers and not ordered. The weights are in
  the range [1, 10000] and specify the relative amount of IO time
  the cgroup can use in relation to its siblings.
io.max
  A read-write nested-keyed file which exists on non-root
  cgroups.

  BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
  device numbers and not ordered. The following nested keys are
  defined.

  rbps
    Max read bytes per second
  wbps
    Max write bytes per second
  riops
    Max read IO operations per second
  wiops
    Max write IO operations per second

  When writing, any number of nested key-value pairs can be
  specified in any order. "max" can be specified as the value
  to remove a specific limit.

  BPS and IOPS are measured in each IO direction and IOs are
  delayed if the limit is reached. Temporary bursts are allowed.
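A reader of io.max has to handle the nested-keyed format where each value is either an integer or the literal "max". A parsing sketch (function name is illustrative):

```python
def parse_io_max_line(line):
    """Parse one io.max line, e.g. "8:16 rbps=2097152 wbps=max ...".

    Returns ($MAJ:$MIN, {key: int or "max"}); keys may appear in any
    order and keys absent from the line simply stay absent.
    """
    device, *pairs = line.split()
    limits = {}
    for pair in pairs:
        key, value = pair.split("=")
        limits[key] = value if value == "max" else int(value)
    return device, limits
```

The same key=value splitting applies to other nested-keyed files; only the set of recognized keys differs per file.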
io.pressure
  A read-only nested-keyed file.

  Shows pressure stall information for IO. See
  :ref:`Documentation/accounting/psi.rst <psi>` for details.
2135 mechanism. Writeback sits between the memory and IO domains and
2136 regulates the proportion of dirty memory by balancing dirtying and
2139 The io controller, in conjunction with the memory controller,
2140 implements control of page cache writeback IOs. The memory controller
2141 defines the memory domain that dirty memory ratio is calculated and
2143 writes out dirty pages for the memory domain. Both system-wide and
2144 per-cgroup dirty memory states are examined and the more restrictive
2152 There are inherent differences in memory and writeback management
2153 which affects how cgroup ownership is tracked. Memory is tracked per
2158 As cgroup ownership for memory is tracked per page, there can be pages
2168 inode simultaneously are not supported well. In such circumstances, a
2170 As memory controller assigns page ownership on the first use and
2181 amount of available memory capped by limits imposed by the
2182 memory controller and system-wide clean memory.
total available memory and applied the same way as
The limits are only applied at the peer level in the hierarchy. This means that
in the diagram below, only groups A, B, and C will influence each other, and
So the ideal way to configure this is to set io.latency in groups A, B, and C.
avg_lat value in io.stat for your workload group to get an idea of the
your real setting, setting it 10-15% higher than the value in io.stat.
- Queue depth throttling. This is the number of outstanding IOs a group is
- Artificial delay induction. There are certain types of IO that cannot be
fields in io.stat increase. The delay value is how many microseconds that are
being added to any process that runs in this group. Because this number can
"MAJOR:MINOR target=<target time in microseconds>"
If the controller is enabled you will see extra stats in io.stat in
calculated by multiplying the win value in io.stat by the
The sampling window size in milliseconds. This is the minimum
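Putting the "MAJOR:MINOR target=<microseconds>" format above into
practice, a sketch with a hypothetical device 8:16::

```shell
# Ask the io.latency controller to target a 10ms (10000us)
# completion latency for device 8:16 in this cgroup.
echo "8:16 target=10000" > io.latency

# Watch avg_lat and depth for this device to tune the target.
grep "^8:16" io.stat
```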
no-change
promote-to-rt
For requests that have a non-RT I/O priority class, change it into RT.
restrict-to-be
none-to-rt
Deprecated. Just an alias for promote-to-rt.
+----------------+---+
| no-change      | 0 |
+----------------+---+
| promote-to-rt  | 1 |
+----------------+---+
| restrict-to-be | 2 |
+----------------+---+
+----------------+---+
+-------------------------------+---+
+-------------------------------+---+
| IOPRIO_CLASS_RT (real-time)   | 1 |
+-------------------------------+---+
+-------------------------------+---+
+-------------------------------+---+
- If I/O priority class policy is promote-to-rt, change the request I/O
- If I/O priority class policy is not promote-to-rt, translate the I/O priority
---
The number of tasks in a cgroup can be exhausted in ways which other
hitting memory restrictions.
Note that PIDs used in this controller refer to TIDs, process IDs as
A read-write single value file which exists on non-root
A read-only single value file which exists on non-root cgroups.
The number of processes currently in the cgroup and its
A read-only single value file which exists on non-root cgroups.
The maximum value that the number of processes in the cgroup and its
A read-only flat-keyed file which exists on non-root cgroups. Unless
specified otherwise, a value change in this file generates a file
Similar to pids.events but the fields in the file are local
through fork() or clone(). These will return -EAGAIN if the creation
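As a sketch, capping a cgroup (the path is hypothetical) so that task
creation beyond the limit fails with -EAGAIN::

```shell
# Allow at most 64 tasks in this cgroup and its descendants.
echo 64 > /sys/fs/cgroup/workload/pids.max

# Inspect how close the cgroup is to the limit.
cat /sys/fs/cgroup/workload/pids.current
```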
------
the CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
memory placement to reduce cross-node memory access and contention
cannot use CPUs or memory nodes not allowed in its parent.
A read-write multiple values file which exists on non-root
cpuset-enabled cgroups.
The CPU numbers are comma-separated numbers or ranges.
0-4,6,8-10
setting as the nearest cgroup ancestor with a non-empty
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
"cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
can be granted. In this case, it will be treated just like an
A read-write multiple values file which exists on non-root
cpuset-enabled cgroups.
It lists the requested memory nodes to be used by tasks within
this cgroup. The actual list of memory nodes granted, however,
from the requested memory nodes.
The memory node numbers are comma-separated numbers or ranges.
0-1,3
setting as the nearest cgroup ancestor with a non-empty
"cpuset.mems" or all the available memory nodes if none
and won't be affected by any memory nodes hotplug events.
Setting a non-empty value to "cpuset.mems" causes memory of
they are currently using memory outside of the designated nodes.
There is a cost for this memory migration. The migration
may not be complete and some memory pages may be left behind.
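Tying the two list formats together, a sketch with a hypothetical
cgroup path::

```shell
# Restrict tasks in this cgroup to CPUs 0-4 and 6, and to
# memory nodes 0 and 1, using the comma-separated range format.
echo "0-4,6" > /sys/fs/cgroup/workload/cpuset.cpus
echo "0-1"   > /sys/fs/cgroup/workload/cpuset.mems

# The effective files show what the parent actually granted.
cat /sys/fs/cgroup/workload/cpuset.cpus.effective
```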
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
It lists the onlined memory nodes that are actually granted to
this cgroup by its parent. These memory nodes are allowed to
If "cpuset.mems" is empty, it shows all the memory nodes from the
the memory nodes listed in "cpuset.mems" can be granted. In this
Its value will be affected by memory nodes hotplug events.
A read-write multiple values file which exists on non-root
cpuset-enabled cgroups.
CPUs that are allocated to that partition are listed in
"cpuset.cpus". One constraint in setting it is that the list of
exclusive CPU appearing in two or more of its child cgroups is
are in its exclusive CPU set.
A read-only multiple values file which exists on all non-root
cpuset-enabled cgroups.
treated to have an implicit value of "cpuset.cpus" in the
A read-only and root cgroup only multiple values file.
This file shows the set of all isolated CPUs used in existing
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
"member" Non-root member of a partition
A cpuset partition is a collection of cpuset-enabled cgroups with
of that partition cannot use any CPUs in that set.
There are two types of partitions - local and remote. A local
be changed. All other non-root cgroups start out as "member".
When set to "isolated", the CPUs in that partition will be in
and excluded from the unbound workqueues. Tasks placed in such
A partition root ("root" or "isolated") can be in one of the
two possible states - valid or invalid. An invalid partition
root is in a degraded state where some state information may
"member" Non-root member of a partition
In the case of an invalid partition root, a descriptive string on
A valid non-root parent partition may distribute out all its CPUs
invalid causing disruption to tasks running in those child
value in "cpuset.cpus" or "cpuset.cpus.exclusive".
A user can pre-configure certain CPUs to an isolated state
into a partition, they have to be used in an isolated partition.
-----------------
on the return value the attempt will succeed or fail with -EPERM.
If the program returns 0, the attempt fails with -EPERM, otherwise it
An example of a BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
----
A read-write nested-keyed file that exists for all the cgroups
A read-only file that describes current resource usage.
----
device memory regions. Because each memory region may have its own page size,
A read-write nested-keyed file that exists for all the cgroups
The semantics are the same as for the memory cgroup controller, and are
calculated in the same way.
A read-only file that describes maximum region capacity.
It only exists on the root cgroup. Not all memory can be
A read-only file that describes current resource usage.
-------
A read-only flat-keyed file which exists on non-root cgroups.
Similar to hugetlb.<hugepagesize>.events but the fields in the file
Similar to memory.numa_stat, it shows the numa information of the
hugetlb pages of <hugepagesize> in this cgroup. Only actively in-use
hugetlb pages are included. The per-node values are in bytes.
----
A resource can be added to the controller via enum misc_res_type{} in the
in the kernel/cgroup/misc.c file. The provider of the resource must set its
uncharge APIs. All of the APIs to interact with the misc controller are in
A read-only flat-keyed file shown only in the root cgroup. It shows
A read-only flat-keyed file shown in all cgroups. It shows
the current usage of the resources in the cgroup and its children.::
A read-only flat-keyed file shown in all cgroups. It shows the
historical maximum usage of the resources in the cgroup and its
A read-write flat-keyed file shown in non-root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::
Limits can be set higher than the capacity value in the misc.capacity
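A sketch of setting a misc limit; the resource name "res_a" is
illustrative and must match an entry reported by misc.capacity::

```shell
# misc.max takes "$RESOURCE $MAX" pairs; writing "max" removes
# the limit for that resource again.
echo "res_a 3" > misc.max

# Compare against the current usage of the cgroup's subtree.
cat misc.current
```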
A read-only flat-keyed file which exists on non-root cgroups. The
change in this file generates a file modified event. All fields in
Similar to misc.events but the fields in the file are local to the
A miscellaneous scalar resource is charged to the cgroup in which it is used
------
Non-normative information
-------------------------
When distributing CPU cycles in the root cgroup each thread in this
cgroup is treated as if it was hosted in a separate child cgroup of the
For details of this mapping see sched_prio_to_weight array in
appropriately so the neutral - nice 0 - value is 100 instead of 1024).
Root cgroup processes are hosted in an implicit leaf child node.
------
complete path of the cgroup of a process. In a container setup where
The path '/batchjobs/container_id1' can be considered as system-data
# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
When some thread from a multi-threaded process unshares its cgroup
------------------
The 'cgroupns root' for a cgroup namespace is the cgroup in which the
process calling unshare(2) is running. For example, if a process in
# ~/unshare -c # unshare cgroupns in some cgroup
Each process gets its namespace-specific view of "/proc/$PID/cgroup"
cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
----------------------
---------------------------------
running inside a non-init cgroup namespace::
# mount -t cgroup2 none $MOUNT_POINT
the view of cgroup hierarchy by namespace-private cgroupfs mount
This section contains kernel programming information in the areas
--------------------------------
address_space_operations->writepages() to annotate bios using the
super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
- Multiple hierarchies including named ones are not supported.
- None of the v1 mount options are supported.
- The "tasks" file is removed and "cgroup.procs" is not sorted.
- "cgroup.clone_children" is removed.
- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
--------------------
provide a high level of flexibility, it wasn't useful in practice.
type controllers such as freezer which can be useful in all
hierarchies could only be used in one. The issue is exacerbated by
In practice, these issues heavily limited which controllers could be
used in general and what controllers were able to do.
that a thread's cgroup membership couldn't be described in finite
in length, which made it highly awkward to manipulate and led to
which in turn exacerbated the original problem of proliferating number
In most use cases, putting controllers on hierarchies which are
depending on the specific controller. In other words, hierarchy may
how memory is distributed beyond a certain level while still wanting
------------------
Generally, in-process knowledge is available only to the process
itself; thus, unlike service-level organization of processes,
in combination with thread granularity. cgroups were delegated to
sub-hierarchies and control resource distributions along them. This
effectively raised cgroup to the status of a syscall-like API exposed
that the process would actually be operating on its own sub-hierarchy.
system-management pseudo filesystem. cgroup ended up with interface
individual applications through the ill-defined delegation mechanism
-------------------------------------------
cgroup v1 allowed threads to be in any cgroups which created an
cycles and the number of internal threads fluctuated - the ratios
The memory controller didn't have a way to control what happened
clearly defined. There were attempts to add ad-hoc behaviors and
led to problems extremely difficult to resolve in the long term.
in a uniform way.
----------------------
was how an empty cgroup was notified - a userland helper binary was
to in-kernel event delivery filtering mechanism further complicating
formats and units even in the same controller.
------------------------------
Memory
global reclaim prefers is opt-in, rather than opt-out. The costs for
hierarchical meaning. All configured groups are organized in a global
in the hierarchy. This makes subtree delegation impossible. Second,
becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated
available memory. The memory consumption of workloads varies during
estimation is hard and error-prone, and getting it wrong results in
The memory.high boundary on the other hand can be set much more
and make corrections until the minimal memory footprint that still
In extreme cases, with many concurrent allocations and a complete
allocation from the slack available in other groups or the rest of the
system than killing the group. Otherwise, memory.max is there to
Setting the original memory.limit_in_bytes below the current usage was
limit setting to fail. memory.max on the other hand will first set the
new limit is met - or the task writing to memory.max is killed.
The combined memory+swap accounting and limiting is replaced by real
The main argument for a combined memory+swap facility in the original
able to swap all anonymous memory of a child group, regardless of the
groups can sabotage swapping by other means - such as referencing its
anonymous memory in a tight loop - and an admin cannot assume full
intuitive userspace interface, and it flies in the face of the idea
resources. Swap space is a resource like all others in the system,