Lines Matching +full:memory +full:- +full:controllers
1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
46 5. Controllers
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. DMEM
68 5-9. HugeTLB
69 5-9-1. HugeTLB Interface Files
70 5-10. Misc
71 5-10-1. Miscellaneous cgroup Interface Files
72 5-10-2. Migration and Ownership
73 5-11. Others
74 5-11-1. perf_event
75 5-N. Non-normative information
76 5-N-1. CPU controller root cgroup process behaviour
77 5-N-2. IO controller root cgroup process behaviour
79 6-1. Basics
80 6-2. The Root and Views
81 6-3. Migration and setns(2)
82 6-4. Interaction with Other Namespaces
84 P-1. Filesystem Support for Writeback
87 R-1. Multiple Hierarchies
88 R-2. Thread Granularity
89 R-3. Competition Between Inner Nodes and Threads
90 R-4. Other Interface Issues
91 R-5. Controller Issues and Remedies
92 R-5-1. Memory
99 -----------
103 qualifier as in "cgroup controllers". When explicitly referring to
108 ---------------
114 cgroup is largely composed of two parts - the core and controllers.
118 although there are utility controllers which serve purposes other than
128 Following certain structural constraints, controllers may be enabled or
130 hierarchical - if a controller is enabled on a cgroup, it affects all
132 sub-hierarchy of the cgroup. When a controller is enabled on a nested
142 --------
147 # mount -t cgroup2 none $MOUNT_POINT
150 controllers which support v2 and are not bound to a v1 hierarchy are
152 Controllers which are not in active use in the v2 hierarchy can be
157 is no longer referenced in its current hierarchy. Because per-cgroup
158 controller states are destroyed asynchronously and controllers may
164 to inter-controller dependencies, other controllers may need to be
168 controllers dynamically between the v2 and other hierarchies is
171 controllers after system boot.
174 automount the v1 cgroup filesystem and so hijack all controllers
177 disabling controllers in v1 and make them always available in v2.
185 ignored on non-init namespace mounts. Please refer to the
193 controllers, and then seeding it with CLONE_INTO_CGROUP is
197 Only populate memory.events with data for the current cgroup,
202 option is ignored on non-init namespace mounts.
205 Recursively apply memory.min and memory.low protection to
210 behavior but is a mount-option to avoid regressing setups
215 Count HugeTLB memory usage towards the cgroup's overall
216 memory usage for the memory controller (for the purpose of
217 statistics reporting and memory protection). This is a new
223 * There is no HugeTLB pool management involved in the memory
224 controller. The pre-allocated pool does not belong to anyone.
227 memory controller. It is only charged to a cgroup when it is
228 actually used (e.g. at page fault time). Host memory
232 * Failure to charge a HugeTLB folio to the memory controller
236 * Charging HugeTLB memory towards the memory controller affects
237 memory protection and reclaim dynamics. Any userspace tuning
240 will not be tracked by the memory controller (even if cgroup
244 The option restores v1-like behavior of pids.events:max, that is only
252 --------------------------------
258 A child cgroup can be created by creating a sub-directory::
263 structure. Each cgroup has a read-writable interface file
265 belong to the cgroup one-per-line. The PIDs are not ordered and the
296 0::/test-cgroup/test-cgroup-nested
303 0::/test-cgroup/test-cgroup-nested (deleted)
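Entries such as ``0::/test-cgroup/test-cgroup-nested (deleted)`` are what "/proc/$PID/cgroup" reports for cgroup v2: a hierarchy ID of 0, an empty controller list, and the cgroup path, optionally followed by " (deleted)". The following is a minimal Python sketch of splitting such a line. The function name ``parse_cgroup_line`` and the dictionary keys are hypothetical and chosen only for illustration.

```python
def parse_cgroup_line(line):
    """Split one "/proc/$PID/cgroup" line into its three colon-separated fields.

    Hypothetical helper; the returned keys are illustrative names only.
    """
    # Split on the first two colons only, so a path containing ':' survives.
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)

    # A removed cgroup is reported with a trailing " (deleted)" marker.
    suffix = " (deleted)"
    deleted = path.endswith(suffix)
    if deleted:
        path = path[: -len(suffix)]

    return {
        "hierarchy_id": int(hierarchy_id),
        # cgroup v2 entries carry an empty controller list.
        "controllers": controllers.split(",") if controllers else [],
        "path": path,
        "deleted": deleted,
    }
```

The sketch assumes well-formed input and does not handle a cgroup whose own name happens to end in " (deleted)".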
309 cgroup v2 supports thread granularity for a subset of controllers to
317 Controllers which support thread mode are called threaded controllers.
318 The ones which don't are called domain controllers.
329 constraint - threaded controllers can be enabled on non-leaf cgroups
353 - As the cgroup will join the parent's resource domain. The parent
356 - When the parent is an unthreaded domain, it must not have any domain
357 controllers enabled or populated domain children. The root is
360 Topology-wise, a cgroup can be in an invalid state. Please consider
363 A (threaded domain) - B (threaded) - C (domain, just created)
372 cgroup becomes threaded or threaded controllers are enabled in the
378 threads in the cgroup. Except that the operations are per-thread
379 instead of per-process, "cgroup.threads" has the same format and
393 Only threaded controllers can be enabled in a threaded subtree. When
401 between threads in a non-leaf cgroup and its child cgroups. Each
404 Currently, the following controllers are threaded and can be enabled
407 - cpu
408 - cpuset
409 - perf_event
410 - pids
413 --------------------------
415 Each non-root cgroup has a "cgroup.events" file which contains
416 "populated" field indicating whether the cgroup's sub-hierarchy has
420 example, to start a clean-up operation after all processes of a given
421 sub-hierarchy have exited. The populated state updates and
422 notifications are recursive. Consider the following sub-hierarchy
426 A(4) - B(0) - C(1)
435 Controlling Controllers
436 -----------------------
441 Each cgroup has a "cgroup.controllers" file which lists all
442 controllers available for the cgroup to enable::
444 # cat cgroup.controllers
445 cpu io memory
447 No controller is enabled by default. Controllers can be enabled and
450 # echo "+cpu +memory -io" > cgroup.subtree_control
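The string written to "cgroup.subtree_control" above is a space separated list of controller names, each prefixed with '+' to enable or '-' to disable. As a rough illustration, a management tool could assemble that string as sketched below in Python. The helper name ``subtree_control_request`` is hypothetical, and the sketch only builds the text; it does not write to any file.

```python
def subtree_control_request(enable=(), disable=()):
    """Build the text to write to "cgroup.subtree_control".

    Hypothetical helper: '+name' enables a controller, '-name' disables it.
    """
    # Asking to both enable and disable one controller is almost certainly
    # a caller bug, so this sketch rejects it up front.
    overlap = set(enable) & set(disable)
    if overlap:
        raise ValueError("controllers both enabled and disabled: %s"
                         % ", ".join(sorted(overlap)))

    tokens = ["+" + name for name in enable]
    tokens += ["-" + name for name in disable]
    return " ".join(tokens)
```

For example, ``subtree_control_request(["cpu", "memory"], ["io"])`` yields the same string as the shell command above.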
452 Only controllers which are listed in "cgroup.controllers" can be
459 Consider the following sub-hierarchy. The enabled controllers are
462 A(cpu,memory) - B(memory) - C()
465 As A has "cpu" and "memory" enabled, A will control the distribution
466 of CPU cycles and memory to its children, in this case, B. As B has
467 "memory" enabled but not "cpu", C and D will compete freely on CPU
468 cycles but their division of memory available to B will be controlled.
474 D. Likewise, disabling "memory" from B would remove the "memory."
476 controller interface files - anything which doesn't start with
480 Top-down Constraint
483 Resources are distributed top-down and a cgroup can further distribute
485 parent. This means that all non-root "cgroup.subtree_control" files
486 can only contain controllers which are enabled in the parent's
495 Non-root cgroups can distribute domain resources to their children
498 controllers enabled in their "cgroup.subtree_control" files.
508 controllers. How resource consumption in the root cgroup is governed
510 refer to the Non-normative information section in the Controllers
518 children before enabling controllers in its "cgroup.subtree_control"
523 ----------
545 delegated, the user can build a sub-hierarchy under the directory,
548 of all resource controllers are hierarchical and regardless of what
549 happens in the delegated sub-hierarchy, nothing can escape the
553 cgroups in or nesting depth of a delegated sub-hierarchy; however,
560 A delegated sub-hierarchy is contained in the sense that processes
561 can't be moved into or out of the sub-hierarchy by the delegatee.
564 requiring the following conditions for a process with a non-root euid
568 - The writer must have write access to the "cgroup.procs" file.
570 - The writer must have write access to the "cgroup.procs" file of the
574 processes around freely in the delegated sub-hierarchy it can't pull
575 in from or push out to outside the sub-hierarchy.
581 ~~~~~~~~~~~~~ - C0 - C00
584 ~~~~~~~~~~~~~ - C1 - C10
591 will be denied with -EACCES.
596 is not reachable, the migration is rejected with -ENOENT.
600 ----------
606 and stateful resources such as memory are not moved together with the
608 inherent trade-offs between migration and various hot paths in terms
614 resource structure once on start-up. Dynamic adjustments to resource
641 cgroup controllers implement several resource distribution schemes
647 -------
653 work-conserving. Due to the dynamic nature, this model is usually
668 .. _cgroupv2-limits-distributor:
671 ------
674 Limits can be over-committed - the sum of the limits of children can
679 As limits can be over-committed, all configuration combinations are
686 .. _cgroupv2-protections-distributor:
689 -----------
694 soft boundaries. Protections can also be over-committed in which case
701 As protections can be over-committed, all configuration combinations
705 "memory.low" implements best-effort memory protection and is an
710 -----------
713 resource. Allocations can't be over-committed - the sum of the
720 As allocations can't be over-committed, some configuration
725 "cpu.rt.max" hard-allocates realtime slices and is an example of this
733 ------
738 New-line separated values
746 (when read-only or multiple values can be written at once)
763 reading; however, controllers may allow omitting later fields or
772 -----------
774 - Settings for a single feature should be contained in a single file.
776 - The root cgroup should be exempt from resource control and thus
779 - The default time unit is microseconds. If a different unit is ever
782 - A parts-per quantity should use a percentage decimal with at least
783 two digit fractional part - e.g. 13.40.
785 - If a controller implements weight based resource distribution, its
791 - If a controller implements an absolute resource guarantee and/or
800 - If a setting has a configurable default value and keyed specific
814 # cat cgroup-example-interface-file
820 # echo 125 > cgroup-example-interface-file
824 # echo "default 125" > cgroup-example-interface-file
828 # echo "8:16 170" > cgroup-example-interface-file
832 # echo "8:0 default" > cgroup-example-interface-file
833 # cat cgroup-example-interface-file
837 - For events which are not very high frequency, an interface file
844 --------------------
849 A read-write single value file which exists on non-root
855 - "domain" : A normal valid domain cgroup.
857 - "domain threaded" : A threaded domain cgroup which is
860 - "domain invalid" : A cgroup which is in an invalid state.
861 It can't be populated or have controllers enabled. It may
864 - "threaded" : A threaded cgroup which is a member of a
871 A read-write new-line separated values file which exists on
875 the cgroup one-per-line. The PIDs are not ordered and the
884 - It must have write access to the "cgroup.procs" file.
886 - It must have write access to the "cgroup.procs" file of the
889 When delegating a sub-hierarchy, write access to this file
897 A read-write new-line separated values file which exists on
901 the cgroup one-per-line. The TIDs are not ordered and the
910 - It must have write access to the "cgroup.threads" file.
912 - The cgroup that the thread is currently in must be in the
915 - It must have write access to the "cgroup.procs" file of the
918 When delegating a sub-hierarchy, write access to this file
921 cgroup.controllers
922 A read-only space separated values file which exists on all
925 It shows a space separated list of all controllers available to
926 the cgroup. The controllers are not ordered.
929 A read-write space separated values file which exists on all
932 When read, it shows a space separated list of the controllers
936 Space separated list of controllers prefixed with '+' or '-'
937 can be written to enable or disable controllers. A controller
938 name prefixed with '+' enables the controller and '-'
944 A read-only flat-keyed file which exists on non-root cgroups.
956 A read-write single value file. The default is "max".
963 A read-write single value file. The default is "max".
970 A read-only flat-keyed file with the following entries:
988 Total number of live cgroup subsystems (e.g. memory
992 Total number of dying cgroup subsystems (e.g. memory
996 A read-write single value file which exists on non-root cgroups.
1019 create new sub-cgroups.
1022 A write-only single value file which exists in non-root cgroups.
1034 the whole thread-group.
1037 A read-write single value file whose allowed values are "0" and "1".
1041 Writing "1" to the file will re-enable the cgroup PSI accounting.
1049 This may cause non-negligible overhead for some workloads when under
1051 be used to disable PSI accounting in the non-leaf cgroups.
1054 A read-write nested-keyed file.
1059 Controllers
1062 .. _cgroup-v2-cpu:
1065 ---
1067 The "cpu" controller regulates distribution of CPU cycles. This
1096 A read-only flat-keyed file.
1101 - usage_usec
1102 - user_usec
1103 - system_usec
1107 - nr_periods
1108 - nr_throttled
1109 - throttled_usec
1110 - nr_bursts
1111 - burst_usec
1114 A read-write single value file which exists on non-root
1124 A read-write single value file which exists on non-root
1127 The nice value is in the range [-20, 19].
1136 A read-write two value file which exists on non-root cgroups.
1148 A read-write single value file which exists on non-root
1154 A read-write nested-keyed file.
1160 A read-write single value file which exists on non-root cgroups.
1175 A read-write single value file which exists on non-root cgroups.
1186 A read-write single value file which exists on non-root cgroups.
1189 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1197 Memory
1198 ------
1200 The "memory" controller regulates distribution of memory. Memory is
1202 intertwining between memory usage and reclaim pressure and the
1203 stateful nature of memory, the distribution model is relatively
1206 While not completely water-tight, all major memory usages by a given
1207 cgroup are tracked so that the total memory consumption can be
1209 following types of memory usages are tracked.
1211 - Userland memory - page cache and anonymous memory.
1213 - Kernel data structures such as dentries and inodes.
1215 - TCP socket buffers.
1220 Memory Interface Files
1223 All memory amounts are in bytes. If a value which is not aligned to
1227 memory.current
1228 A read-only single value file which exists on non-root
1231 The total amount of memory currently being used by the cgroup
1234 memory.min
1235 A read-write single value file which exists on non-root
1238 Hard memory protection. If the memory usage of a cgroup
1239 is within its effective min boundary, the cgroup's memory
1241 unprotected reclaimable memory available, OOM killer
1247 Effective min boundary is limited by memory.min values of
1248 all ancestor cgroups. If there is memory.min overcommitment
1249 (child cgroup or cgroups are requiring more protected memory
1252 actual memory usage below memory.min.
1254 Putting more memory than generally available under this
1257 If a memory cgroup is not populated with processes,
1258 its memory.min is ignored.
1260 memory.low
1261 A read-write single value file which exists on non-root
1264 Best-effort memory protection. If the memory usage of a
1266 memory won't be reclaimed unless there is no reclaimable
1267 memory available in unprotected cgroups.
1273 Effective low boundary is limited by memory.low values of
1274 all ancestor cgroups. If there is memory.low overcommitment
1275 (child cgroup or cgroups are requiring more protected memory
1278 actual memory usage below memory.low.
1280 Putting more memory than generally available under this
1283 memory.high
1284 A read-write single value file which exists on non-root
1287 Memory usage throttle limit. If a cgroup's usage goes
1297 memory.max
1298 A read-write single value file which exists on non-root
1301 Memory usage hard limit. This is the main mechanism to limit
1302 memory usage of a cgroup. If a cgroup's memory usage reaches
1307 In default configuration regular 0-order allocations always
1312 as -ENOMEM or silently ignore in cases like disk readahead.
1314 memory.reclaim
1315 A write-only nested-keyed file which exists for all cgroups.
1317 This is a simple interface to trigger memory reclaim in the
1322 echo "1G" > memory.reclaim
1326 specified amount, -EAGAIN is returned.
1329 interface) is not meant to indicate memory pressure on the
1330 memory cgroup. Therefore socket memory balancing triggered by
1331 the memory reclaim normally is not exercised in this case.
1333 reclaim induced by memory.reclaim.
1346 memory.peak
1347 A read-write single value file which exists on non-root cgroups.
1349 The max memory usage recorded for the cgroup and its descendants since
1352 A write of any non-empty string to this file resets it to the
1353 current memory usage for subsequent reads through the same
1356 memory.oom.group
1357 A read-write single value file which exists on non-root
1363 (if the memory cgroup is not a leaf cgroup) are killed
1367 Tasks with the OOM protection (oom_score_adj set to -1000)
1372 memory.oom.group values of ancestor cgroups.
1374 memory.events
1375 A read-only flat-keyed file which exists on non-root cgroups.
1383 memory.events.local.
1387 high memory pressure even though its usage is under
1389 boundary is over-committed.
1393 throttled and routed to perform direct memory reclaim
1394 because the high memory boundary was exceeded. For a
1395 cgroup whose memory usage is capped by the high limit
1396 rather than global memory pressure, this event's
1400 The number of times the cgroup's memory usage was
1405 The number of times the cgroup's memory usage was
1409 considered as an option, e.g. for failed high-order
1419 memory.events.local
1420 Similar to memory.events but the fields in the file are local
1424 memory.stat
1425 A read-only flat-keyed file which exists on non-root cgroups.
1427 This breaks down the cgroup's memory footprint into different
1428 types of memory, type-specific details, and other information
1429 on the state and past events of the memory management system.
1431 All memory amounts are in bytes.
1437 If the entry has no per-node counter (or is not shown in
1438 memory.numa_stat), we use 'npn' (non-per-node) as the tag
1439 to indicate that it will not show in memory.numa_stat.
1442 Amount of memory used in anonymous mappings such as
1446 Amount of memory used to cache filesystem data,
1447 including tmpfs and shared memory.
1450 Amount of total kernel memory, including
1452 addition to other kernel memory use cases.
1455 Amount of memory allocated to kernel stacks.
1458 Amount of memory allocated for page tables.
1461 Amount of memory allocated for secondary page tables,
1466 Amount of memory used for storing per-cpu kernel
1470 Amount of memory used in network transmission buffers
1473 Amount of memory used for vmap backed memory.
1476 Amount of cached filesystem data that is swap-backed,
1480 Amount of memory consumed by the zswap compression backend.
1483 Amount of application memory swapped out to zswap.
1497 Amount of swap cached in memory. The swapcache is accounted
1498 against both memory and swap usage.
1501 Amount of memory used in anonymous mappings backed by
1513 Amount of memory, swap-backed and filesystem-backed,
1514 on the internal memory management lists used by the
1518 memory management lists), inactive_foo + active_foo may not be equal to
1519 the value for the foo counter, since the foo counter is type-based, not
1520 list-based.
1527 Part of "slab" that cannot be reclaimed on memory
1531 Amount of memory used for storing in-kernel data
1598 Amount of pages postponed to be freed under memory pressure
1604 Number of pages swapped into memory and filled with zero, where I/O
1609 Number of zero-filled pages swapped out with I/O skipped due to the
1613 Number of pages moved in to memory from zswap.
1616 Number of pages moved out of memory to zswap.
1660 Amount of memory used by hugetlb pages. This metric only shows
1661 up if hugetlb usage is accounted for in memory.current (i.e.
1664 memory.numa_stat
1665 A read-only nested-keyed file which exists on non-root cgroups.
1667 This breaks down the cgroup's memory footprint into different
1668 types of memory, type-specific details, and other information
1669 per node on the state of the memory management system.
1677 All memory amounts are in bytes.
1679 The output format of memory.numa_stat is::
1687 For the meaning of the entries, refer to memory.stat.
1689 memory.swap.current
1690 A read-only single value file which exists on non-root
1696 memory.swap.high
1697 A read-write single value file which exists on non-root
1702 allow userspace to implement custom out-of-memory procedures.
1706 during regular operation. Compare to memory.swap.max, which
1708 continue unimpeded as long as other memory can be reclaimed.
1712 memory.swap.peak
1713 A read-write single value file which exists on non-root cgroups.
1718 A write of any non-empty string to this file resets it to the
1719 current memory usage for subsequent reads through the same
1722 memory.swap.max
1723 A read-write single value file which exists on non-root
1727 limit, anonymous memory of the cgroup will not be swapped out.
1729 memory.swap.events
1730 A read-only flat-keyed file which exists on non-root cgroups.
1746 because of running out of swap system-wide or max
1752 reduces the impact on the workload and memory management.
1754 memory.zswap.current
1755 A read-only single value file which exists on non-root
1758 The total amount of memory consumed by the zswap compression
1761 memory.zswap.max
1762 A read-write single value file which exists on non-root
1769 memory.zswap.writeback
1770 A read-write single value file. The default value is "1".
1782 Note that this is subtly different from setting memory.swap.max to
1785 is allowed unless memory.swap.max is set to 0.
1787 memory.pressure
1788 A read-only nested-keyed file.
1790 Shows pressure stall information for memory. See
1797 "memory.high" is the main mechanism to control memory usage.
1798 Over-committing on high limit (sum of high limits > available memory)
1799 and letting global memory pressure distribute memory according to
1805 more memory or terminating the workload.
1807 Determining whether a cgroup has enough memory is not trivial as
1808 memory usage doesn't indicate whether the workload can benefit from
1809 more memory. For example, a workload which writes data received from
1810 network to a file can use all available memory but can also perform
1811 equally well with a small amount of memory. A measure of memory
1812 pressure - how much the workload is being impacted due to lack of
1813 memory - is necessary to determine whether a workload needs more
1814 memory; unfortunately, memory pressure monitoring mechanism isn't
1818 Memory Ownership
1821 A memory area is charged to the cgroup which instantiated it and stays
1823 to a different cgroup doesn't move the memory usages that it
1826 A memory area may be used by processes belonging to different cgroups.
1827 To which cgroup the area will be charged is non-deterministic; however,
1828 over time, the memory area is likely to end up in a cgroup which has
1829 enough memory allowance to avoid high reclaim pressure.
1831 If a cgroup sweeps a considerable amount of memory which is expected
1833 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1834 belonging to the affected files to ensure correct memory ownership.
1838 --
1843 only if cfq-iosched is in use and neither scheme is available for
1844 blk-mq devices.
1851 A read-only nested-keyed file.
1871 A read-write nested-keyed file which exists only on the root
1883 enable Weight-based control enable
1915 devices which show wide temporary behavior changes - e.g. a
1926 A read-write nested-keyed file which exists only on the root
1939 model The cost model in use - "linear"
1965 generate device-specific coefficients.
1968 A read-write flat-keyed file which exists on non-root cgroups.
1988 A read-write nested-keyed file which exists on non-root
2002 When writing, any number of nested key-value pairs can be
2027 A read-only nested-keyed file.
2038 mechanism. Writeback sits between the memory and IO domains and
2039 regulates the proportion of dirty memory by balancing dirtying and
2042 The io controller, in conjunction with the memory controller,
2043 implements control of page cache writeback IOs. The memory controller
2044 defines the memory domain that dirty memory ratio is calculated and
2046 writes out dirty pages for the memory domain. Both system-wide and
2047 per-cgroup dirty memory states are examined and the more restrictive
2055 There are inherent differences in memory and writeback management
2056 which affect how cgroup ownership is tracked. Memory is tracked per
2061 As cgroup ownership for memory is tracked per page, there can be pages
2073 As the memory controller assigns page ownership on the first use and
2084 amount of available memory capped by limits imposed by the
2085 memory controller and system-wide clean memory.
2089 total available memory and applied the same way as
2118 your real setting, setting at 10-15% higher than the value in io.stat.
2128 - Queue depth throttling. This is the number of outstanding IO's a group is
2132 - Artificial delay induction. There are certain types of IO that cannot be
2150 This takes a similar format as the other controllers.
2179 no-change
2182 promote-to-rt
2183 For requests that have a non-RT I/O priority class, change it into RT.
2187 restrict-to-be
2197 none-to-rt
2198 Deprecated. Just an alias for promote-to-rt.
2202 +----------------+---+
2203 | no-change | 0 |
2204 +----------------+---+
2205 | promote-to-rt | 1 |
2206 +----------------+---+
2207 | restrict-to-be | 2 |
2208 +----------------+---+
2210 +----------------+---+
2214 +-------------------------------+---+
2216 +-------------------------------+---+
2217 | IOPRIO_CLASS_RT (real-time) | 1 |
2218 +-------------------------------+---+
2220 +-------------------------------+---+
2222 +-------------------------------+---+
2226 - If I/O priority class policy is promote-to-rt, change the request I/O
2229 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2235 ---
2242 controllers cannot prevent, thus warranting its own controller. For
2244 hitting memory restrictions.
2254 A read-write single value file which exists on non-root
2260 A read-only single value file which exists on non-root cgroups.
2266 A read-only single value file which exists on non-root cgroups.
2272 A read-only flat-keyed file which exists on non-root cgroups. Unless
2290 through fork() or clone(). These will return -EAGAIN if the creation
2295 ------
2298 the CPU and memory node placement of tasks to only the resources
2302 memory placement to reduce cross-node memory access and contention
2306 cannot use CPUs or memory nodes not allowed in its parent.
2313 A read-write multiple values file which exists on non-root
2314 cpuset-enabled cgroups.
2321 The CPU numbers are comma-separated numbers or ranges.
2325 0-4,6,8-10
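The list format shown above (comma-separated numbers or ranges) is straightforward to expand in userspace. Below is a minimal Python sketch that turns a string such as ``0-4,6,8-10`` into individual CPU numbers. The function name ``parse_cpu_list`` is hypothetical, and the sketch assumes well-formed input with ascending ranges.

```python
def parse_cpu_list(text):
    """Expand a comma-separated list of numbers or ranges into integers.

    Hypothetical helper illustrating the "0-4,6,8-10" notation.
    """
    cpus = []
    for part in text.strip().split(","):
        if not part:
            # An empty value (e.g. an unset list) yields no entries.
            continue
        if "-" in part:
            # "first-last" is an inclusive range.
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus
```

The same expansion applies to memory node lists such as ``0-1,3``, which use the identical notation.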
2328 setting as the nearest cgroup ancestor with a non-empty
2335 A read-only multiple values file which exists on all
2336 cpuset-enabled cgroups.
2352 A read-write multiple values file which exists on non-root
2353 cpuset-enabled cgroups.
2355 It lists the requested memory nodes to be used by tasks within
2356 this cgroup. The actual list of memory nodes granted, however,
2358 from the requested memory nodes.
2360 The memory node numbers are comma-separated numbers or ranges.
2364 0-1,3
2367 setting as the nearest cgroup ancestor with a non-empty
2368 "cpuset.mems" or all the available memory nodes if none
2372 and won't be affected by any memory nodes hotplug events.
2374 Setting a non-empty value to "cpuset.mems" causes memory of
2376 they are currently using memory outside of the designated nodes.
2378 There is a cost for this memory migration. The migration
2379 may not be complete and some memory pages may be left behind.
2386 A read-only multiple values file which exists on all
2387 cpuset-enabled cgroups.
2389 It lists the onlined memory nodes that are actually granted to
2390 this cgroup by its parent. These memory nodes are allowed to
2393 If "cpuset.mems" is empty, it shows all the memory nodes from the
2396 the memory nodes listed in "cpuset.mems" can be granted. In this
2399 Its value will be affected by memory nodes hotplug events.
2402 A read-write multiple values file which exists on non-root
2403 cpuset-enabled cgroups.
2436 A read-only multiple values file which exists on all non-root
2437 cpuset-enabled cgroups.
2449 A read-only and root cgroup only multiple values file.
2456 A read-write single value file which exists on non-root
2457 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2463 "member" Non-root member of a partition
2468 A cpuset partition is a collection of cpuset-enabled cgroups with
2475 There are two types of partitions - local and remote. A local
2491 be changed. All other non-root cgroups start out as "member".
2504 two possible states - valid or invalid. An invalid partition
2515 "member" Non-root member of a partition
2542 A valid non-root parent partition may distribute out all its CPUs
2561 A user can pre-configure certain CPUs to an isolated state
2568 -----------------
2579 on the return value the attempt will succeed or fail with -EPERM.
2584 If the program returns 0, the attempt fails with -EPERM, otherwise it
2592 ----
2601 A read-write nested-keyed file that exists for all the cgroups
2622 A read-only file that describes current resource usage.
2631 ----
2634 device memory regions. Because each memory region may have its own page size,
2641 A read-write nested-keyed file that exists for all the cgroups
2650 The semantics are the same as for the memory cgroup controller, and are
2654 A read-only file that describes maximum region capacity.
2655 It only exists on the root cgroup. Not all memory can be
2665 A read-only file that describes current resource usage.
2674 -------
2691 A read-only flat-keyed file which exists on non-root cgroups.
2702 Similar to memory.numa_stat, it shows the numa information of the
2704 use hugetlb pages are included. The per-node values are in bytes.
2707 ----
2729 A read-only flat-keyed file shown only in the root cgroup. It shows
2738 A read-only flat-keyed file shown in all cgroups. It shows
2746 A read-only flat-keyed file shown in all cgroups. It shows the
2755 A read-write flat-keyed file shown in non-root cgroups. Allowed
2774 A read-only flat-keyed file which exists on non-root cgroups. The
2797 ------
2808 Non-normative information
2809 -------------------------
2825 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2841 ------
2860 The path '/batchjobs/container_id1' can be considered as system-data
2865 # ls -l /proc/self/ns/cgroup
2866 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2872 # ls -l /proc/self/ns/cgroup
2873 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2877 When some thread from a multi-threaded process unshares its cgroup
2889 ------------------
2900 # ~/unshare -c # unshare cgroupns in some cgroup
2908 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2939 ----------------------
2968 ---------------------------------
2971 running inside a non-init cgroup namespace::
2973 # mount -t cgroup2 none $MOUNT_POINT
2980 the view of cgroup hierarchy by namespace-private cgroupfs mount
2989 controllers are not covered.
2993 --------------------------------
2996 address_space_operations->writepage[s]() to annotate bio's using the
3013 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
3030 - Multiple hierarchies including named ones are not supported.
3032 - None of the v1 mount options are supported.
3034 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3036 - "cgroup.clone_children" is removed.
3038 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3046 --------------------
3049 hierarchy could host any number of controllers. While this seemed to
3053 type controllers such as freezer which can be useful in all
3055 the fact that controllers couldn't be moved to another hierarchy once
3056 hierarchies were populated. Another issue was that all controllers
3061 In practice, these issues heavily limited which controllers could be
3064 as the cpu and cpuacct controllers, made sense to be put on the same
3072 used in general and what controllers were able to do.
3078 addition of controllers which existed only to identify membership,
3083 topologies of hierarchies other controllers might be on, each
3084 controller had to assume that all other controllers were attached to
3086 least very cumbersome, for controllers to cooperate with each other.
3088 In most use cases, putting controllers on hierarchies which are
3093 controllers. For example, a given configuration might not care about
3094 how memory is distributed beyond a certain level while still wanting
3099 ------------------
3102 This didn't make sense for some controllers and those controllers
3107 Generally, in-process knowledge is available only to the process
3108 itself; thus, unlike service-level organization of processes,
3115 sub-hierarchies and control resource distributions along them. This
3116 effectively raised cgroup to the status of a syscall-like API exposed
3126 that the process would actually be operating on its own sub-hierarchy.
3128 cgroup controllers implemented a number of knobs which would never be
3130 system-management pseudo filesystem. cgroup ended up with interface
3133 individual applications through the ill-defined delegation mechanism
3143 -------------------------------------------
3149 settle it. Different controllers did different things.
3154 cycles and the number of internal threads fluctuated - the ratios
3168 The memory controller didn't have a way to control what happened
3170 clearly defined. There were attempts to add ad-hoc behaviors and
3174 Multiple controllers struggled with internal tasks and came up with
3184 ----------------------
3188 was how an empty cgroup was notified - a userland helper binary was
3191 to an in-kernel event delivery filtering mechanism, further complicating
3195 controllers completely ignoring hierarchical organization and treating
3197 cgroup. Some controllers exposed a large amount of inconsistent
3200 There also was no consistency across controllers. When a new cgroup
3201 was created, some controllers defaulted to not imposing extra
3209 controllers so that they expose minimal and consistent interfaces.
3213 ------------------------------
3215 Memory subsection
3220 global reclaim prefers is opt-in, rather than opt-out. The costs for
3230 becomes self-defeating.
3232 The memory.low boundary on the other hand is a top-down allocated
3241 available memory. The memory consumption of workloads varies during
3249 The memory.high boundary on the other hand can be set much more
3255 and make corrections until the minimal memory footprint that still
3262 system than killing the group. Otherwise, memory.max is there to
3266 Setting the original memory.limit_in_bytes below the current usage was
3268 limit setting to fail. memory.max on the other hand will first set the
3270 new limit is met - or the task writing to memory.max is killed.
3272 The combined memory+swap accounting and limiting is replaced by real
3275 The main argument for a combined memory+swap facility in the original
3277 able to swap all anonymous memory of a child group, regardless of the
3279 groups can sabotage swapping by other means - such as referencing its
3280 anonymous memory in a tight loop - and an admin cannot assume full
3285 that cgroup controllers should account and limit specific physical