Lines Matching +full:protection +full:- +full:domain

1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. DMEM
68 5-9. HugeTLB
69 5-9-1. HugeTLB Interface Files
70 5-10. Misc
71 5-10-1. Miscellaneous cgroup Interface Files
72 5-10-2. Migration and Ownership
73 5-11. Others
74 5-11-1. perf_event
75 5-N. Non-normative information
76 5-N-1. CPU controller root cgroup process behaviour
77 5-N-2. IO controller root cgroup process behaviour
79 6-1. Basics
80 6-2. The Root and Views
81 6-3. Migration and setns(2)
82 6-4. Interaction with Other Namespaces
84 P-1. Filesystem Support for Writeback
87 R-1. Multiple Hierarchies
88 R-2. Thread Granularity
89 R-3. Competition Between Inner Nodes and Threads
90 R-4. Other Interface Issues
91 R-5. Controller Issues and Remedies
92 R-5-1. Memory
99 -----------
108 ---------------
114 cgroup is largely composed of two parts - the core and controllers.
130 hierarchical - if a controller is enabled on a cgroup, it affects all
132 sub-hierarchy of the cgroup. When a controller is enabled on a nested
142 --------
147 # mount -t cgroup2 none $MOUNT_POINT
157 is no longer referenced in its current hierarchy. Because per-cgroup
164 to inter-controller dependencies, other controllers may need to be
185 ignored on non-init namespace mounts. Please refer to the
202 option is ignored on non-init namespace mounts.
205 Recursively apply memory.min and memory.low protection to
210 behavior but is a mount-option to avoid regressing setups
212 high 'bypass' protection values at higher tree levels).
224 controller. The pre-allocated pool does not belong to anyone.
237 memory protection and reclaim dynamics. Any userspace tuning
244 The option restores v1-like behavior of pids.events:max, that is only
252 --------------------------------
258 A child cgroup can be created by creating a sub-directory::
263 structure. Each cgroup has a read-writable interface file
265 belong to the cgroup one-per-line. The PIDs are not ordered and the
296 0::/test-cgroup/test-cgroup-nested
303 0::/test-cgroup/test-cgroup-nested (deleted)
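The two sample entries above follow the cgroup v2 form ``0::<path>``, with a trailing " (deleted)" marker in the second one. A minimal sketch of splitting such an entry, assuming that layout; ``parse_cgroup_line`` is a hypothetical helper name, not an existing tool, and it works on a string rather than reading ``/proc`` itself:

```shell
# parse_cgroup_line LINE
# Prints "<state> <path>" for a cgroup v2 entry of the form
# "0::<path>" or "0::<path> (deleted)". Hypothetical helper.
parse_cgroup_line() {
    local line="$1" path state="live"
    case "$line" in
        0::*) path=${line#0::} ;;                  # strip the "0::" prefix
        *)    echo "not a cgroup v2 entry: $line" >&2; return 1 ;;
    esac
    case "$path" in
        *" (deleted)")
            path=${path%" (deleted)"}              # strip the marker
            state="deleted" ;;
    esac
    printf '%s %s\n' "$state" "$path"
}

parse_cgroup_line "0::/test-cgroup/test-cgroup-nested (deleted)"
# prints: deleted /test-cgroup/test-cgroup-nested
```

Printing the state first keeps the output unambiguous even if the path itself contains spaces.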
313 domain to host resource consumptions which are not specific to a
315 a subtree while still maintaining the common resource domain for them.
318 The ones which don't are called domain controllers.
320 Marking a cgroup threaded makes it join the resource domain of its
322 cgroup whose resource domain is further up in the hierarchy. The root
324 threaded, is called threaded domain or thread root interchangeably and
325 serves as the resource domain for the entire subtree.
329 constraint - threaded controllers can be enabled on non-leaf cgroups
332 As the threaded domain cgroup hosts all the domain resource
337 serve both as a threaded domain and a parent to domain cgroups.
341 domain, a domain which is serving as the domain of a threaded subtree,
344 On creation, a cgroup is always a domain cgroup and can be made
350 Once threaded, the cgroup can't be made a domain again. To enable the
353 - As the cgroup will join the parent's resource domain, the parent
354 must either be a valid (threaded) domain or a threaded cgroup.
356 - When the parent is an unthreaded domain, it must not have any domain
357 controllers enabled or populated domain children. The root is
360 Topology-wise, a cgroup can be in an invalid state. Please consider
363 A (threaded domain) - B (threaded) - C (domain, just created)
365 C is created as a domain but isn't connected to a parent which can
367 threaded cgroup. "cgroup.type" file will report "domain (invalid)" in
371 A domain cgroup is turned into a threaded domain when one of its child
374 A threaded domain reverts to a normal domain when the conditions
378 threads in the cgroup. Except that the operations are per-thread
379 instead of per-process, "cgroup.threads" has the same format and
382 threaded domain, its operations are confined inside each threaded
385 The threaded domain cgroup serves as the resource domain for the whole
387 all the processes are considered to be in the threaded domain cgroup.
388 "cgroup.procs" in a threaded domain cgroup contains the PIDs of all
397 aren't tied to a specific thread belong to the threaded domain cgroup.
401 between threads in a non-leaf cgroup and its child cgroups. Each
407 - cpu
408 - cpuset
409 - perf_event
410 - pids
413 --------------------------
415 Each non-root cgroup has a "cgroup.events" file which contains
416 "populated" field indicating whether the cgroup's sub-hierarchy has
420 example, to start a clean-up operation after all processes of a given
421 sub-hierarchy have exited. The populated state updates and
422 notifications are recursive. Consider the following sub-hierarchy
426 A(4) - B(0) - C(1)
436 -----------------------
459 # echo "+cpu +memory -io" > cgroup.subtree_control
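The effect of a write such as the one above can be illustrated without touching a real hierarchy. The following is only a sketch of the '+'/'-' token semantics: ``apply_subtree_control`` is a hypothetical helper, the starting set "io pids" is an assumed example, and the kernel's own validation against the controllers available to the cgroup is deliberately ignored:

```shell
# apply_subtree_control ENABLED OPS
#   ENABLED: currently enabled controllers, space separated (assumed input)
#   OPS:     the string that would be written to cgroup.subtree_control
# Simulation only; no cgroup file is read or written.
apply_subtree_control() {
    local enabled="$1" op name c kept
    for op in $2; do
        name=${op#[+-]}
        case "$op" in
            +*|-*) ;;
            *) echo "bad token: $op" >&2; return 1 ;;
        esac
        # Drop any existing entry for this controller ...
        kept=""
        for c in $enabled; do
            [ "$c" = "$name" ] || kept="$kept${kept:+ }$c"
        done
        # ... and re-add it only when the token enables it.
        case "$op" in
            +*) enabled="$kept${kept:+ }$name" ;;
            -*) enabled="$kept" ;;
        esac
    done
    printf '%s\n' "$enabled"
}

apply_subtree_control "io pids" "+cpu +memory -io"
# prints: pids cpu memory
```

Tokens are applied left to right, so when a controller is named more than once the last token decides its final state.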
468 Consider the following sub-hierarchy. The enabled controllers are
471 A(cpu,memory) - B(memory) - C()
485 controller interface files - anything which doesn't start with
489 Top-down Constraint
492 Resources are distributed top-down and a cgroup can further distribute
494 parent. This means that all non-root "cgroup.subtree_control" files
504 Non-root cgroups can distribute domain resources to their children
506 only domain cgroups which don't contain any processes can have domain
509 This guarantees that, when a domain controller is looking at the part
519 refer to the Non-normative information section in the Controllers
532 ----------
554 delegated, the user can build sub-hierarchy under the directory,
558 happens in the delegated sub-hierarchy, nothing can escape the
562 cgroups in or nesting depth of a delegated sub-hierarchy; however,
569 A delegated sub-hierarchy is contained in the sense that processes
570 can't be moved into or out of the sub-hierarchy by the delegatee.
573 requiring the following conditions for a process with a non-root euid
577 - The writer must have write access to the "cgroup.procs" file.
579 - The writer must have write access to the "cgroup.procs" file of the
583 processes around freely in the delegated sub-hierarchy it can't pull
584 in from or push out to outside the sub-hierarchy.
590 ~~~~~~~~~~~~~ - C0 - C00
593 ~~~~~~~~~~~~~ - C1 - C10
600 will be denied with -EACCES.
605 is not reachable, the migration is rejected with -ENOENT.
609 ----------
617 inherent trade-offs between migration and various hot paths in terms
623 resource structure once on start-up. Dynamic adjustments to resource
656 -------
662 work-conserving. Due to the dynamic nature, this model is usually
677 .. _cgroupv2-limits-distributor:
680 ------
683 Limits can be over-committed - the sum of the limits of children can
688 As limits can be over-committed, all configuration combinations are
695 .. _cgroupv2-protections-distributor:
698 -----------
703 soft boundaries. Protections can also be over-committed in which case
710 As protections can be over-committed, all configuration combinations
714 "memory.low" implements best-effort memory protection and is an
719 -----------
722 resource. Allocations can't be over-committed - the sum of the
729 As allocations can't be over-committed, some configuration
734 "cpu.rt.max" hard-allocates realtime slices and is an example of this
742 ------
747 New-line separated values
755 (when read-only or multiple values can be written at once)
781 -----------
783 - Settings for a single feature should be contained in a single file.
785 - The root cgroup should be exempt from resource control and thus
788 - The default time unit is microseconds. If a different unit is ever
791 - A parts-per quantity should use a percentage decimal with at least
792 two digit fractional part - e.g. 13.40.
794 - If a controller implements weight based resource distribution, its
800 - If a controller implements an absolute resource guarantee and/or
809 - If a setting has a configurable default value and keyed specific
823 # cat cgroup-example-interface-file
829 # echo 125 > cgroup-example-interface-file
833 # echo "default 125" > cgroup-example-interface-file
837 # echo "8:16 170" > cgroup-example-interface-file
841 # echo "8:0 default" > cgroup-example-interface-file
842 # cat cgroup-example-interface-file
846 - For events which are not very high frequency, an interface file
853 --------------------
858 A read-write single value file which exists on non-root
864 - "domain" : A normal valid domain cgroup.
866 - "domain threaded" : A threaded domain cgroup which is
869 - "domain invalid" : A cgroup which is in an invalid state.
873 - "threaded" : A threaded cgroup which is a member of a
880 A read-write new-line separated values file which exists on
884 the cgroup one-per-line. The PIDs are not ordered and the
893 - It must have write access to the "cgroup.procs" file.
895 - It must have write access to the "cgroup.procs" file of the
898 When delegating a sub-hierarchy, write access to this file
906 A read-write new-line separated values file which exists on
910 the cgroup one-per-line. The TIDs are not ordered and the
919 - It must have write access to the "cgroup.threads" file.
921 - The cgroup that the thread is currently in must be in the
922 same resource domain as the destination cgroup.
924 - It must have write access to the "cgroup.procs" file of the
927 When delegating a sub-hierarchy, write access to this file
931 A read-only space separated values file which exists on all
938 A read-write space separated values file which exists on all
945 Space separated list of controllers prefixed with '+' or '-'
947 name prefixed with '+' enables the controller and '-'
953 A read-only flat-keyed file which exists on non-root cgroups.
965 A read-write single value file. The default is "max".
972 A read-write single value file. The default is "max".
979 A read-only flat-keyed file with the following entries:
1005 A read-write single value file which exists on non-root cgroups.
1028 create new sub-cgroups.
1031 A write-only single value file which exists in non-root cgroups.
1043 the whole thread-group.
1046 A read-write single value file whose allowed values are "0" and "1".
1050 Writing "1" to the file will re-enable the cgroup PSI accounting.
1058 This may cause non-negligible overhead for some workloads when under
1060 be used to disable PSI accounting in the non-leaf cgroups.
1063 A read-write nested-keyed file.
1071 .. _cgroup-v2-cpu:
1074 ---
1092 management software may already have placed RT processes into non-root cgroups
1111 * Processes under the fair-class scheduler
1116 For details on when a process is under the fair-class scheduler or a BPF scheduler,
1117 check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`.
1123 A read-only flat-keyed file.
1129 - usage_usec
1130 - user_usec
1131 - system_usec
1134 only the processes under the fair-class scheduler:
1136 - nr_periods
1137 - nr_throttled
1138 - throttled_usec
1139 - nr_bursts
1140 - burst_usec
1143 A read-write single value file which exists on non-root
1152 This file affects only processes under the fair-class scheduler and a BPF
1157 A read-write single value file which exists on non-root
1160 The nice value is in the range [-20, 19].
1168 This file affects only processes under the fair-class scheduler and a BPF
1173 A read-write two value file which exists on non-root cgroups.
1184 This file affects only processes under the fair-class scheduler.
1187 A read-write single value file which exists on non-root
1192 This file affects only processes under the fair-class scheduler.
1195 A read-write nested-keyed file.
1203 A read-write single value file which exists on non-root cgroups.
1206 The requested minimum utilization (protection) as a percentage
1214 The requested minimum utilization (protection) is always capped by
1221 A read-write single value file which exists on non-root cgroups.
1235 A read-write single value file which exists on non-root cgroups.
1238 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1244 This file affects only processes under the fair-class scheduler.
1247 ------
1250 stateful and implements both limit and protection models. Due to the
1255 While not completely water-tight, all major memory usages by a given
1260 - Userland memory - page cache and anonymous memory.
1262 - Kernel data structures such as dentries and inodes.
1264 - TCP socket buffers.
1277 A read-only single value file which exists on non-root
1284 A read-write single value file which exists on non-root
1287 Hard memory protection. If the memory usage of a cgroup
1300 the part of parent's protection proportional to its
1304 protection is discouraged and may lead to constant OOMs.
1310 A read-write single value file which exists on non-root
1313 Best-effort memory protection. If the memory usage of a
1326 the part of parent's protection proportional to its
1330 protection is discouraged.
1333 A read-write single value file which exists on non-root
1356 busy-hitting its memory to slow down reclaim.
1359 A read-write single value file which exists on non-root
1368 In default configuration regular 0-order allocations always
1373 as -ENOMEM or silently ignore in cases like disk readahead.
1376 reclaim and oom-kill are bypassed. This is useful for admin
1379 The job will trigger the reclaim and/or oom-kill on its next
1385 busy-hitting its memory to slow down reclaim.
1388 A write-only nested-keyed file which exists for all cgroups.
1399 specified amount, -EAGAIN is returned.
1419 The valid range for swappiness is [0-200, max], setting
1423 A read-write single value file which exists on non-root cgroups.
1428 A write of any non-empty string to this file resets it to the
1433 A read-write single value file which exists on non-root
1443 Tasks with the OOM protection (oom_score_adj set to -1000)
1451 A read-only flat-keyed file which exists on non-root cgroups.
1465 boundary is over-committed.
1485 considered as an option, e.g. for failed high-order
1501 A read-only flat-keyed file which exists on non-root cgroups.
1504 types of memory, type-specific details, and other information
1513 If the entry has no per-node counter (or is not shown in the
1514 memory.numa_stat). We use 'npn' (non-per-node) as the tag
1545 Amount of memory used for storing per-cpu kernel
1555 Amount of cached filesystem data that is swap-backed,
1595 Amount of memory, swap-backed and filesystem-backed,
1601 the value for the foo counter, since the foo counter is type-based, not
1602 list-based.
1613 Amount of memory used for storing in-kernel data
1703 Number of zero-filled pages swapped out with I/O skipped due to the
1762 A read-only nested-keyed file which exists on non-root cgroups.
1765 types of memory, type-specific details, and other information
1787 A read-only single value file which exists on non-root
1794 A read-write single value file which exists on non-root
1799 allow userspace to implement custom out-of-memory procedures.
1810 A read-write single value file which exists on non-root cgroups.
1815 A write of any non-empty string to this file resets it to the
1820 A read-write single value file which exists on non-root
1827 A read-only flat-keyed file which exists on non-root cgroups.
1843 because of running out of swap system-wide or max
1852 A read-only single value file which exists on non-root
1859 A read-write single value file which exists on non-root
1867 A read-write single value file. The default value is "1".
1885 A read-only nested-keyed file.
1895 Over-committing on high limit (sum of high limits > available memory)
1909 pressure - how much the workload is being impacted due to lack of
1910 memory - is necessary to determine whether a workload needs more
1924 To which cgroup the area will be charged is non-deterministic; however,
1935 --
1940 only if cfq-iosched is in use and neither scheme is available for
1941 blk-mq devices.
1948 A read-only nested-keyed file.
1968 A read-write nested-keyed file which exists only on the root
1980 enable Weight-based control enable
2012 devices which show wide temporary behavior changes - e.g. a
2023 A read-write nested-keyed file which exists only on the root
2036 model The cost model in use - "linear"
2062 generate device-specific coefficients.
2065 A read-write flat-keyed file which exists on non-root cgroups.
2085 A read-write nested-keyed file which exists on non-root
2099 When writing, any number of nested key-value pairs can be
2124 A read-only nested-keyed file.
2141 defines the memory domain that dirty memory ratio is calculated and
2142 maintained for and the io controller defines the io domain which
2143 writes out dirty pages for the memory domain. Both system-wide and
2144 per-cgroup dirty memory states are examined and the more restrictive
2182 memory controller and system-wide clean memory.
2193 This is a cgroup v2 controller for IO workload protection. You provide a group
2215 your real setting, setting at 10-15% higher than the value in io.stat.
2225 - Queue depth throttling. This is the number of outstanding IOs a group is
2229 - Artificial delay induction. There are certain types of IO that cannot be
2276 no-change
2279 promote-to-rt
2280 For requests that have a non-RT I/O priority class, change it into RT.
2284 restrict-to-be
2294 none-to-rt
2295 Deprecated. Just an alias for promote-to-rt.
2299 +----------------+---+
2300 | no-change | 0 |
2301 +----------------+---+
2302 | promote-to-rt | 1 |
2303 +----------------+---+
2304 | restrict-to-be | 2 |
2305 +----------------+---+
2307 +----------------+---+
2311 +-------------------------------+---+
2313 +-------------------------------+---+
2314 | IOPRIO_CLASS_RT (real-time) | 1 |
2315 +-------------------------------+---+
2317 +-------------------------------+---+
2319 +-------------------------------+---+
2323 - If I/O priority class policy is promote-to-rt, change the request I/O
2326 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2332 ---
2351 A read-write single value file which exists on non-root
2357 A read-only single value file which exists on non-root cgroups.
2363 A read-only single value file which exists on non-root cgroups.
2369 A read-only flat-keyed file which exists on non-root cgroups. Unless
2387 through fork() or clone(). These will return -EAGAIN if the creation
2392 ------
2399 memory placement to reduce cross-node memory access and contention
2410 A read-write multiple values file which exists on non-root
2411 cpuset-enabled cgroups.
2418 The CPU numbers are comma-separated numbers or ranges.
2422 0-4,6,8-10
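The list format shown above (comma-separated numbers or ranges) can be expanded with a few lines of shell. This is a minimal sketch, assuming well-formed input; ``expand_cpu_list`` is a hypothetical helper name and performs no validation against the CPUs actually present:

```shell
# expand_cpu_list LIST
# Expands a cpuset-style list such as "0-4,6,8-10" into
# space-separated CPU numbers. Hypothetical helper.
expand_cpu_list() {
    local out="" part lo hi
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case "$part" in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # range "first-last"
            *)   lo=$part; hi=$part ;;             # single CPU number
        esac
        while [ "$lo" -le "$hi" ]; do
            out="$out${out:+ }$lo"
            lo=$((lo + 1))
        done
    done
    printf '%s\n' "$out"
}

expand_cpu_list "0-4,6,8-10"
# prints: 0 1 2 3 4 6 8 9 10
```

The same expansion applies to the memory node lists, which use the same number-or-range syntax.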
2425 setting as the nearest cgroup ancestor with a non-empty
2432 A read-only multiple values file which exists on all
2433 cpuset-enabled cgroups.
2449 A read-write multiple values file which exists on non-root
2450 cpuset-enabled cgroups.
2457 The memory node numbers are comma-separated numbers or ranges.
2461 0-1,3
2464 setting as the nearest cgroup ancestor with a non-empty
2471 Setting a non-empty value to "cpuset.mems" causes memory of
2483 A read-only multiple values file which exists on all
2484 cpuset-enabled cgroups.
2499 A read-write multiple values file which exists on non-root
2500 cpuset-enabled cgroups.
2533 A read-only multiple values file which exists on all non-root
2534 cpuset-enabled cgroups.
2546 A read-only multiple values file which exists only on the root cgroup.
2553 A read-write single value file which exists on non-root
2554 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2560 "member" Non-root member of a partition
2565 A cpuset partition is a collection of cpuset-enabled cgroups with
2572 There are two types of partitions - local and remote. A local
2588 be changed. All other non-root cgroups start out as "member".
2591 partition or scheduling domain. The set of exclusive CPUs is
2601 two possible states - valid or invalid. An invalid partition
2612 "member" Non-root member of a partition
2639 A valid non-root parent partition may distribute out all its CPUs
2658 A user can pre-configure certain CPUs to an isolated state
2665 -----------------
2676 on the return value the attempt will succeed or fail with -EPERM.
2681 If the program returns 0, the attempt fails with -EPERM, otherwise it
2689 ----
2698 A read-write nested-keyed file that exists for all the cgroups
2719 A read-only file that describes current resource usage.
2728 ----
2738 A read-write nested-keyed file that exists for all the cgroups
2751 A read-only file that describes maximum region capacity.
2762 A read-only file that describes current resource usage.
2771 -------
2788 A read-only flat-keyed file which exists on non-root cgroups.
2801 use hugetlb pages are included. The per-node values are in bytes.
2804 ----
2826 A read-only flat-keyed file shown only in the root cgroup. It shows
2835 A read-only flat-keyed file shown in all cgroups. It shows
2843 A read-only flat-keyed file shown in all cgroups. It shows the
2852 A read-write flat-keyed file shown in non-root cgroups. Allowed
2871 A read-only flat-keyed file which exists on non-root cgroups. The
2894 ------
2905 Non-normative information
2906 -------------------------
2922 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2938 ------
2957 The path '/batchjobs/container_id1' can be considered as system-data
2962 # ls -l /proc/self/ns/cgroup
2963 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2969 # ls -l /proc/self/ns/cgroup
2970 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2974 When some thread from a multi-threaded process unshares its cgroup
2986 ------------------
2997 # ~/unshare -c # unshare cgroupns in some cgroup
3005 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
3036 ----------------------
3065 ---------------------------------
3068 running inside a non-init cgroup namespace::
3070 # mount -t cgroup2 none $MOUNT_POINT
3077 the view of cgroup hierarchy by namespace-private cgroupfs mount
3090 --------------------------------
3093 address_space_operations->writepages() to annotate bio's using the
3110 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
3127 - Multiple hierarchies including named ones are not supported.
3129 - All v1 mount options are not supported.
3131 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3133 - "cgroup.clone_children" is removed.
3135 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3143 --------------------
3196 ------------------
3204 Generally, in-process knowledge is available only to the process
3205 itself; thus, unlike service-level organization of processes,
3212 sub-hierarchies and control resource distributions along them. This
3213 effectively raised cgroup to the status of a syscall-like API exposed
3223 that the process would actually be operating on its own sub-hierarchy.
3227 system-management pseudo filesystem. cgroup ended up with interface
3230 individual applications through the ill-defined delegation mechanism
3240 -------------------------------------------
3251 cycles and the number of internal threads fluctuated - the ratios
3267 clearly defined. There were attempts to add ad-hoc behaviors and
3281 ----------------------
3285 was how an empty cgroup was notified - a userland helper binary was
3288 to in-kernel event delivery filtering mechanism further complicating
3310 ------------------------------
3317 global reclaim prefers is opt-in, rather than opt-out. The costs for
3327 becomes self-defeating.
3329 The memory.low boundary on the other hand is a top-down allocated
3330 reserve. A cgroup enjoys reclaim protection when it's within its
3367 new limit is met - or the task writing to memory.max is killed.
3376 groups can sabotage swapping by other means - such as referencing its
3377 anonymous memory in a tight loop - and an admin cannot assume full