Lines Matching +full:fails +full:- +full:without +full:- +full:test +full:- +full:cd
1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
22 1-1. Terminology
23 1-2. What is cgroup?
25 2-1. Mounting
26 2-2. Organizing Processes and Threads
27 2-2-1. Processes
28 2-2-2. Threads
29 2-3. [Un]populated Notification
30 2-4. Controlling Controllers
31 2-4-1. Availability
32 2-4-2. Enabling and Disabling
33 2-4-3. Top-down Constraint
34 2-4-4. No Internal Process Constraint
35 2-5. Delegation
36 2-5-1. Model of Delegation
37 2-5-2. Delegation Containment
38 2-6. Guidelines
39 2-6-1. Organize Once and Control
40 2-6-2. Avoid Name Collisions
42 3-1. Weights
43 3-2. Limits
44 3-3. Protections
45 3-4. Allocations
47 4-1. Format
48 4-2. Conventions
49 4-3. Core Interface Files
51 5-1. CPU
52 5-1-1. CPU Interface Files
53 5-2. Memory
54 5-2-1. Memory Interface Files
55 5-2-2. Usage Guidelines
56 5-2-3. Memory Ownership
57 5-3. IO
58 5-3-1. IO Interface Files
59 5-3-2. Writeback
60 5-3-3. IO Latency
61 5-3-3-1. How IO Latency Throttling Works
62 5-3-3-2. IO Latency Interface Files
63 5-3-4. IO Priority
64 5-4. PID
65 5-4-1. PID Interface Files
66 5-5. Cpuset
67 5-5-1. Cpuset Interface Files
68 5-6. Device controller
69 5-7. RDMA
70 5-7-1. RDMA Interface Files
71 5-8. DMEM
72 5-8-1. DMEM Interface Files
73 5-9. HugeTLB
74 5-9-1. HugeTLB Interface Files
75 5-10. Misc
76 5-10-1. Misc Interface Files
77 5-10-2. Migration and Ownership
78 5-11. Others
79 5-11-1. perf_event
80 5-N. Non-normative information
81 5-N-1. CPU controller root cgroup process behaviour
82 5-N-2. IO controller root cgroup process behaviour
84 6-1. Basics
85 6-2. The Root and Views
86 6-3. Migration and setns(2)
87 6-4. Interaction with Other Namespaces
89 P-1. Filesystem Support for Writeback
92 R-1. Multiple Hierarchies
93 R-2. Thread Granularity
94 R-3. Competition Between Inner Nodes and Threads
95 R-4. Other Interface Issues
96 R-5. Controller Issues and Remedies
97 R-5-1. Memory
104 -----------
113 ---------------
119 cgroup is largely composed of two parts - the core and controllers.
135 hierarchical - if a controller is enabled on a cgroup, it affects all
137 sub-hierarchy of the cgroup. When a controller is enabled on a nested
147 --------
152 # mount -t cgroup2 none $MOUNT_POINT
162 is no longer referenced in its current hierarchy. Because per-cgroup
169 to inter-controller dependencies, other controllers may need to be
190 ignored on non-init namespace mounts. Please refer to the
204 behaviour without this option is to include subtree counts.
207 option is ignored on non-init namespace mounts.
211 entire subtrees, without requiring explicit downward
215 behavior but is a mount-option to avoid regressing setups
229 controller. The pre-allocated pool does not belong to anyone.
240 reclaim attempt fails).
249 The option restores v1-like behavior of pids.events:max, that is only
250 local (inside cgroup proper) fork failures are counted. Without this
257 --------------------------------
263 A child cgroup can be created by creating a sub-directory::
268 structure. Each cgroup has a read-writable interface file
270 belong to the cgroup one-per-line. The PIDs are not ordered and the
301 0::/test-cgroup/test-cgroup-nested
308 0::/test-cgroup/test-cgroup-nested (deleted)
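The two sample lines above follow the `hierarchy-ID:controller-list:path` layout of "/proc/$PID/cgroup", where the cgroup v2 entry has hierarchy ID 0 and an empty controller list. The following is a minimal Python sketch of splitting such a line. The function name `parse_cgroup_v2_line` is hypothetical, and treating a trailing " (deleted)" as a marker is a simplifying assumption based on the sample output.

```python
def parse_cgroup_v2_line(line):
    """Split one /proc/$PID/cgroup line into (path, deleted).

    Returns None for lines that are not the cgroup v2 entry ("0::...").
    """
    hier_id, controllers, path = line.rstrip("\n").split(":", 2)
    if hier_id != "0" or controllers != "":
        return None  # a v1 hierarchy entry, not the unified hierarchy

    # Assumption: a trailing " (deleted)" always marks a removed cgroup.
    # A cgroup whose real name ends that way would be misreported.
    suffix = " (deleted)"
    deleted = path.endswith(suffix)
    if deleted:
        path = path[: -len(suffix)]
    return path, deleted
```

Called on the second sample line, the sketch reports the same path as the first with the deleted flag set.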
334 constraint - threaded controllers can be enabled on non-leaf cgroups
358 - As the cgroup will join the parent's resource domain. The parent
361 - When the parent is an unthreaded domain, it must not have any domain
365 Topology-wise, a cgroup can be in an invalid state. Please consider
368 A (threaded domain) - B (threaded) - C (domain, just created)
383 threads in the cgroup. Except that the operations are per-thread
384 instead of per-process, "cgroup.threads" has the same format and
406 between threads in a non-leaf cgroup and its child cgroups. Each
412 - cpu
413 - cpuset
414 - perf_event
415 - pids
418 --------------------------
420 Each non-root cgroup has a "cgroup.events" file which contains
421 "populated" field indicating whether the cgroup's sub-hierarchy has
425 example, to start a clean-up operation after all processes of a given
426 sub-hierarchy have exited. The populated state updates and
427 notifications are recursive. Consider the following sub-hierarchy
431 A(4) - B(0) - C(1)
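The recursive nature of the "populated" field can be illustrated with a small Python model of the three cgroups shown on the line above (A containing B, B containing C, with the process counts in parentheses). This is only a sketch of the rule, not how the kernel tracks the state; the `populated` helper and the two dictionaries are hypothetical.

```python
def populated(counts, children, name):
    """Return 1 if `name` or any of its descendants holds a live process."""
    if counts[name] > 0:
        return 1
    return int(any(populated(counts, children, child)
                   for child in children.get(name, [])))


# Process counts and parent -> children links for the chain A - B - C.
counts = {"A": 4, "B": 0, "C": 1}
children = {"A": ["B"], "B": ["C"]}

before = [populated(counts, children, n) for n in ("A", "B", "C")]

counts["C"] = 0  # the single process in C exits
after = [populated(counts, children, n) for n in ("A", "B", "C")]
```

In this model B is populated only because of the process in C, so once that process exits both B and C flip to 0 while A stays at 1.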
441 -----------------------
464 # echo "+cpu +memory -io" > cgroup.subtree_control
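The effect of a write such as the one above can be modelled as set operations on the enabled controllers. The Python sketch below is an illustration only: `apply_subtree_control` is a hypothetical name, and the kernel performs further checks (for example controller availability) that are not modelled here.

```python
def apply_subtree_control(enabled, spec):
    """Return the controller set after applying a spec like "+cpu +memory -io".

    The input set is never modified, so a malformed spec leaves the
    caller's state untouched.
    """
    result = set(enabled)
    for token in spec.split():
        op, name = token[0], token[1:]
        if op not in "+-" or not name:
            raise ValueError(f"bad token: {token!r}")
        if op == "+":
            result.add(name)
        else:
            result.discard(name)  # disabling an already-disabled controller is a no-op
    return result
```

Because tokens are applied in order, a controller listed more than once ends up in the state requested by its last occurrence.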
473 Consider the following sub-hierarchy. The enabled controllers are
476 A(cpu,memory) - B(memory) - C()
490 controller interface files - anything which doesn't start with
494 Top-down Constraint
497 Resources are distributed top-down and a cgroup can further distribute
499 parent. This means that all non-root "cgroup.subtree_control" files
509 Non-root cgroups can distribute domain resources to their children
524 refer to the Non-normative information section in the Controllers
537 ----------
559 delegated, the user can build sub-hierarchy under the directory,
563 happens in the delegated sub-hierarchy, nothing can escape the
567 cgroups in or nesting depth of a delegated sub-hierarchy; however,
574 A delegated sub-hierarchy is contained in the sense that processes
575 can't be moved into or out of the sub-hierarchy by the delegatee.
578 requiring the following conditions for a process with a non-root euid
582 - The writer must have write access to the "cgroup.procs" file.
584 - The writer must have write access to the "cgroup.procs" file of the
588 processes around freely in the delegated sub-hierarchy it can't pull
589 in from or push out to outside the sub-hierarchy.
595 ~~~~~~~~~~~~~ - C0 - C00
598 ~~~~~~~~~~~~~ - C1 - C10
605 will be denied with -EACCES.
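The containment check hinges on the common ancestor of the source and destination cgroups. A minimal Python sketch of locating that ancestor from two cgroup paths is given below; the `/parent/...` paths are hypothetical stand-ins for the hierarchy in the diagram, and the actual permission check is left to the kernel.

```python
import posixpath


def common_ancestor(src, dst):
    """Return the deepest cgroup path that contains both src and dst."""
    return posixpath.commonpath([src, dst])


# Moving between two separately delegated subtrees: the common ancestor
# lies above both delegation points.
across = common_ancestor("/parent/C1/C10", "/parent/C0/C00")

# Moving inside one delegated subtree: the common ancestor is within it.
inside = common_ancestor("/parent/C0/C00", "/parent/C0/C01")
```

In the first case the writer would need write access to "cgroup.procs" of a cgroup outside what was delegated, which is why such a migration is denied.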
610 is not reachable, the migration is rejected with -ENOENT.
614 ----------
622 inherent trade-offs between migration and various hot paths in terms
628 resource structure once on start-up. Dynamic adjustments to resource
661 -------
667 work-conserving. Due to the dynamic nature, this model is usually
682 .. _cgroupv2-limits-distributor:
685 ------
688 Limits can be over-committed - the sum of the limits of children can
693 As limits can be over-committed, all configuration combinations are
700 .. _cgroupv2-protections-distributor:
703 -----------
708 soft boundaries. Protections can also be over-committed in which case
715 As protections can be over-committed, all configuration combinations
719 "memory.low" implements best-effort memory protection and is an
724 -----------
727 resource. Allocations can't be over-committed - the sum of the
734 As allocations can't be over-committed, some configuration
739 "cpu.rt.max" hard-allocates realtime slices and is an example of this
747 ------
752 New-line separated values
760 (when read-only or multiple values can be written at once)
786 -----------
788 - Settings for a single feature should be contained in a single file.
790 - The root cgroup should be exempt from resource control and thus
793 - The default time unit is microseconds. If a different unit is ever
796 - A parts-per quantity should use a percentage decimal with at least
797 two digit fractional part - e.g. 13.40.
799 - If a controller implements weight based resource distribution, its
805 - If a controller implements an absolute resource guarantee and/or
814 - If a setting has a configurable default value and keyed specific
828 # cat cgroup-example-interface-file
834 # echo 125 > cgroup-example-interface-file
838 # echo "default 125" > cgroup-example-interface-file
842 # echo "8:16 170" > cgroup-example-interface-file
846 # echo "8:0 default" > cgroup-example-interface-file
847 # cat cgroup-example-interface-file
851 - For events which are not very high frequency, an interface file
858 --------------------
863 A read-write single value file which exists on non-root
869 - "domain" : A normal valid domain cgroup.
871 - "domain threaded" : A threaded domain cgroup which is
874 - "domain invalid" : A cgroup which is in an invalid state.
878 - "threaded" : A threaded cgroup which is a member of a
885 A read-write new-line separated values file which exists on
889 the cgroup one-per-line. The PIDs are not ordered and the
898 - It must have write access to the "cgroup.procs" file.
900 - It must have write access to the "cgroup.procs" file of the
903 When delegating a sub-hierarchy, write access to this file
906 In a threaded cgroup, reading this file fails with EOPNOTSUPP
911 A read-write new-line separated values file which exists on
915 the cgroup one-per-line. The TIDs are not ordered and the
924 - It must have write access to the "cgroup.threads" file.
926 - The cgroup that the thread is currently in must be in the
929 - It must have write access to the "cgroup.procs" file of the
932 When delegating a sub-hierarchy, write access to this file
936 A read-only space separated values file which exists on all
943 A read-write space separated values file which exists on all
950 Space separated list of controllers prefixed with '+' or '-'
952 name prefixed with '+' enables the controller and '-'
958 A read-only flat-keyed file which exists on non-root cgroups.
970 A read-write single value file. The default is "max".
977 A read-write single value file. The default is "max".
984 A read-only flat-keyed file with the following entries:
1010 A read-only flat-keyed file which exists in non-root cgroups.
1023 ab cd
1028 A read-write single value file which exists on non-root cgroups.
1051 create new sub-cgroups.
1054 A write-only single value file which exists in non-root cgroups.
1064 In a threaded cgroup, writing this file fails with EOPNOTSUPP as
1066 the whole thread-group.
1069 A read-write single value file whose allowed values are "0" and "1".
1073 Writing "1" to the file will re-enable the cgroup PSI accounting.
1081 This may cause non-negligible overhead for some workloads when under
1083 be used to disable PSI accounting in the non-leaf cgroups.
1086 A read-write nested-keyed file.
1094 .. _cgroup-v2-cpu:
1097 ---
1115 management software may already have placed RT processes into non-root cgroups
1134 * Processes under the fair-class scheduler
1137 without the ``cgroup_set_weight`` callback
1139 For details on when a process is under the fair-class scheduler or a BPF scheduler,
1140 check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`.
1146 A read-only flat-keyed file.
1152 - usage_usec
1153 - user_usec
1154 - system_usec
1157 only the processes under the fair-class scheduler:
1159 - nr_periods
1160 - nr_throttled
1161 - throttled_usec
1162 - nr_bursts
1163 - burst_usec
1166 A read-write single value file which exists on non-root
1175 This file affects only processes under the fair-class scheduler and a BPF
1180 A read-write single value file which exists on non-root
1183 The nice value is in the range [-20, 19].
1191 This file affects only processes under the fair-class scheduler and a BPF
1196 A read-write two value file which exists on non-root cgroups.
1207 This file affects only processes under the fair-class scheduler.
1210 A read-write single value file which exists on non-root
1215 This file affects only processes under the fair-class scheduler.
1218 A read-write nested-keyed file.
1226 A read-write single value file which exists on non-root cgroups.
1244 A read-write single value file which exists on non-root cgroups.
1258 A read-write single value file which exists on non-root cgroups.
1261 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1267 This file affects only processes under the fair-class scheduler.
1270 ------
1278 While not completely water-tight, all major memory usages by a given
1283 - Userland memory - page cache and anonymous memory.
1285 - Kernel data structures such as dentries and inodes.
1287 - TCP socket buffers.
1300 A read-only single value file which exists on non-root
1307 A read-write single value file which exists on non-root
1333 A read-write single value file which exists on non-root
1336 Best-effort memory protection. If the memory usage of a
1356 A read-write single value file which exists on non-root
1371 need to dynamically adjust the job's memory limits without
1379 busy-hitting its memory to slow down reclaim.
1382 A read-write single value file which exists on non-root
1391 In default configuration regular 0-order allocations always
1396 as -ENOMEM or silently ignore in cases like disk readahead.
1399 reclaim and oom-kill are bypassed. This is useful for admin
1401 without expending their own CPU resources on memory reclamation.
1402 The job will trigger the reclaim and/or oom-kill on its next
1408 busy-hitting its memory to slow down reclaim.
1411 A write-only nested-keyed file which exists for all cgroups.
1422 specified amount, -EAGAIN is returned.
1442 The valid range for swappiness is [0-200, max], setting
1446 A read-write single value file which exists on non-root cgroups.
1451 A write of any non-empty string to this file resets it to the
1456 A read-write single value file which exists on non-root
1466 Tasks with the OOM protection (oom_score_adj set to -1000)
1474 A read-only flat-keyed file which exists on non-root cgroups.
1488 boundary is over-committed.
1501 fails to bring it down, the cgroup goes to OOM state.
1508 considered as an option, e.g. for failed high-order
1524 A read-only flat-keyed file which exists on non-root cgroups.
1527 types of memory, type-specific details, and other information
1536 If the entry has no per-node counter (or is not shown in the
1537 memory.numa_stat). We use 'npn' (non-per-node) as the tag
1568 Amount of memory used for storing per-cpu kernel
1578 Amount of cached filesystem data that is swap-backed,
1618 Amount of memory, swap-backed and filesystem-backed,
1624 the value for the foo counter, since the foo counter is type-based, not
1625 list-based.
1636 Amount of memory used for storing in-kernel data
1726 Number of zero-filled pages swapped out with I/O skipped due to the
1750 without splitting.
1785 A read-only nested-keyed file which exists on non-root cgroups.
1788 types of memory, type-specific details, and other information
1810 A read-only single value file which exists on non-root
1817 A read-write single value file which exists on non-root
1822 allow userspace to implement custom out-of-memory procedures.
1833 A read-write single value file which exists on non-root cgroups.
1838 A write of any non-empty string to this file resets it to the
1843 A read-write single value file which exists on non-root
1850 A read-only flat-keyed file which exists on non-root cgroups.
1866 because of running out of swap system-wide or max
1875 A read-only single value file which exists on non-root
1882 A read-write single value file which exists on non-root
1890 A read-write single value file. The default value is "1".
1908 A read-only nested-keyed file.
1918 Over-committing on high limit (sum of high limits > available memory)
1932 pressure - how much the workload is being impacted due to lack of
1933 memory - is necessary to determine whether a workload needs more
1947 To which cgroup the area will be charged is nondeterministic; however,
1958 --
1963 only if cfq-iosched is in use and neither scheme is available for
1964 blk-mq devices.
1971 A read-only nested-keyed file.
1991 A read-write nested-keyed file which exists only on the root
2003 enable Weight-based control enable
2035 devices which show wide temporary behavior changes - e.g. a
2046 A read-write nested-keyed file which exists only on the root
2059 model The cost model in use - "linear"
2085 generate device-specific coefficients.
2088 A read-write flat-keyed file which exists on non-root cgroups.
2092 without specific override. The rest are overrides keyed by
2108 A read-write nested-keyed file which exists on non-root
2122 When writing, any number of nested key-value pairs can be
2147 A read-only nested-keyed file.
2166 writes out dirty pages for the memory domain. Both system-wide and
2167 per-cgroup dirty memory states are examined and the more restrictive
2205 memory controller and system-wide clean memory.
2238 your real setting, setting at 10-15% higher than the value in io.stat.
2248 - Queue depth throttling. This is the number of outstanding IOs a group is
2252 - Artificial delay induction. There are certain types of IO that cannot be
2253 throttled without possibly adversely affecting higher priority groups. This
2299 no-change
2302 promote-to-rt
2303 For requests that have a non-RT I/O priority class, change it into RT.
2307 restrict-to-be
2317 none-to-rt
2318 Deprecated. Just an alias for promote-to-rt.
2322 +----------------+---+
2323 | no-change | 0 |
2324 +----------------+---+
2325 | promote-to-rt | 1 |
2326 +----------------+---+
2327 | restrict-to-be | 2 |
2328 +----------------+---+
2330 +----------------+---+
2334 +-------------------------------+---+
2336 +-------------------------------+---+
2337 | IOPRIO_CLASS_RT (real-time) | 1 |
2338 +-------------------------------+---+
2340 +-------------------------------+---+
2342 +-------------------------------+---+
2346 - If I/O priority class policy is promote-to-rt, change the request I/O
2349 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2355 ---
2374 A read-write single value file which exists on non-root
2380 A read-only single value file which exists on non-root cgroups.
2386 A read-only single value file which exists on non-root cgroups.
2392 A read-only flat-keyed file which exists on non-root cgroups. Unless
2410 through fork() or clone(). These will return -EAGAIN if the creation
2415 ------
2422 memory placement to reduce cross-node memory access and contention
2433 A read-write multiple values file which exists on non-root
2434 cpuset-enabled cgroups.
2441 The CPU numbers are comma-separated numbers or ranges.
2445 0-4,6,8-10
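A list in this form can be expanded into individual CPU numbers with a few lines of Python. The sketch below handles only the plain numbers and ranges described above; `parse_cpu_list` is a hypothetical helper name, not part of any kernel or library interface.

```python
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into a sorted list of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string or a stray comma
        low, sep, high = part.partition("-")
        first = int(low)
        last = int(high) if sep else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        cpus.update(range(first, last + 1))
    return sorted(cpus)
```

For the example above, `parse_cpu_list("0-4,6,8-10")` yields CPUs 0 through 4, 6, and 8 through 10.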
2448 setting as the nearest cgroup ancestor with a non-empty
2455 A read-only multiple values file which exists on all
2456 cpuset-enabled cgroups.
2472 A read-write multiple values file which exists on non-root
2473 cpuset-enabled cgroups.
2480 The memory node numbers are comma-separated numbers or ranges.
2484 0-1,3
2487 setting as the nearest cgroup ancestor with a non-empty
2494 Setting a non-empty value to "cpuset.mems" causes memory of
2506 A read-only multiple values file which exists on all
2507 cpuset-enabled cgroups.
2522 A read-write multiple values file which exists on non-root
2523 cpuset-enabled cgroups.
2556 A read-only multiple values file which exists on all non-root
2557 cpuset-enabled cgroups.
2569 A read-only and root cgroup only multiple values file.
2576 A read-write single value file which exists on non-root
2577 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2583 "member" Non-root member of a partition
2585 "isolated" Partition root without load balancing
2588 A cpuset partition is a collection of cpuset-enabled cgroups with
2595 There are two types of partitions - local and remote. A local
2611 be changed. All other non-root cgroups start out as "member".
2618 an isolated state without any load balancing from the scheduler
2624 two possible states - valid or invalid. An invalid partition
2635 "member" Non-root member of a partition
2637 "isolated" Partition root without load balancing
2662 A valid non-root parent partition may distribute out all its CPUs
2678 to "cpuset.cpus.partition" without the need to do continuous
2681 A user can pre-configure certain CPUs to an isolated state
2688 -----------------
2699 on the return value the attempt will succeed or fail with -EPERM.
2704 If the program returns 0, the attempt fails with -EPERM, otherwise it
2712 ----
2721 A read-write nested-keyed file that exists for all the cgroups
2742 A read-only file that describes current resource usage.
2751 ----
2761 A read-write nested-keyed file that exists for all the cgroups
2774 A read-only file that describes maximum region capacity.
2785 A read-only file that describes current resource usage.
2794 -------
2811 A read-only flat-keyed file which exists on non-root cgroups.
2824 use hugetlb pages are included. The per-node values are in bytes.
2827 ----
2849 A read-only flat-keyed file shown only in the root cgroup. It shows
2858 A read-only flat-keyed file shown in all cgroups. It shows
2866 A read-only flat-keyed file shown in all cgroups. It shows the
2875 A read-write flat-keyed file shown in non-root cgroups. Allowed
2894 A read-only flat-keyed file which exists on non-root cgroups. The
2917 ------
2928 Non-normative information
2929 -------------------------
2945 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2961 ------
2971 Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
2980 The path '/batchjobs/container_id1' can be considered as system-data
2985 # ls -l /proc/self/ns/cgroup
2986 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2992 # ls -l /proc/self/ns/cgroup
2993 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2997 When some thread from a multi-threaded process unshares its cgroup
3009 ------------------
3020 # ~/unshare -c # unshare cgroupns in some cgroup
3028 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
3059 ----------------------
3088 ---------------------------------
3091 running inside a non-init cgroup namespace::
3093 # mount -t cgroup2 none $MOUNT_POINT
3100 the view of cgroup hierarchy by namespace-private cgroupfs mount
3113 --------------------------------
3116 address_space_operations->writepages() to annotate bio's using the
3133 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
3150 - Multiple hierarchies including named ones are not supported.
3152 - No v1 mount options are supported.
3154 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3156 - "cgroup.clone_children" is removed.
3158 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3166 --------------------
3219 ------------------
3227 Generally, in-process knowledge is available only to the process
3228 itself; thus, unlike service-level organization of processes,
3235 sub-hierarchies and control resource distributions along them. This
3236 effectively raised cgroup to the status of a syscall-like API exposed
3246 that the process would actually be operating on its own sub-hierarchy.
3250 system-management pseudo filesystem. cgroup ended up with interface
3253 individual applications through the ill-defined delegation mechanism
3255 without going through the required scrutiny.
3263 -------------------------------------------
3274 cycles and the number of internal threads fluctuated - the ratios
3290 clearly defined. There were attempts to add ad-hoc behaviors and
3304 ----------------------
3306 cgroup v1 grew without oversight and developed a large number of
3308 was how an empty cgroup was notified - a userland helper binary was
3311 to in-kernel event delivery filtering mechanism further complicating
3333 ------------------------------
3340 global reclaim prefers is opt-in, rather than opt-out. The costs for
3350 becomes self-defeating.
3352 The memory.low boundary on the other hand is a top-down allocated
3390 new limit is met - or the task writing to memory.max is killed.
3399 groups can sabotage swapping by other means - such as referencing its
3400 anonymous memory in a tight loop - and an admin cannot assume full