.. _cpusets:

=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

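As a rough illustration of the filtering described above, the following
sketch models CPU sets as plain Python sets.  The function name
``filter_affinity_request`` is hypothetical and is not a kernel or libc
interface.  Raising an error when nothing remains is an assumption of this
model, loosely following the EINVAL case documented for sched_setaffinity(2).

```python
def filter_affinity_request(requested, cpuset_cpus):
    """Model of filtering an affinity request through a task's cpuset.

    Both arguments are iterables of CPU numbers.  The result keeps only
    the requested CPUs that the cpuset allows.
    """
    allowed = set(requested) & set(cpuset_cpus)
    if not allowed:
        # Assumption of this model: an empty result is reported as an error.
        raise ValueError("no requested CPU is allowed by the cpuset")
    return allowed


if __name__ == "__main__":
    # A task in a cpuset holding CPUs 0-3 asks for CPUs 2, 3 and 8.
    print(sorted(filter_affinity_request({2, 3, 8}, range(4))))
```

The same intersection applies, in this model, to Memory Nodes requested
through mbind(2) or set_mempolicy(2).
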
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs. The location of the running jobs' pages may also be moved
when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

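The two formats carry the same information.  The sketch below decodes both
into Python sets.  The helper names ``parse_mask`` and ``parse_list`` are
hypothetical.  The sketch assumes that the mask is a sequence of
comma-separated hexadecimal words, most significant word first, and that
every word after the first is zero-padded to eight digits.

```python
def parse_mask(mask):
    """Decode a mask such as 'ffffffff,ffffffff' into a set of bit numbers.

    Assumes every word after the first is zero-padded to 8 hex digits,
    so the words can be joined into one large hexadecimal number.
    """
    value = int(mask.strip().replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def parse_list(text):
    """Decode a list such as '0-3,8,10-11' into a set of numbers."""
    result = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        first, _, last = part.partition("-")
        result.update(range(int(first), int(last or first) + 1))
    return result


if __name__ == "__main__":
    # Values taken from the example above.
    cpus = parse_mask("ffffffff,ffffffff,ffffffff,ffffffff")
    print(cpus == parse_list("0-127"))
```
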
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag:  is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

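A minimal sketch of that sequence from user space follows.  The function
name ``create_cpuset`` and the cpuset name ``job_a`` are hypothetical.  The
sketch assumes the hierarchy is mounted at /dev/cpuset and that the files
carry the ``cpuset.`` prefix listed above.  On a real system the caller
needs write permission on the parent cpuset directory.

```python
from pathlib import Path


def create_cpuset(root, name, cpus, mems):
    """Create a child cpuset under `root` and set its CPUs and Memory Nodes.

    `root` is the directory where the cpuset hierarchy is mounted.
    `cpus` and `mems` are list-format strings such as "2-3" and "1".
    """
    path = Path(root) / name
    path.mkdir()  # the cpuset is created by making its directory
    (path / "cpuset.cpus").write_text(cpus + "\n")
    (path / "cpuset.mems").write_text(mems + "\n")
    return path


if __name__ == "__main__":
    # Assumes a mounted hierarchy and sufficient privileges.
    create_cpuset("/dev/cpuset", "job_a", "2-3", "1")
```
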
The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

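The three rules listed above can be sketched as a check that only looks at
the parent and its existing children, which is why no scan of the whole
hierarchy is needed.  The ``Cpuset`` class and ``can_add_child`` function
are a hypothetical toy model, not the kernel's data structures.  The model
assumes that an overlap is rejected when either of the two siblings is
exclusive.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    """Toy model of one cpuset; not the kernel's data structure."""
    cpus: frozenset
    mems: frozenset
    cpu_exclusive: bool = False
    mem_exclusive: bool = False
    children: list = field(default_factory=list)


def can_add_child(parent, child):
    """Return True if `child` satisfies the three rules under `parent`."""
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not (child.cpus <= parent.cpus and child.mems <= parent.mems):
        return False
    # Rule 2: a cpuset can't be marked exclusive unless its parent is.
    if child.cpu_exclusive and not parent.cpu_exclusive:
        return False
    if child.mem_exclusive and not parent.mem_exclusive:
        return False
    # Rule 3: exclusive resources may not overlap any sibling.
    for sib in parent.children:
        if (child.cpu_exclusive or sib.cpu_exclusive) and child.cpus & sib.cpus:
            return False
        if (child.mem_exclusive or sib.mem_exclusive) and child.mems & sib.mems:
            return False
    return True
```
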
The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.

The cpuset.effective_cpus and cpuset.effective_mems files are
normally read-only copies of cpuset.cpus and cpuset.mems files
respectively.  If the cpuset cgroup filesystem is mounted with the
special "cpuset_v2_mode" option, the behavior of these files will become
similar to the corresponding files in cpuset v2.  In other words, hotplug
events will not change cpuset.cpus and cpuset.mems.  Those events will
only affect cpuset.effective_cpus and cpuset.effective_mems which show
the actual cpus and memory nodes that are currently used by this cpuset.
See Documentation/admin-guide/cgroup-v2.rst for more information about
cpuset v2 behavior.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


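To make the "half-life of 10 seconds" and "times 1000" wording concrete,
the sketch below models a decaying rate meter in floating point.  It is an
illustration only and is not the kernel's filter: the class name
``PressureMeter``, the one-second tick and the use of floats are all
assumptions of this sketch.

```python
class PressureMeter:
    """Floating-point model of a decaying event-rate meter.

    The reading is the recent rate of events per second, times 1000,
    with a half-life of 10 seconds.
    """

    HALF_LIFE = 10.0                    # seconds
    DECAY = 0.5 ** (1.0 / HALF_LIFE)    # per-second decay factor
    SCALE = 1000                        # reading is the rate times 1000

    def __init__(self):
        self.value = 0.0    # current filtered reading
        self.pending = 0    # events seen since the last tick

    def mark_event(self):
        """Record one direct reclaim attempt."""
        self.pending += 1

    def tick(self):
        """Advance the meter by one second and return the reading."""
        self.value = (self.value * self.DECAY
                      + (1.0 - self.DECAY) * self.pending * self.SCALE)
        self.pending = 0
        return self.value
```

In this model a steady rate of two reclaims per second settles at a reading
close to 2000, and a reading left without new events halves every ten ticks.
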
1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

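The rotor-based selection described in this section can be sketched as a
round-robin walk over the allowed nodes.  The functions
``next_spread_node`` and ``spread`` are hypothetical names for this model,
and the starting rotor value is an arbitrary assumption.  The sketch does
not reproduce the kernel routine.

```python
def next_spread_node(rotor, mems_allowed):
    """Return the next node after `rotor` in `mems_allowed`, wrapping
    around to the lowest allowed node when the end is reached."""
    nodes = sorted(mems_allowed)
    if not nodes:
        raise ValueError("mems_allowed is empty")
    for node in nodes:
        if node > rotor:
            return node
    return nodes[0]


def spread(first_rotor, mems_allowed, count):
    """Yield the nodes chosen for `count` successive allocations."""
    rotor = first_rotor
    for _ in range(count):
        rotor = next_spread_node(rotor, mems_allowed)
        yield rotor


if __name__ == "__main__":
    # A task allowed on nodes 0, 2 and 5, with the rotor starting at 0.
    print(list(spread(0, {0, 2, 5}, 5)))
```
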
389da82c92fSMauro Carvalho Chehab1.7 What is sched_load_balance ?
390da82c92fSMauro Carvalho Chehab--------------------------------
391da82c92fSMauro Carvalho Chehab
392da82c92fSMauro Carvalho ChehabThe kernel scheduler (kernel/sched/core.c) automatically load balances
393da82c92fSMauro Carvalho Chehabtasks.  If one CPU is underutilized, kernel code running on that
394da82c92fSMauro Carvalho ChehabCPU will look for tasks on other more overloaded CPUs and move those
395da82c92fSMauro Carvalho Chehabtasks to itself, within the constraints of such placement mechanisms
396da82c92fSMauro Carvalho Chehabas cpusets and sched_setaffinity.
397da82c92fSMauro Carvalho Chehab
398da82c92fSMauro Carvalho ChehabThe algorithmic cost of load balancing and its impact on key shared
399da82c92fSMauro Carvalho Chehabkernel data structures such as the task list increases more than
400da82c92fSMauro Carvalho Chehablinearly with the number of CPUs being balanced.  So the scheduler
401da82c92fSMauro Carvalho Chehabhas support to partition the systems CPUs into a number of sched
402da82c92fSMauro Carvalho Chehabdomains such that it only load balances within each sched domain.
403da82c92fSMauro Carvalho ChehabEach sched domain covers some subset of the CPUs in the system;
404da82c92fSMauro Carvalho Chehabno two sched domains overlap; some CPUs might not be in any sched
405da82c92fSMauro Carvalho Chehabdomain and hence won't be load balanced.
406da82c92fSMauro Carvalho Chehab
407da82c92fSMauro Carvalho ChehabPut simply, it costs less to balance between two smaller sched domains
408da82c92fSMauro Carvalho Chehabthan one big one, but doing so means that overloads in one of the
409da82c92fSMauro Carvalho Chehabtwo domains won't be load balanced to the other one.
410da82c92fSMauro Carvalho Chehab
411da82c92fSMauro Carvalho ChehabBy default, there is one sched domain covering all CPUs, including those
412da82c92fSMauro Carvalho Chehabmarked isolated using the kernel boot time "isolcpus=" argument. However,
413da82c92fSMauro Carvalho Chehabthe isolated CPUs will not participate in load balancing, and will not
414da82c92fSMauro Carvalho Chehabhave tasks running on them unless explicitly assigned.
415da82c92fSMauro Carvalho Chehab
416da82c92fSMauro Carvalho ChehabThis default load balancing across all CPUs is not well suited for
417da82c92fSMauro Carvalho Chehabthe following two situations:
418da82c92fSMauro Carvalho Chehab
419da82c92fSMauro Carvalho Chehab 1) On large systems, load balancing across many CPUs is expensive.
420da82c92fSMauro Carvalho Chehab    If the system is managed using cpusets to place independent jobs
421da82c92fSMauro Carvalho Chehab    on separate sets of CPUs, full load balancing is unnecessary.
422da82c92fSMauro Carvalho Chehab 2) Systems supporting realtime on some CPUs need to minimize
423da82c92fSMauro Carvalho Chehab    system overhead on those CPUs, including avoiding task load
424da82c92fSMauro Carvalho Chehab    balancing if that is not needed.
425da82c92fSMauro Carvalho Chehab
When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.
431da82c92fSMauro Carvalho Chehab
432da82c92fSMauro Carvalho ChehabWhen the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
433da82c92fSMauro Carvalho Chehabscheduler will avoid load balancing across the CPUs in that cpuset,
434da82c92fSMauro Carvalho Chehab--except-- in so far as is necessary because some overlapping cpuset
435da82c92fSMauro Carvalho Chehabhas "sched_load_balance" enabled.
436da82c92fSMauro Carvalho Chehab
437da82c92fSMauro Carvalho ChehabSo, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
438da82c92fSMauro Carvalho Chehabenabled, then the scheduler will have one sched domain covering all
439da82c92fSMauro Carvalho ChehabCPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
440da82c92fSMauro Carvalho Chehabcpusets won't matter, as we're already fully load balancing.
441da82c92fSMauro Carvalho Chehab
442da82c92fSMauro Carvalho ChehabTherefore in the above two situations, the top cpuset flag
443da82c92fSMauro Carvalho Chehab"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
444da82c92fSMauro Carvalho Chehabchild cpusets have this flag enabled.
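
As a rough sketch of that arrangement, the helper below clears the flag in
the top cpuset and sets it in each named child.  The function name, the
CPUSET_ROOT variable and the child names are assumptions made for
illustration; the children are assumed to exist already and to have
non-overlapping 'cpuset.cpus'.

```shell
# Sketch only: CPUSET_ROOT defaults to the mount point used in this document.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}"

# partition_balancing CHILD...
# Disable load balancing in the top cpuset, then enable it in each
# named child cpuset (names are relative to CPUSET_ROOT).
partition_balancing() {
    echo 0 > "$CPUSET_ROOT/cpuset.sched_load_balance" || return 1
    for child in "$@"; do
        echo 1 > "$CPUSET_ROOT/$child/cpuset.sched_load_balance" || return 1
    done
}
```

It would be called as, for example, ``partition_balancing jobA jobB``, where
jobA and jobB are hypothetical child cpusets.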
445da82c92fSMauro Carvalho Chehab
446da82c92fSMauro Carvalho ChehabWhen doing this, you don't usually want to leave any unpinned tasks in
447da82c92fSMauro Carvalho Chehabthe top cpuset that might use non-trivial amounts of CPU, as such tasks
448da82c92fSMauro Carvalho Chehabmay be artificially constrained to some subset of CPUs, depending on
449da82c92fSMauro Carvalho Chehabthe particulars of this flag setting in descendant cpusets.  Even if
450da82c92fSMauro Carvalho Chehabsuch a task could use spare CPU cycles in some other CPUs, the kernel
451da82c92fSMauro Carvalho Chehabscheduler might not consider the possibility of load balancing that
452da82c92fSMauro Carvalho Chehabtask to that underused CPU.
453da82c92fSMauro Carvalho Chehab
454da82c92fSMauro Carvalho ChehabOf course, tasks pinned to a particular CPU can be left in a cpuset
455da82c92fSMauro Carvalho Chehabthat disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
456da82c92fSMauro Carvalho Chehabelse anyway.
457da82c92fSMauro Carvalho Chehab
458da82c92fSMauro Carvalho ChehabThere is an impedance mismatch here, between cpusets and sched domains.
459da82c92fSMauro Carvalho ChehabCpusets are hierarchical and nest.  Sched domains are flat; they don't
460da82c92fSMauro Carvalho Chehaboverlap and each CPU is in at most one sched domain.
461da82c92fSMauro Carvalho Chehab
462da82c92fSMauro Carvalho ChehabIt is necessary for sched domains to be flat because load balancing
463da82c92fSMauro Carvalho Chehabacross partially overlapping sets of CPUs would risk unstable dynamics
464da82c92fSMauro Carvalho Chehabthat would be beyond our understanding.  So if each of two partially
465da82c92fSMauro Carvalho Chehaboverlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
466da82c92fSMauro Carvalho Chehabform a single sched domain that is a superset of both.  We won't move
467da82c92fSMauro Carvalho Chehaba task to a CPU outside its cpuset, but the scheduler load balancing
468da82c92fSMauro Carvalho Chehabcode might waste some compute cycles considering that possibility.
469da82c92fSMauro Carvalho Chehab
470da82c92fSMauro Carvalho ChehabThis mismatch is why there is not a simple one-to-one relation
471da82c92fSMauro Carvalho Chehabbetween which cpusets have the flag "cpuset.sched_load_balance" enabled,
472da82c92fSMauro Carvalho Chehaband the sched domain configuration.  If a cpuset enables the flag, it
473da82c92fSMauro Carvalho Chehabwill get balancing across all its CPUs, but if it disables the flag,
474da82c92fSMauro Carvalho Chehabit will only be assured of no load balancing if no other overlapping
475da82c92fSMauro Carvalho Chehabcpuset enables the flag.
476da82c92fSMauro Carvalho Chehab
477da82c92fSMauro Carvalho ChehabIf two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
478da82c92fSMauro Carvalho Chehabone of them has this flag enabled, then the other may find its
479da82c92fSMauro Carvalho Chehabtasks only partially load balanced, just on the overlapping CPUs.
480da82c92fSMauro Carvalho ChehabThis is just the general case of the top_cpuset example given a few
481da82c92fSMauro Carvalho Chehabparagraphs above.  In the general case, as in the top cpuset case,
482da82c92fSMauro Carvalho Chehabdon't leave tasks that might use non-trivial amounts of CPU in
483da82c92fSMauro Carvalho Chehabsuch partially load balanced cpusets, as they may be artificially
484da82c92fSMauro Carvalho Chehabconstrained to some subset of the CPUs allowed to them, for lack of
485da82c92fSMauro Carvalho Chehabload balancing to the other CPUs.
486da82c92fSMauro Carvalho Chehab
487da82c92fSMauro Carvalho ChehabCPUs in "cpuset.isolcpus" were excluded from load balancing by the
488da82c92fSMauro Carvalho Chehabisolcpus= kernel boot option, and will never be load balanced regardless
489da82c92fSMauro Carvalho Chehabof the value of "cpuset.sched_load_balance" in any cpuset.
490da82c92fSMauro Carvalho Chehab
1.7.1 sched_load_balance implementation details.
------------------------------------------------
493da82c92fSMauro Carvalho Chehab
494da82c92fSMauro Carvalho ChehabThe per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
495da82c92fSMauro Carvalho Chehabto most cpuset flags.)  When enabled for a cpuset, the kernel will
496da82c92fSMauro Carvalho Chehabensure that it can load balance across all the CPUs in that cpuset
497da82c92fSMauro Carvalho Chehab(makes sure that all the CPUs in the cpus_allowed of that cpuset are
498da82c92fSMauro Carvalho Chehabin the same sched domain.)
499da82c92fSMauro Carvalho Chehab
500da82c92fSMauro Carvalho ChehabIf two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
501da82c92fSMauro Carvalho Chehabthen they will be (must be) both in the same sched domain.
502da82c92fSMauro Carvalho Chehab
503da82c92fSMauro Carvalho ChehabIf, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
504da82c92fSMauro Carvalho Chehabthen by the above that means there is a single sched domain covering
505da82c92fSMauro Carvalho Chehabthe whole system, regardless of any other cpuset settings.
506da82c92fSMauro Carvalho Chehab
507da82c92fSMauro Carvalho ChehabThe kernel commits to user space that it will avoid load balancing
508da82c92fSMauro Carvalho Chehabwhere it can.  It will pick as fine a granularity partition of sched
509da82c92fSMauro Carvalho Chehabdomains as it can while still providing load balancing for any set
510da82c92fSMauro Carvalho Chehabof CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
511da82c92fSMauro Carvalho Chehab
512da82c92fSMauro Carvalho ChehabThe internal kernel cpuset to scheduler interface passes from the
513da82c92fSMauro Carvalho Chehabcpuset code to the scheduler code a partition of the load balanced
514da82c92fSMauro Carvalho ChehabCPUs in the system. This partition is a set of subsets (represented
515da82c92fSMauro Carvalho Chehabas an array of struct cpumask) of CPUs, pairwise disjoint, that cover
516da82c92fSMauro Carvalho Chehaball the CPUs that must be load balanced.
517da82c92fSMauro Carvalho Chehab
518da82c92fSMauro Carvalho ChehabThe cpuset code builds a new such partition and passes it to the
519da82c92fSMauro Carvalho Chehabscheduler sched domain setup code, to have the sched domains rebuilt
520da82c92fSMauro Carvalho Chehabas necessary, whenever:
521da82c92fSMauro Carvalho Chehab
522da82c92fSMauro Carvalho Chehab - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
523da82c92fSMauro Carvalho Chehab - or CPUs come or go from a cpuset with this flag enabled,
524da82c92fSMauro Carvalho Chehab - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
525da82c92fSMauro Carvalho Chehab   and with this flag enabled changes,
526da82c92fSMauro Carvalho Chehab - or a cpuset with non-empty CPUs and with this flag enabled is removed,
527da82c92fSMauro Carvalho Chehab - or a cpu is offlined/onlined.
528da82c92fSMauro Carvalho Chehab
This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.
532da82c92fSMauro Carvalho Chehab
533da82c92fSMauro Carvalho ChehabThe scheduler remembers the currently active sched domain partitions.
534da82c92fSMauro Carvalho ChehabWhen the scheduler routine partition_sched_domains() is invoked from
535da82c92fSMauro Carvalho Chehabthe cpuset code to update these sched domains, it compares the new
536da82c92fSMauro Carvalho Chehabpartition requested with the current, and updates its sched domains,
537da82c92fSMauro Carvalho Chehabremoving the old and adding the new, for each change.
538da82c92fSMauro Carvalho Chehab
539da82c92fSMauro Carvalho Chehab
1.8 What is sched_relax_domain_level ?
--------------------------------------
542da82c92fSMauro Carvalho Chehab
Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and balancing at the time of certain
scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if task A running on CPU X activates another task B
on the same CPU X, and CPU Y is X's sibling and is idle, then the
scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course, finding movable tasks and/or idle CPUs has a searching cost,
so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the searching range on
events is limited to the socket or node where the CPU is located,
while the load balance on tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is outside its searching range.
As a result, task B on CPU X needs to wait for task A or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.
568da82c92fSMauro Carvalho Chehab
The 'cpuset.sched_relax_domain_level' file allows you to request a change
to this searching range.  The file takes an int value that indicates the
size of the searching range, in levels, approximately as listed below.
The initial value is -1, which indicates that the cpuset has no request.
573da82c92fSMauro Carvalho Chehab
574da82c92fSMauro Carvalho Chehab====== ===========================================================
575da82c92fSMauro Carvalho Chehab  -1   no request. use system default or follow request of others.
576da82c92fSMauro Carvalho Chehab   0   no search.
577da82c92fSMauro Carvalho Chehab   1   search siblings (hyperthreads in a core).
578da82c92fSMauro Carvalho Chehab   2   search cores in a package.
579da82c92fSMauro Carvalho Chehab   3   search cpus in a node [= system wide on non-NUMA system]
580da82c92fSMauro Carvalho Chehab   4   search nodes in a chunk of node [on NUMA system]
581da82c92fSMauro Carvalho Chehab   5   search system wide [on NUMA system]
582da82c92fSMauro Carvalho Chehab====== ===========================================================
583da82c92fSMauro Carvalho Chehab
Not all levels may be present, and the values may change depending on the
system architecture and kernel configuration. Check
/sys/kernel/debug/sched/domains/cpu*/domain*/ for system-specific
details.
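
A minimal sketch for listing those directories follows.  It assumes debugfs
is mounted at /sys/kernel/debug, that the caller may read it (usually root),
and that each domain directory contains a 'name' file; the function name is
made up for this example, and the exact set of files varies with kernel
version.

```shell
# list_sched_domains [BASE]
# Print "cpuN domainM name" for every domain directory found under BASE.
list_sched_domains() {
    base="${1:-/sys/kernel/debug/sched/domains}"
    for d in "$base"/cpu*/domain*/; do
        [ -d "$d" ] || continue          # glob matched nothing
        cpu=$(basename "$(dirname "$d")")
        dom=$(basename "$d")
        name="?"                         # shown when no 'name' file is readable
        [ -r "$d/name" ] && name=$(cat "$d/name")
        echo "$cpu $dom $name"
    done
}
```

The domain index printed for a CPU gives a rough idea of which levels exist
on the running system.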
588*0f1c74beSVitalii Bursov
589da82c92fSMauro Carvalho ChehabThe system default is architecture dependent.  The system default
590da82c92fSMauro Carvalho Chehabcan be changed using the relax_domain_level= boot parameter.
591da82c92fSMauro Carvalho Chehab
This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.
596da82c92fSMauro Carvalho Chehab
597da82c92fSMauro Carvalho ChehabIf multiple cpusets are overlapping and hence they form a single sched
598da82c92fSMauro Carvalho Chehabdomain, the largest value among those is used.  Be careful, if one
599da82c92fSMauro Carvalho Chehabrequests 0 and others are -1 then 0 is used.
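
The "largest value wins" rule can be illustrated with a small helper.  This
is only a model of the rule as stated above, not kernel code, and the
function name is invented for the example.

```shell
# effective_relax_level LEVEL...
# Print the largest of the given per-cpuset levels.  -1 means "no request",
# so it loses to any explicit request, including 0.
effective_relax_level() {
    max=-1
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    echo "$max"
}
```

For instance, ``effective_relax_level -1 0 -1`` prints 0, which is the case
the warning above describes.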
600da82c92fSMauro Carvalho Chehab
601da82c92fSMauro Carvalho ChehabNote that modifying this file will have both good and bad effects,
602da82c92fSMauro Carvalho Chehaband whether it is acceptable or not depends on your situation.
603da82c92fSMauro Carvalho ChehabDon't modify this file if you are not sure.
604da82c92fSMauro Carvalho Chehab
605da82c92fSMauro Carvalho ChehabIf your situation is:
606da82c92fSMauro Carvalho Chehab
 - The migration costs between CPUs can be assumed to be considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The searching cost has no impact (for you), or you can make
   the searching cost small enough by managing cpusets to be compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'sched_relax_domain_level' would benefit you.
614da82c92fSMauro Carvalho Chehab
615da82c92fSMauro Carvalho Chehab
1.9 How do I use cpusets ?
--------------------------
618da82c92fSMauro Carvalho Chehab
619da82c92fSMauro Carvalho ChehabIn order to minimize the impact of cpusets on critical kernel
620da82c92fSMauro Carvalho Chehabcode, such as the scheduler, and due to the fact that the kernel
621da82c92fSMauro Carvalho Chehabdoes not support one task updating the memory placement of another
622da82c92fSMauro Carvalho Chehabtask directly, the impact on a task of changing its cpuset CPU
623da82c92fSMauro Carvalho Chehabor Memory Node placement, or of changing to which cpuset a task
624da82c92fSMauro Carvalho Chehabis attached, is subtle.
625da82c92fSMauro Carvalho Chehab
626da82c92fSMauro Carvalho ChehabIf a cpuset has its Memory Nodes modified, then for each task attached
627da82c92fSMauro Carvalho Chehabto that cpuset, the next time that the kernel attempts to allocate
628da82c92fSMauro Carvalho Chehaba page of memory for that task, the kernel will notice the change
629da82c92fSMauro Carvalho Chehabin the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
631da82c92fSMauro Carvalho Chehabmempolicy MPOL_BIND, and the nodes to which it was bound overlap with
632da82c92fSMauro Carvalho Chehabits new cpuset, then the task will continue to use whatever subset
633da82c92fSMauro Carvalho Chehabof MPOL_BIND nodes are still allowed in the new cpuset.  If the task
634da82c92fSMauro Carvalho Chehabwas using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
635da82c92fSMauro Carvalho Chehabin the new cpuset, then the task will be essentially treated as if it
636da82c92fSMauro Carvalho Chehabwas MPOL_BIND bound to the new cpuset (even though its NUMA placement,
637da82c92fSMauro Carvalho Chehabas queried by get_mempolicy(), doesn't change).  If a task is moved
638da82c92fSMauro Carvalho Chehabfrom one cpuset to another, then the kernel will adjust the task's
639da82c92fSMauro Carvalho Chehabmemory placement, as above, the next time that the kernel attempts
640da82c92fSMauro Carvalho Chehabto allocate a page of memory for that task.
641da82c92fSMauro Carvalho Chehab
642da82c92fSMauro Carvalho ChehabIf a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
643da82c92fSMauro Carvalho Chehabwill have its allowed CPU placement changed immediately.  Similarly,
644da82c92fSMauro Carvalho Chehabif a task's pid is written to another cpuset's 'tasks' file, then its
645da82c92fSMauro Carvalho Chehaballowed CPU placement is changed immediately.  If such a task had been
646da82c92fSMauro Carvalho Chehabbound to some subset of its cpuset using the sched_setaffinity() call,
647da82c92fSMauro Carvalho Chehabthe task will be allowed to run on any CPU allowed in its new cpuset,
648da82c92fSMauro Carvalho Chehabnegating the effect of the prior sched_setaffinity() call.
649da82c92fSMauro Carvalho Chehab
650da82c92fSMauro Carvalho ChehabIn summary, the memory placement of a task whose cpuset is changed is
651da82c92fSMauro Carvalho Chehabupdated by the kernel, on the next allocation of a page for that task,
652da82c92fSMauro Carvalho Chehaband the processor placement is updated immediately.
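
One way to observe the immediate change in processor placement is to read
the Cpus_allowed_list field of /proc/<pid>/status before and after the
cpuset is modified.  The helper name below is an assumption made for this
sketch.

```shell
# allowed_cpus [STATUS_FILE]
# Print the Cpus_allowed_list field from a /proc/<pid>/status style file.
# With no argument, the status of the calling process is used.
allowed_cpus() {
    status="${1:-/proc/self/status}"
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status"
}
```

For example, ``allowed_cpus /proc/1234/status`` (1234 being a placeholder
pid) can be run before and after writing to that task's 'cpuset.cpus'.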
653da82c92fSMauro Carvalho Chehab
Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
a task is attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example, if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.
665da82c92fSMauro Carvalho Chehab
Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
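
A short sketch of that sequence: enable migration first, then change the
node list.  The function name and its arguments are assumptions for
illustration only.

```shell
# migrate_mems DIR NODES
# Turn on page migration for the cpuset at DIR, then switch it to NODES
# so that already allocated pages follow the new 'cpuset.mems' setting.
migrate_mems() {
    echo 1 > "$1/cpuset.memory_migrate" || return 1
    echo "$2" > "$1/cpuset.mems"
}
```

For example, ``migrate_mems /sys/fs/cgroup/cpuset/Charlie 2`` would move the
cpuset named "Charlie" to Memory Node 2, assuming that node exists.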
672da82c92fSMauro Carvalho Chehab
673da82c92fSMauro Carvalho ChehabThere is an exception to the above.  If hotplug functionality is used
674da82c92fSMauro Carvalho Chehabto remove all the CPUs that are currently assigned to a cpuset,
675da82c92fSMauro Carvalho Chehabthen all the tasks in that cpuset will be moved to the nearest ancestor
676da82c92fSMauro Carvalho Chehabwith non-empty cpus.  But the moving of some (or all) tasks might fail if
677da82c92fSMauro Carvalho Chehabcpuset is bound with another cgroup subsystem which has some restrictions
678da82c92fSMauro Carvalho Chehabon task attaching.  In this failing case, those tasks will stay
679da82c92fSMauro Carvalho Chehabin the original cpuset, and the kernel will automatically update
680da82c92fSMauro Carvalho Chehabtheir cpus_allowed to allow all online CPUs.  When memory hotplug
681da82c92fSMauro Carvalho Chehabfunctionality for removing Memory Nodes is available, a similar exception
682da82c92fSMauro Carvalho Chehabis expected to apply there as well.  In general, the kernel prefers to
683da82c92fSMauro Carvalho Chehabviolate cpuset placement, over starving a task that has had all
684da82c92fSMauro Carvalho Chehabits allowed CPUs or Memory Nodes taken offline.
685da82c92fSMauro Carvalho Chehab
686da82c92fSMauro Carvalho ChehabThere is a second exception to the above.  GFP_ATOMIC requests are
687da82c92fSMauro Carvalho Chehabkernel internal allocations that must be satisfied, immediately.
688da82c92fSMauro Carvalho ChehabThe kernel may drop some request, in rare cases even panic, if a
689da82c92fSMauro Carvalho ChehabGFP_ATOMIC alloc fails.  If the request cannot be satisfied within
690da82c92fSMauro Carvalho Chehabthe current task's cpuset, then we relax the cpuset, and look for
691da82c92fSMauro Carvalho Chehabmemory anywhere we can find it.  It's better to violate the cpuset
692da82c92fSMauro Carvalho Chehabthan stress the kernel.
693da82c92fSMauro Carvalho Chehab
694da82c92fSMauro Carvalho ChehabTo start a new job that is to be contained within a cpuset, the steps are:
695da82c92fSMauro Carvalho Chehab
696da82c92fSMauro Carvalho Chehab 1) mkdir /sys/fs/cgroup/cpuset
697da82c92fSMauro Carvalho Chehab 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
698da82c92fSMauro Carvalho Chehab 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
699da82c92fSMauro Carvalho Chehab    the /sys/fs/cgroup/cpuset virtual file system.
700da82c92fSMauro Carvalho Chehab 4) Start a task that will be the "founding father" of the new job.
701da82c92fSMauro Carvalho Chehab 5) Attach that task to the new cpuset by writing its pid to the
702da82c92fSMauro Carvalho Chehab    /sys/fs/cgroup/cpuset tasks file for that cpuset.
703da82c92fSMauro Carvalho Chehab 6) fork, exec or clone the job tasks from this founding father task.
704da82c92fSMauro Carvalho Chehab
For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::
708da82c92fSMauro Carvalho Chehab
709da82c92fSMauro Carvalho Chehab  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
710da82c92fSMauro Carvalho Chehab  cd /sys/fs/cgroup/cpuset
711da82c92fSMauro Carvalho Chehab  mkdir Charlie
712da82c92fSMauro Carvalho Chehab  cd Charlie
713da82c92fSMauro Carvalho Chehab  /bin/echo 2-3 > cpuset.cpus
714da82c92fSMauro Carvalho Chehab  /bin/echo 1 > cpuset.mems
715da82c92fSMauro Carvalho Chehab  /bin/echo $$ > tasks
716da82c92fSMauro Carvalho Chehab  sh
717da82c92fSMauro Carvalho Chehab  # The subshell 'sh' is now running in cpuset Charlie
718da82c92fSMauro Carvalho Chehab  # The next line should display '/Charlie'
719da82c92fSMauro Carvalho Chehab  cat /proc/self/cpuset
720da82c92fSMauro Carvalho Chehab
721da82c92fSMauro Carvalho ChehabThere are ways to query or modify cpusets:
722da82c92fSMauro Carvalho Chehab
723da82c92fSMauro Carvalho Chehab - via the cpuset file system directly, using the various cd, mkdir, echo,
724da82c92fSMauro Carvalho Chehab   cat, rmdir commands from the shell, or their equivalent from C.
725da82c92fSMauro Carvalho Chehab - via the C library libcpuset.
726da82c92fSMauro Carvalho Chehab - via the C library libcgroup.
7279403d9cbSKamalesh Babulal   (https://github.com/libcgroup/libcgroup/)
728da82c92fSMauro Carvalho Chehab - via the python application cset.
729da82c92fSMauro Carvalho Chehab   (http://code.google.com/p/cpuset/)
730da82c92fSMauro Carvalho Chehab
731da82c92fSMauro Carvalho ChehabThe sched_setaffinity calls can also be done at the shell prompt using
732da82c92fSMauro Carvalho ChehabSGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
733da82c92fSMauro Carvalho Chehabcalls can be done at the shell prompt using the numactl command
734da82c92fSMauro Carvalho Chehab(part of Andi Kleen's numa package).
735da82c92fSMauro Carvalho Chehab
2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
747da82c92fSMauro Carvalho Chehab
748da82c92fSMauro Carvalho ChehabThen under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
749da82c92fSMauro Carvalho Chehabtree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
750da82c92fSMauro Carvalho Chehabis the cpuset that holds the whole system.
751da82c92fSMauro Carvalho Chehab
752da82c92fSMauro Carvalho ChehabIf you want to create a new cpuset under /sys/fs/cgroup/cpuset::
753da82c92fSMauro Carvalho Chehab
754da82c92fSMauro Carvalho Chehab  # cd /sys/fs/cgroup/cpuset
755da82c92fSMauro Carvalho Chehab  # mkdir my_cpuset
756da82c92fSMauro Carvalho Chehab
757da82c92fSMauro Carvalho ChehabNow you want to do something with this cpuset::
758da82c92fSMauro Carvalho Chehab
759da82c92fSMauro Carvalho Chehab  # cd my_cpuset
760da82c92fSMauro Carvalho Chehab
761da82c92fSMauro Carvalho ChehabIn this directory you can find several files::
762da82c92fSMauro Carvalho Chehab
763da82c92fSMauro Carvalho Chehab  # ls
764da82c92fSMauro Carvalho Chehab  cgroup.clone_children  cpuset.memory_pressure
765da82c92fSMauro Carvalho Chehab  cgroup.event_control   cpuset.memory_spread_page
766da82c92fSMauro Carvalho Chehab  cgroup.procs           cpuset.memory_spread_slab
767da82c92fSMauro Carvalho Chehab  cpuset.cpu_exclusive   cpuset.mems
768da82c92fSMauro Carvalho Chehab  cpuset.cpus            cpuset.sched_load_balance
769da82c92fSMauro Carvalho Chehab  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
770da82c92fSMauro Carvalho Chehab  cpuset.mem_hardwall    notify_on_release
771da82c92fSMauro Carvalho Chehab  cpuset.memory_migrate  tasks
772da82c92fSMauro Carvalho Chehab
773da82c92fSMauro Carvalho ChehabReading them will give you information about the state of this cpuset:
774da82c92fSMauro Carvalho Chehabthe CPUs and Memory Nodes it can use, the processes that are using
775da82c92fSMauro Carvalho Chehabit, its properties.  By writing to these files you can manipulate
776da82c92fSMauro Carvalho Chehabthe cpuset.
777da82c92fSMauro Carvalho Chehab
778da82c92fSMauro Carvalho ChehabSet some flags::
779da82c92fSMauro Carvalho Chehab
780da82c92fSMauro Carvalho Chehab  # /bin/echo 1 > cpuset.cpu_exclusive
781da82c92fSMauro Carvalho Chehab
782da82c92fSMauro Carvalho ChehabAdd some cpus::
783da82c92fSMauro Carvalho Chehab
784da82c92fSMauro Carvalho Chehab  # /bin/echo 0-7 > cpuset.cpus
785da82c92fSMauro Carvalho Chehab
786da82c92fSMauro Carvalho ChehabAdd some mems::
787da82c92fSMauro Carvalho Chehab
788da82c92fSMauro Carvalho Chehab  # /bin/echo 0-7 > cpuset.mems
789da82c92fSMauro Carvalho Chehab
790da82c92fSMauro Carvalho ChehabNow attach your shell to this cpuset::
791da82c92fSMauro Carvalho Chehab
792da82c92fSMauro Carvalho Chehab  # /bin/echo $$ > tasks
793da82c92fSMauro Carvalho Chehab
794da82c92fSMauro Carvalho ChehabYou can also create cpusets inside your cpuset by using mkdir in this
795da82c92fSMauro Carvalho Chehabdirectory::
796da82c92fSMauro Carvalho Chehab
797da82c92fSMauro Carvalho Chehab  # mkdir my_sub_cs
798da82c92fSMauro Carvalho Chehab
799da82c92fSMauro Carvalho ChehabTo remove a cpuset, just use rmdir::
800da82c92fSMauro Carvalho Chehab
801da82c92fSMauro Carvalho Chehab  # rmdir my_sub_cs
802da82c92fSMauro Carvalho Chehab
803da82c92fSMauro Carvalho ChehabThis will fail if the cpuset is in use (has cpusets inside, or has
804da82c92fSMauro Carvalho Chehabprocesses attached).
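
One possible way to empty a cpuset before removing it is to move each of its
tasks to the parent cpuset, one pid per write.  This is a sketch under the
assumption that the parent accepts the tasks; the function name is made up,
and a task that exits in the meantime makes the write fail.

```shell
# drain_cpuset DIR
# Move every task listed in DIR/tasks to the parent cpuset.
drain_cpuset() {
    parent=$(dirname "$1")
    while read -r pid; do
        echo "$pid" > "$parent/tasks" || return 1
    done < "$1/tasks"
}
```

After ``drain_cpuset my_sub_cs`` succeeds, and provided my_sub_cs has no
child cpusets, ``rmdir my_sub_cs`` should no longer fail for being in use.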
805da82c92fSMauro Carvalho Chehab
806da82c92fSMauro Carvalho ChehabNote that for legacy reasons, the "cpuset" filesystem exists as a
807da82c92fSMauro Carvalho Chehabwrapper around the cgroup filesystem.
808da82c92fSMauro Carvalho Chehab
809da82c92fSMauro Carvalho ChehabThe command::
810da82c92fSMauro Carvalho Chehab
811da82c92fSMauro Carvalho Chehab  mount -t cpuset X /sys/fs/cgroup/cpuset
812da82c92fSMauro Carvalho Chehab
813da82c92fSMauro Carvalho Chehabis equivalent to::
814da82c92fSMauro Carvalho Chehab
815da82c92fSMauro Carvalho Chehab  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
816da82c92fSMauro Carvalho Chehab  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
817da82c92fSMauro Carvalho Chehab
2.2 Adding/removing cpus
------------------------
820da82c92fSMauro Carvalho Chehab
821da82c92fSMauro Carvalho ChehabThis is the syntax to use when writing in the cpus or mems files
822da82c92fSMauro Carvalho Chehabin cpuset directories::
823da82c92fSMauro Carvalho Chehab
824da82c92fSMauro Carvalho Chehab  # /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
825da82c92fSMauro Carvalho Chehab  # /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4
826da82c92fSMauro Carvalho Chehab
827da82c92fSMauro Carvalho ChehabTo add a CPU to a cpuset, write the new list of CPUs including the
828da82c92fSMauro Carvalho ChehabCPU to be added. To add 6 to the above cpuset::
829da82c92fSMauro Carvalho Chehab
830da82c92fSMauro Carvalho Chehab  # /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6
831da82c92fSMauro Carvalho Chehab
832da82c92fSMauro Carvalho ChehabSimilarly to remove a CPU from a cpuset, write the new list of CPUs
833da82c92fSMauro Carvalho Chehabwithout the CPU to be removed.
834da82c92fSMauro Carvalho Chehab
835da82c92fSMauro Carvalho ChehabTo remove all the CPUs::
836da82c92fSMauro Carvalho Chehab
837da82c92fSMauro Carvalho Chehab  # /bin/echo "" > cpuset.cpus		-> clear cpus list
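
Because the whole list is rewritten on every change, removing one CPU means
computing the new list first. The sketch below shows one way to do that in
plain sh. The helper names ``cpulist_expand`` and ``cpulist_remove`` are
hypothetical, and the code assumes well-formed input made only of
comma-separated numbers and ``lo-hi`` ranges.

```shell
#!/bin/sh
# Hypothetical helper: expand a list such as "1-4,6" into "1 2 3 4 6".
cpulist_expand() {
    out=
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            '') continue ;;
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *) out="$out $part" ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "${out# }"
}

# Hypothetical helper: print list $1 as a comma-separated list without CPU $2.
cpulist_remove() {
    result=
    for cpu in $(cpulist_expand "$1"); do
        [ "$cpu" = "$2" ] && continue
        result="${result:+$result,}$cpu"
    done
    printf '%s\n' "$result"
}

cpulist_remove "1-4,6" 3    # prints 1,2,4,6
```

The result can then be written back in a single write, for example
``/bin/echo "$(cpulist_remove "$(cat cpuset.cpus)" 3)" > cpuset.cpus``.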

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive	-> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive	-> unset flag 'cpuset.cpu_exclusive'
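
In a script, the same write can be wrapped so that a bad value or a failed
write is reported instead of being silently ignored. This is only a sketch:
``set_cpuset_flag`` is a hypothetical helper name, and it assumes the flag is
one of the boolean files that accept 0 or 1.

```shell
#!/bin/sh
# Hypothetical helper: set_cpuset_flag DIR FLAG VALUE
# Writes VALUE (0 or 1) to DIR/FLAG and reports any failure on stderr.
set_cpuset_flag() {
    dir=$1
    flag=$2
    value=$3
    case $value in
        0|1) ;;
        *)
            echo "set_cpuset_flag: value must be 0 or 1" >&2
            return 2
            ;;
    esac
    # /bin/echo exits non-zero if the write is rejected.
    if ! /bin/echo "$value" > "$dir/$flag"; then
        echo "set_cpuset_flag: failed to write $value to $dir/$flag" >&2
        return 1
    fi
}
```

For example, ``set_cpuset_flag /sys/fs/cgroup/cpuset/my_sub_cs cpuset.cpu_exclusive 1``,
where ``my_sub_cs`` is the cpuset created in the earlier example.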

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
	...
  # /bin/echo PIDn > tasks
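
The one-PID-per-write rule maps naturally onto a loop. The following sketch
uses a hypothetical helper named ``attach_tasks``, which takes the path of a
tasks file followed by any number of PIDs and issues one /bin/echo per PID.

```shell
#!/bin/sh
# Hypothetical helper: attach_tasks TASKS_FILE PID...
# Writes each PID with its own /bin/echo so that every attach gets its
# own write and its own exit status.
attach_tasks() {
    tasks_file=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks_file"; then
            echo "attach_tasks: could not attach PID $pid" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every PID was attached.
    [ "$failed" -eq 0 ]
}
```

For example, ``attach_tasks my_sub_cs/tasks 1234 1235``, where the PIDs are
placeholders. A failure for one PID does not stop the loop, so the remaining
PIDs are still attempted.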


3. Questions
============

Q:
   What's up with this '/bin/echo'?

A:
   bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first one on the line actually gets
   attached!

A:
   We can only return one error code per call to write(), so you should
   put only ONE PID in each write.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset