xref: /linux/Documentation/admin-guide/cgroup-v1/cpusets.rst (revision da82c92f1150f66afabf78d2c85ef9ac18dc6d38)
=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be moved
when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.
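
For example, a typical session to mount the cpuset file system might
look like the following (a sketch: mounting requires root privileges,
and the mount point /dev/cpuset is only a convention)::

  mkdir -p /dev/cpuset                            # conventional mount point
  mount -t cgroup -o cpuset cpuset /dev/cpuset    # mount the cpuset hierarchy
  ls /dev/cpuset                                  # root cpuset's control files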

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63
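
These fields can be read for any task; for instance, a task can
inspect its own placement with (output will vary by system)::

  grep -E '^(Cpus|Mems)_allowed' /proc/self/status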

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.
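
As a sketch, assuming the hierarchy is mounted at /dev/cpuset (the
cpuset name "my_cpuset" and the CPU and node numbers here are
arbitrary)::

  cd /dev/cpuset
  mkdir my_cpuset                    # create a child cpuset
  echo 2-3 > my_cpuset/cpuset.cpus   # allow CPUs 2 and 3
  echo 1   > my_cpuset/cpuset.mems   # allow Memory Node 1
  echo $$  > my_cpuset/tasks         # attach the current shell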

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
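
The configuration described above might be sketched as follows (the
cpuset names "jobs", "job1" and "job2" and the resource numbers are
illustrative only; assumes the hierarchy is mounted at /dev/cpuset)::

  cd /dev/cpuset
  mkdir jobs                              # enclosing cpuset for all jobs
  echo 0-7 > jobs/cpuset.cpus
  echo 0-1 > jobs/cpuset.mems
  echo 1 > jobs/cpuset.mem_exclusive      # hardwall kernel allocations here
  mkdir jobs/job1 jobs/job2               # non-mem_exclusive children
  echo 0-3 > jobs/job1/cpuset.cpus
  echo 0   > jobs/job1/cpuset.mems
  echo 4-7 > jobs/job2/cpuset.cpus
  echo 1   > jobs/job2/cpuset.mems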


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
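
For example, a batch manager could sample the metric like this (a
sketch; "my_cpuset" is an arbitrary name, and a reading of 1000
corresponds to one attempted reclaim per second)::

  # enable metric computation (root cpuset only; off by default)
  echo 1 > /dev/cpuset/cpuset.memory_pressure_enabled
  # sample the running average for one cpuset
  cat /dev/cpuset/my_cpuset/cpuset.memory_pressure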


1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in-kernel
data structures.  They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as those for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect the anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.
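
For example, to enable page cache spreading for one cpuset (again a
sketch, assuming a /dev/cpuset mount and an arbitrary cpuset name)::

  echo 1 > /dev/cpuset/my_cpuset/cpuset.memory_spread_page
  cat /dev/cpuset/my_cpuset/cpuset.memory_spread_page    # now reports "1"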

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.
375*da82c92fSMauro Carvalho Chehab
376*da82c92fSMauro Carvalho Chehab1.7 What is sched_load_balance ?
377*da82c92fSMauro Carvalho Chehab--------------------------------
378*da82c92fSMauro Carvalho Chehab
379*da82c92fSMauro Carvalho ChehabThe kernel scheduler (kernel/sched/core.c) automatically load balances
380*da82c92fSMauro Carvalho Chehabtasks.  If one CPU is underutilized, kernel code running on that
381*da82c92fSMauro Carvalho ChehabCPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the
smaller, child cpusets should have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.
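
The pattern just described can be sketched from the shell.  This is a
minimal sketch, not a recommended configuration: the child cpuset name
"batch" and the CPU/node numbers are hypothetical, and it assumes the
cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset and runs as root:

```shell
# Sketch: disable load balancing in the top cpuset, then enable it
# only inside a hypothetical child cpuset "batch" holding batch jobs.
CS=/sys/fs/cgroup/cpuset
if [ -w "$CS/cpuset.sched_load_balance" ]; then
    # Top cpuset: no system-wide sched domain any more.
    /bin/echo 0 > "$CS/cpuset.sched_load_balance"
    # Child cpuset: its CPUs form their own sched domain.
    mkdir -p "$CS/batch"
    /bin/echo 4-7 > "$CS/batch/cpuset.cpus"
    /bin/echo 0   > "$CS/batch/cpuset.mems"
    /bin/echo 1   > "$CS/batch/cpuset.sched_load_balance"
fi
```

After this, load balancing happens only among CPUs 4-7; tasks left in
the top cpuset are no longer balanced across all CPUs, which is why the
text above warns against leaving busy unpinned tasks there.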

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.  It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system.  This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty
   CPUs and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of certain scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start
on CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, it tries to pull
extra tasks from other busy CPUs, helping them before it goes
idle.

Of course it takes some search cost to find movable tasks and/or idle
CPUs, so the scheduler might not search all CPUs in the domain every
time.  In fact, on some architectures, the search range on these
events is limited to the same socket or node where the CPU is located,
while the load balance on tick searches all of them.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't
migrate the woken task B from X to Z since Z is outside its search range.
As a result, task B on CPU X needs to wait for task A or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to change this
search range as you like.  It takes an integer value indicating the
size of the search range in levels, ideally as follows; otherwise the
initial value of -1 indicates that the cpuset has no request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore, if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect, since
there is no sched domain belonging to that cpuset.

If multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among them is used.  Be careful: if one
requests 0 and the others are -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

Increasing 'cpuset.sched_relax_domain_level' would benefit you if:

 - the migration cost between CPUs can be assumed to be considerably
   small (for you), due to your application's behavior or special
   hardware support for CPU caches, etc.;
 - the search cost has no impact (for you), or you can make the
   search cost small enough, e.g. by keeping your cpusets compact;
 - low latency is required, even if it sacrifices cache hit rate, etc.
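
For instance, a request for a system-wide wakeup search range can be
written from the shell.  The cpuset name "lowlat" below is hypothetical,
and the sketch assumes the cpuset hierarchy is mounted at
/sys/fs/cgroup/cpuset and runs as root:

```shell
# Sketch: request level 5 ("search system wide", per the table above)
# for a hypothetical cpuset "lowlat".
CS=/sys/fs/cgroup/cpuset/lowlat
LEVEL=5
if [ -w "$CS/cpuset.sched_relax_domain_level" ]; then
    /bin/echo "$LEVEL" > "$CS/cpuset.sched_relax_domain_level"
    # Read the value back to confirm the request took effect.
    cat "$CS/cpuset.sched_relax_domain_level"
fi
```

Remember that this only matters if 'cpuset.sched_load_balance' leaves
the cpuset inside some sched domain, and that overlapping cpusets share
the largest requested value.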


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately.  If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
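
The immediate CPU-placement update can be observed from the shell.
This is a sketch only: the cpuset name "Charlie" is hypothetical, and
it assumes a mounted cpuset hierarchy, the taskset utility, and root
privileges:

```shell
# Sketch: narrow the shell's affinity with sched_setaffinity (via
# taskset), then attach it to a hypothetical cpuset "Charlie"; per
# the text above, the attach negates the prior affinity setting.
CS=/sys/fs/cgroup/cpuset/Charlie
if [ -w "$CS/tasks" ] && command -v taskset >/dev/null 2>&1; then
    taskset -pc 0 $$            # pin this shell to CPU 0
    /bin/echo $$ > "$CS/tasks"  # attach: affinity is reset
    taskset -p $$               # affinity now follows Charlie's cpus
fi
```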

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset.  The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
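
A sketch of enabling page migration before rebinding a cpuset's memory
nodes follows; the cpuset name "dbjob" and the node number are
hypothetical, and it assumes a NUMA system with the cpuset hierarchy
mounted at /sys/fs/cgroup/cpuset, run as root:

```shell
# Sketch: turn on memory_migrate for a hypothetical cpuset "dbjob",
# then change its mems; pages on the old nodes are moved to the new
# nodes, preserving relative placement where possible.
CS=/sys/fs/cgroup/cpuset/dbjob
NEW_MEMS=1
if [ -w "$CS/cpuset.memory_migrate" ]; then
    /bin/echo 1 > "$CS/cpuset.memory_migrate"
    /bin/echo "$NEW_MEMS" > "$CS/cpuset.mems"   # triggers migration
fi
```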

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail if
the cpuset is bound to another cgroup subsystem which has some restrictions
on task attaching.  In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some requests, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are several ways to query or modify cpusets:

 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties.  By writing to these files you can manipulate
the cpuset.

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
799*da82c92fSMauro Carvalho Chehab
2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing to the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add CPU 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6

Similarly, to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus		-> clear cpus list

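These writes can be wrapped in a small helper that reports failure. A
minimal sketch (``set_cpus`` and its arguments are illustrative, not part
of the kernel interface); /bin/echo rather than the shell builtin is used
so a failed write() shows up in the exit status, as explained in
section 3:

```shell
#!/bin/sh
# Sketch: write a CPU list to a cpuset directory and report failure.
set_cpus() {
    # $1 = cpuset directory, $2 = CPU list ("1-4", "1,2,3,4", or "" to clear)
    if /bin/echo "$2" > "$1/cpuset.cpus"; then
        echo "cpus list set to '$2'"
    else
        echo "failed to set cpus list" >&2
        return 1
    fi
}
```

For example, ``set_cpus /sys/fs/cgroup/cpuset/Charlie 1-4,6`` (the cpuset
name is illustrative) sets the list to CPUs 1,2,3,4,6 or prints an error.
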
2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive 	-> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive 	-> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
	...
  # /bin/echo PIDn > tasks


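Such a sequence is easy to script as a loop, one write() per PID. A
minimal sketch (``attach_all`` and its arguments are illustrative):

```shell
#!/bin/sh
# Sketch: attach several PIDs to a cpuset, one write() per PID, since
# the kernel accepts only one PID per write to the tasks file.
attach_all() {
    # $1 = tasks file of the target cpuset; remaining args = PIDs
    tasks="$1"; shift
    for pid in "$@"; do
        /bin/echo "$pid" > "$tasks" || echo "failed to attach $pid" >&2
    done
}
```

For example, ``attach_all /sys/fs/cgroup/cpuset/Charlie/tasks 1234 1235``
(the cpuset name and PIDs are illustrative) attaches each PID in turn and
reports any that fail, e.g. because the process has already exited.
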
3. Questions
============

Q:
   What's up with this '/bin/echo'?

A:
   bash's builtin 'echo' command does not check its calls to write()
   for errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first PID on the line actually
   gets attached!

A:
   We can only return one error code per call to write(), so you
   should write only ONE PID per call.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset