=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.
The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modestly sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.
1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page reclaim to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.
No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".
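The cpuset.cpus and cpuset.mems files, like the Cpus_allowed_list lines
in the /proc example above, use the kernel's comma-separated list format
of decimal numbers and ranges.  As an illustration (this is a user-space
sketch, not part of the kernel interface), such a string can be expanded
like this:

```python
def parse_cpulist(s):
    """Expand a kernel list-format string such as "0-4,9,11-12"
    into a sorted list of integers."""
    result = set()
    s = s.strip()
    if not s:
        return []
    for chunk in s.split(","):
        if "-" in chunk:
            lo, hi = chunk.split("-")
            result.update(range(int(lo), int(hi) + 1))
        else:
            result.add(int(chunk))
    return sorted(result)

print(parse_cpulist("0-4,9,11-12"))  # [0, 1, 2, 3, 4, 9, 11, 12]
```

The same format is accepted when writing to cpuset.cpus and cpuset.mems,
so the inverse direction (collapsing a set of ids into ranges) is equally
mechanical.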
The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.
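Because the hierarchy rules only constrain a cpuset against its parent
and its siblings, the exclusive guarantee can be checked locally, without
a global scan.  A minimal sketch of such a local check (the Cpuset class
and may_attach helper here are hypothetical illustrations, not kernel
code; only CPUs are modelled, and one plausible reading of the sibling
overlap rule is assumed):

```python
class Cpuset:
    """Toy model of a cpuset node: a CPU set, an exclusive flag,
    and a position in the hierarchy."""
    def __init__(self, cpus, exclusive=False, parent=None):
        self.cpus = set(cpus)
        self.exclusive = exclusive
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def may_attach(parent, cpus, exclusive):
    """Would a new child cpuset with these CPUs satisfy the rules?"""
    if not cpus <= parent.cpus:             # must be a subset of the parent's
        return False
    if exclusive and not parent.exclusive:  # exclusive requires exclusive parent
        return False
    for sibling in parent.children:         # exclusivity forbids sibling overlap
        if (exclusive or sibling.exclusive) and cpus & sibling.cpus:
            return False
    return True

root = Cpuset(range(8), exclusive=True)
a = Cpuset({0, 1}, exclusive=True, parent=root)
print(may_attach(root, {1, 2}, exclusive=False))  # False: overlaps exclusive sibling
print(may_attach(root, {2, 3}, exclusive=True))   # True
```

Note how only the parent and its immediate children are consulted: the
subset rule guarantees that anything valid locally cannot conflict with
cpusets elsewhere in the tree.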
This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.
Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
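As an illustration of how such a filter behaves, here is a user-space
sketch of an exponentially decaying event-rate meter with a 10 second
half-life.  This is a simplified model for intuition only, not the
kernel's actual filter implementation, and the normalization chosen here
is arbitrary:

```python
class RateMeter:
    """Exponentially decaying event-rate meter: a toy model of a
    per-cpuset running average of reclaim events."""
    HALF_LIFE = 10.0  # seconds

    def __init__(self):
        self.value = 0.0   # decayed event rate
        self.time = 0.0    # timestamp of last update

    def _decay(self, now):
        elapsed = now - self.time
        self.value *= 0.5 ** (elapsed / self.HALF_LIFE)
        self.time = now

    def event(self, now):
        """Record one event, e.g. one direct reclaim attempt."""
        self._decay(now)
        self.value += 1.0 / self.HALF_LIFE  # arbitrary per-event weight

    def read(self, now):
        """Return the current rate, scaled by 1000 like the cpuset file."""
        self._decay(now)
        return int(self.value * 1000)

m = RateMeter()
for t in range(100):          # one event per second for 100 seconds
    m.event(float(t))
busy = m.read(100.0)
idle = m.read(130.0)          # three half-lives later: about 1/8 of busy
print(busy, idle)
```

The key property, as described above, is that a single read yields a
meaningful pressure figure: a recently busy cpuset reads high, and the
reading decays smoothly toward zero once reclaim stops.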
A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'cpuset.memory_spread_page'
and 'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as those for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.
The setting of these flags does not affect the anonymous data segment
or stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing task's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.
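The rotor behaviour can be sketched in user space as follows.  This is an
illustrative model of the round-robin selection described above, not the
kernel's code; the class name and the example node sets are made up:

```python
class SpreadRotor:
    """Toy per-task rotor cycling through a task's allowed memory
    nodes, modelled on cpuset_mem_spread_rotor as described above."""

    def __init__(self, mems_allowed):
        self.mems_allowed = sorted(mems_allowed)  # the task's allowed nodes
        self.rotor = 0

    def next_node(self):
        """Return the next node to prefer for an allocation."""
        node = self.mems_allowed[self.rotor % len(self.mems_allowed)]
        self.rotor += 1
        return node

r = SpreadRotor({0, 1, 3})
print([r.next_node() for _ in range(5)])  # [0, 1, 3, 0, 1]
```

Successive allocations thus walk the allowed nodes in order and wrap
around, which is exactly the interleave placement the section describes.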
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that must be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.

Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument.  However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets should have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.
We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.
It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system.  This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
   and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
setup - one sched domain for each element (struct cpumask) in the
partition.
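The merge rule behind this partition (overlapping load-balanced cpusets must
land in the same sched domain) can be sketched in user space as follows.
This is a simplified illustrative model, not the kernel's actual
implementation; the function name and input representation are inventions
of this sketch.

```python
def build_partition(cpusets):
    """Sketch of the partition rule: each cpuset with
    sched_load_balance enabled contributes its CPUs, and any
    contributions that overlap are merged into one sched domain,
    so the result is a set of pairwise-disjoint CPU masks."""
    domains = []
    for cpus, balanced in cpusets:
        if not balanced or not cpus:
            continue                # contributes nothing
        merged = set(cpus)
        keep = []
        for d in domains:
            if d & merged:          # overlap -> same sched domain
                merged |= d
            else:
                keep.append(d)
        keep.append(merged)
        domains = keep
    return domains

# Two overlapping balanced cpusets merge; a disjoint one stays
# separate; one with the flag disabled is ignored.
parts = build_partition([({0, 1}, True), ({1, 2}, True),
                         ({4, 5}, True), ({6}, False)])
```

With the inputs shown, `parts` contains the two disjoint domains
`{0, 1, 2}` and `{4, 5}`; CPU 6 is in no domain and would not be load
balanced, matching the semantics described above.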

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: in
the periodic load balance on each tick, and in response to certain
scheduling events.

When a task is woken up, the scheduler tries to move it to an idle
CPU.  For example, if task A running on CPU X activates another task B
on the same CPU X, and CPU Y, a sibling of X, is idle, then the
scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, it tries to pull
extra tasks from other busy CPUs before it goes idle.

Of course it takes some search cost to find movable tasks and/or
idle CPUs, so the scheduler might not search all CPUs in the domain
on every event.  In fact, on some architectures, the search range on
these events is limited to the same socket or node as the waking CPU,
while the load balance on tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't
migrate woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A or for the load
balance on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request changing
this search range as you like.  This file takes an int value which
ideally indicates the size of the search range in levels, as follows;
otherwise it holds the initial value -1, indicating the cpuset has no
request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the
cpuset belongs.  Therefore if the flag 'cpuset.sched_load_balance' of
a cpuset is disabled, then 'cpuset.sched_relax_domain_level' has no
effect since there is no sched domain belonging to the cpuset.

If multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among those is used.  Be careful: if one
cpuset requests 0 and the others are -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
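How the per-cpuset requests combine within one sched domain can be
sketched as below.  This is a purely illustrative user-space model; the
function name and the `system_default` parameter are assumptions of this
sketch, not kernel interfaces.

```python
def effective_relax_level(requests, system_default=1):
    """Illustrative model of combining sched_relax_domain_level
    requests from cpusets sharing one sched domain: the largest
    value wins, and -1 means "no request".  Note that a request
    of 0 ("no search") still beats -1, so a single cpuset asking
    for 0 overrides the system default for the whole domain."""
    largest = max(requests, default=-1)
    if largest == -1:            # nobody requested anything
        return system_default
    return largest
```

For example, `effective_relax_level([0, -1, -1])` yields 0, which is the
"be careful" case called out above: one explicit 0 wins over the others'
"no request".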

If your situation is:

 - The migration costs between CPUs can be assumed to be considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU caches etc.
 - The search cost has no impact (for you), or you can make the search
   cost small enough, e.g. by managing your cpusets to be compact.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately.
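The MPOL_BIND interaction described above reduces to a simple
intersection rule, sketched here as an illustrative user-space model
(the function name is an invention of this sketch; the real logic lives
in the kernel's mempolicy code):

```python
def effective_mpol_bind_nodes(bind_nodes, cpuset_mems):
    """Illustrative model of the MPOL_BIND rule: a task bound to
    a set of nodes keeps whatever subset of them is still allowed
    by its (possibly changed) cpuset; if none remain, it behaves
    as if it were MPOL_BIND bound to the new cpuset's nodes."""
    overlap = bind_nodes & cpuset_mems
    if overlap:
        return overlap
    return set(cpuset_mems)   # effectively rebound to the new cpuset
```

So a task bound to nodes {0, 1} moved into a cpuset with mems {1, 2}
keeps allocating only on node 1, while a task bound to node {0} moved
into mems {2, 3} is treated as if bound to {2, 3}.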
If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset.  The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.
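The "second valid node goes to second valid node" rule amounts to
migrating by index among the sorted valid nodes.  The sketch below is an
illustrative model only; in particular, the wrap-around when the new
cpuset has fewer nodes is a simplifying assumption of this sketch, not a
documented kernel behavior.

```python
def migrate_node(page_node, old_mems, new_mems):
    """Illustrative model of memory_migrate's relative placement:
    a page on the Nth valid node of the old cpuset moves to the
    Nth valid node of the new cpuset.  Wrapping when the new
    cpuset has fewer nodes is an assumption of this sketch."""
    old = sorted(old_mems)
    new = sorted(new_mems)
    index = old.index(page_node)   # position among old valid nodes
    return new[index % len(new)]
```

For example, a page on node 3 (the second valid node of mems {1, 3})
would land on node 7 after a move to mems {4, 7}.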

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.'
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail if
the cpuset is bound to another cgroup subsystem which has some restrictions
on task attaching.  In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.
GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are ways to query or modify cpusets:

 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.
The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system.  For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, and its properties.  By writing to these files you can manipulate
the cpuset.

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus     -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added.  To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus   -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.
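The list format accepted by these files ("1-4,6" style ranges) can be
sketched with a small parser.  This is an illustrative re-implementation
for understanding the syntax, not the kernel's own cpulist parsing code:

```python
def parse_cpulist(text):
    """Parse a cpulist string such as "1-4,6" into a set of CPU
    numbers.  An empty string means the empty set, matching the
    effect of writing "" to cpuset.cpus."""
    cpus = set()
    text = text.strip()
    if not text:
        return cpus
    for part in text.split(","):
        if "-" in part:                  # inclusive range "lo-hi"
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:                            # single CPU number
            cpus.add(int(part))
    return cpus
```

Note that "1-4" and "1,2,3,4" parse to the same set, which is why the
two echo commands above are equivalent.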

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus      -> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs.  You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
  ...
  # /bin/echo PIDn > tasks


3. Questions
============

Q:
   what's up with this '/bin/echo' ?

A:
   bash's builtin 'echo' command does not check calls to write() against
   errors.  If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first of the line gets really attached !

A:
   We can only return one error code per call to write().  So you should also
   put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset