.. _cpusets:

=======
CPUSETS
=======

Copyright (C) 2004 BULL SA.

Written by Simon.Derr@bull.net

- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
- Modified by Paul Jackson <pj@sgi.com>
- Modified by Christoph Lameter <cl@linux.com>
- Modified by Paul Menage <menage@google.com>
- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

.. CONTENTS:

   1. Cpusets
     1.1 What are cpusets ?
     1.2 Why are cpusets needed ?
     1.3 How are cpusets implemented ?
     1.4 What are exclusive cpusets ?
     1.5 What is memory_pressure ?
     1.6 What is memory spread ?
     1.7 What is sched_load_balance ?
     1.8 What is sched_relax_domain_level ?
     1.9 How do I use cpusets ?
   2. Usage Examples and Syntax
     2.1 Basic Usage
     2.2 Adding/removing cpus
     2.3 Setting flags
     2.4 Attaching processes
   3. Questions
   4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks. In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/admin-guide/cgroup-v1/cgroups.rst.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.
The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs. The location of the running jobs' pages may also be moved
when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

- Cpusets are sets of allowed CPUs and Memory Nodes, known to the
  kernel.
- Each task in the system is attached to a cpuset, via a pointer
  in the task structure to a reference counted cgroup structure.
- Calls to sched_setaffinity are filtered to just those CPUs
  allowed in that task's cpuset.
- Calls to mbind and set_mempolicy are filtered to just
  those Memory Nodes allowed in that task's cpuset.
- The root cpuset contains all the system's CPUs and Memory
  Nodes.
- For any cpuset, one can define child cpusets containing a subset
  of the parent's CPU and Memory Node resources.
- The hierarchy of cpusets can be mounted at /dev/cpuset, for
  browsing and manipulation from user space.
- A cpuset may be marked exclusive, which ensures that no other
  cpuset (except direct ancestors and descendants) may contain
  any overlapping CPUs or Memory Nodes.
- You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

- in init/main.c, to initialize the root cpuset at system boot.
- in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's
  allowed in that task's cpuset.
- in sched.c migrate_live_tasks(), to keep migrating tasks within
  the CPUs allowed by their cpuset, if possible.
- in the mbind and set_mempolicy system calls, to mask the requested
  Memory Nodes by what's allowed in that task's cpuset.
- in page_alloc.c, to restrict memory to allowed nodes.
- in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.
No new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example::

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

- cpuset.cpus: list of CPUs in that cpuset
- cpuset.mems: list of Memory Nodes in that cpuset
- cpuset.memory_migrate flag: if set, move pages to the cpuset's nodes
- cpuset.cpu_exclusive flag: is cpu placement exclusive?
- cpuset.mem_exclusive flag: is memory placement exclusive?
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:

- cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.
A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

- Its CPUs and Memory Nodes must be a subset of its parent's.
- It can't be marked exclusive unless its parent is.
- If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.
The cpus file automatically tracks the value of
cpu_online_mask using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_MEMORY] (i.e.,
nodes with memory) using the cpuset_track_online_nodes() hook.

The cpuset.effective_cpus and cpuset.effective_mems files are
normally read-only copies of cpuset.cpus and cpuset.mems files
respectively. If the cpuset cgroup filesystem is mounted with the
special "cpuset_v2_mode" option, the behavior of these files will become
similar to the corresponding files in cpuset v2. In other words, hotplug
events will not change cpuset.cpus and cpuset.mems. Those events will
only affect cpuset.effective_cpus and cpuset.effective_mems which show
the actual cpus and memory nodes that are currently used by this cpuset.
See Documentation/admin-guide/cgroup-v2.rst for more information about
cpuset v2 behavior.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e.
it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users. All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space. This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset. To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------

The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.

==>
    Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.
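
To give a feel for how a filter of this kind behaves, here is a minimal
Python sketch of an event-rate meter whose value decays with a fixed
half-life. It is not the kernel's implementation, which uses fixed-point
arithmetic under a spinlock; the class name, the floating-point math and
the explicit timestamps are assumptions made for illustration. The
10 second half-life and the scaling by 1000 follow the description of the
per-cpuset file in this section.

```python
import math


class DecayingRateMeter:
    """Illustrative event-rate meter with exponential decay (hypothetical)."""

    def __init__(self, half_life=10.0, scale=1000):
        # Continuous decay constant chosen so the value halves every half_life seconds.
        self.decay = math.log(2) / half_life
        self.scale = scale
        self.value = 0.0
        self.last = 0.0

    def _age(self, now):
        # Decay the stored value for the time elapsed since the last update.
        elapsed = max(0.0, now - self.last)
        self.value *= math.exp(-self.decay * elapsed)
        self.last = max(self.last, now)

    def mark(self, now):
        # Record one event (for example, one direct reclaim attempt).
        # The increment is sized so that a steady rate of r events per second
        # settles near r * scale.
        self._age(now)
        self.value += self.scale * self.decay

    def read(self, now):
        # Return the current estimate as an integer, as a file read would.
        self._age(now)
        return int(self.value)


# Five events per second for 200 seconds, then ten seconds of silence.
meter = DecayingRateMeter()
for i in range(1000):
    meter.mark(i * 0.2)
steady = meter.read(199.8)
later = meter.read(209.8)
print(steady, later)
```

Because the value is a decayed running average rather than a counter, a
single read gives a usable picture of recent activity, and the estimate
falls by about half for every 10 seconds in which no events occur.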

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


1.6 What is memory spread ?
---------------------------

There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures. They are called 'cpuset.memory_spread_page' and
'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns
the named feature on.
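
As a small illustration of how such a flag file is used, the Python
sketch below writes "1" or "0" to a flag file and reads it back. The
helper names are hypothetical, and the demonstration runs against a
scratch directory rather than a real cpuset; on a live system the
directory would be a cpuset directory in the mounted hierarchy, and the
write would need suitable permissions.

```python
import os
import tempfile


def set_cpuset_flag(cpuset_dir, flag, enabled):
    """Write "1" or "0" to a boolean flag file in a cpuset directory."""
    with open(os.path.join(cpuset_dir, flag), "w") as f:
        f.write("1" if enabled else "0")


def get_cpuset_flag(cpuset_dir, flag):
    """Return True if the flag file currently contains "1"."""
    with open(os.path.join(cpuset_dir, flag)) as f:
        return f.read().strip() == "1"


# Scratch directory standing in for a cpuset directory (assumption for the demo).
demo_dir = tempfile.mkdtemp()
set_cpuset_flag(demo_dir, "cpuset.memory_spread_page", True)
flag_on = get_cpuset_flag(demo_dir, "cpuset.memory_spread_page")
set_cpuset_flag(demo_dir, "cpuset.memory_spread_page", False)
flag_off = get_cpuset_flag(demo_dir, "cpuset.memory_spread_page")
print(flag_on, flag_off)
```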

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset. The page allocation calls for the page cache
are modified to perform an inline check for this PFA_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple. It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.
Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched/core.c) automatically load balances
tasks. If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, including those
marked isolated using the kernel boot time "isolcpus=" argument. However,
the isolated CPUs will not participate in load balancing, and will not
have tasks running on them unless explicitly assigned.

This default load balancing across all CPUs is not well suited for
the following two situations:

 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "cpuset.sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled.
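
As a rough sketch of that setup, the helper below clears the flag in the
top cpuset and sets it in the named child cpusets.  The function name
'partition_balancing', the CPUSET_ROOT variable and the child names in the
usage note are hypothetical; the mount point is the one used elsewhere in
this document, and the child cpusets are assumed to exist already.

```shell
# CPUSET_ROOT is an assumed variable name; it defaults to the mount
# point used throughout this document.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset}"

# partition_balancing CHILD...
# Hypothetical helper: disable load balancing in the top cpuset, then
# enable it in each named child cpuset (which must already exist).
partition_balancing() {
    /bin/echo 0 > "$CPUSET_ROOT/cpuset.sched_load_balance" || return 1
    for child in "$@"; do
        /bin/echo 1 > "$CPUSET_ROOT/$child/cpuset.sched_load_balance" || return 1
    done
}
```

For instance, 'partition_balancing rt batch' would request balancing only
inside the (hypothetical) cpusets "rt" and "batch".  Whether that yields
separate sched domains still depends on whether their 'cpuset.cpus' overlap.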

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles on some other CPU, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both.
We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

CPUs in "cpuset.isolcpus" were excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced regardless
of the value of "cpuset.sched_load_balance" in any cpuset.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.)  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.
It will pick as fine-grained a partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system.  This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:

 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
   and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.
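
The idea of a pairwise disjoint partition can be illustrated with a toy
model.  The sketch below is not the kernel's code: it takes the CPU sets
of cpusets that have the flag enabled, written as bitmasks (bit N set
means CPU N is allowed), and merges any that overlap.  The function name
'merge_domains' is hypothetical, and the masks are limited to the width of
a shell integer.

```shell
# merge_domains MASK...
# Toy model (hypothetical helper, not kernel code): merge overlapping
# CPU bitmasks until the remaining masks are pairwise disjoint, then
# print one mask per line.  Uses global shell variables.
merge_domains() {
    out=""
    for m in "$@"; do
        acc=$(( $m ))
        rest=""
        for o in $out; do
            if [ $(( acc & o )) -ne 0 ]; then
                # Overlap: fold the existing set into the new one.
                acc=$(( acc | o ))
            else
                rest="$rest $o"
            fi
        done
        out="$rest $acc"
    done
    for o in $out; do
        printf '0x%x\n' "$o"
    done
}
```

With the inputs 0x0f (CPUs 0-3), 0x3c (CPUs 2-5) and 0xc0 (CPUs 6-7),
'merge_domains 0x0f 0x3c 0xc0' prints 0x3f and 0xc0: the first two sets
overlap and end up in one domain, while the third stays separate.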

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of certain scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle,
then the scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes
idle.

Of course, searching for movable tasks and/or idle CPUs has a cost,
so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the search range on
events is limited to the socket or node where the CPU is located,
while the load balance on tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is outside its search range.
As a result, task B on CPU X needs to wait for task A or for load
balancing on the next tick.  For some applications in special situations,
waiting one tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request changing
this search range.  The file takes an int value which indicates the
size of the search range, in levels, approximately as follows.  The
initial value is -1, which indicates that the cpuset has no request.

====== ===========================================================
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
   2   search cores in a package.
   3   search cpus in a node [= system wide on non-NUMA system]
   4   search nodes in a chunk of node [on NUMA system]
   5   search system wide [on NUMA system]
====== ===========================================================

Not all levels may be present, and the values can change depending on the
system architecture and kernel configuration. Check
/sys/kernel/debug/sched/domains/cpu*/domain*/ for system-specific
details.

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the cpuset
belongs.  Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect, since
there is no sched domain belonging to the cpuset.

If multiple cpusets overlap and hence form a single sched
domain, the largest value among them is used.  Be careful: if one
requests 0 and the others are -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.
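
The "largest value wins" rule for overlapping cpusets can be sketched as
follows.  This is only an illustration of the rule as stated above, not
kernel code, and the function name 'effective_relax_level' is hypothetical.
Its arguments are the values written to 'cpuset.sched_relax_domain_level'
in each of the overlapping cpusets.

```shell
# effective_relax_level LEVEL...
# Hypothetical helper: print the largest of the requested levels,
# starting from -1 ("no request").
effective_relax_level() {
    max=-1
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    echo "$max"
}
```

For example, 'effective_relax_level -1 0 -1' prints 0, which is the case
the text warns about: a single cpuset requesting "no search" overrides
the others that made no request.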

If your situation is:

 - The migration cost between CPUs can be assumed to be considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The searching cost doesn't have an impact (for you), or you can make
   the searching cost small enough by keeping the cpuset compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.

then increasing 'cpuset.sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately.
If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated on, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.
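
A minimal sketch of using this flag: set 'cpuset.memory_migrate' in the
destination cpuset first, then attach the task, so that the attach
triggers the migration described above.  The function name
'attach_with_migration' is hypothetical, and the destination cpuset is
assumed to exist with its 'cpuset.cpus' and 'cpuset.mems' already set.

```shell
# attach_with_migration CPUSET_DIR PID
# Hypothetical helper: enable page migration in the destination cpuset,
# then attach one task to it by writing its pid to the 'tasks' file.
attach_with_migration() {
    /bin/echo 1 > "$1/cpuset.memory_migrate" || return 1
    /bin/echo "$2" > "$1/tasks"
}
```

It could be called as 'attach_with_migration /sys/fs/cgroup/cpuset/Charlie $$'
to move the current shell, where "Charlie" is just an example cpuset name.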

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail if
the cpuset is bound with another cgroup subsystem which has some restrictions
on task attaching.  In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.
GFP_ATOMIC requests are
kernel internal allocations that must be satisfied, immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset::

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are several ways to query or modify cpusets:

 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (https://github.com/libcgroup/libcgroup/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.
The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, using the cpusets can be done through the cpuset
virtual filesystem.

To mount it, type::

  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset::

  # cd /sys/fs/cgroup/cpuset
  # mkdir my_cpuset

Now you want to do something with this cpuset::

  # cd my_cpuset

In this directory you can find several files::

  # ls
  cgroup.clone_children  cpuset.memory_pressure
  cgroup.event_control   cpuset.memory_spread_page
  cgroup.procs           cpuset.memory_spread_slab
  cpuset.cpu_exclusive   cpuset.mems
  cpuset.cpus            cpuset.sched_load_balance
  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
  cpuset.mem_hardwall    notify_on_release
  cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties.  By writing to these files you can manipulate
the cpuset.
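
As a small illustration of reading that state, the sketch below prints a
few of these files as "name: value" lines.  The function name
'show_cpuset' is hypothetical, and the choice of files is arbitrary; any
of the files listed above can be read the same way.

```shell
# show_cpuset CPUSET_DIR
# Hypothetical helper: print a few of the cpuset's control files,
# one "name: value" pair per line.
show_cpuset() {
    for f in cpuset.cpus cpuset.mems cpuset.cpu_exclusive \
             cpuset.sched_load_balance; do
        printf '%s: %s\n' "$f" "$(cat "$1/$f")"
    done
}
```

From inside the cpuset directory it can be run as 'show_cpuset .'.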

Set some flags::

  # /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus::

  # /bin/echo 0-7 > cpuset.cpus

Add some mems::

  # /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset::

  # /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory::

  # mkdir my_sub_cs

To remove a cpuset, just use rmdir::

  # rmdir my_sub_cs

This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command::

  mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to::

  mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
  echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories::

  # /bin/echo 1-4 > cpuset.cpus         -> set cpus list to cpus 1,2,3,4
  # /bin/echo 1,2,3,4 > cpuset.cpus     -> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset::

  # /bin/echo 1-4,6 > cpuset.cpus       -> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.
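
Because the file always takes the complete new list, adding a CPU amounts
to reading the current list and writing it back with the new CPU appended.
A minimal sketch, assuming the function name 'add_cpu' (hypothetical) and
a cpuset directory passed as the first argument:

```shell
# add_cpu CPUSET_DIR CPU
# Hypothetical helper: append one CPU to the cpuset's current list by
# rewriting the whole 'cpuset.cpus' file.
add_cpu() {
    current=$(cat "$1/cpuset.cpus") || return 1
    if [ -n "$current" ]; then
        /bin/echo "$current,$2" > "$1/cpuset.cpus"
    else
        /bin/echo "$2" > "$1/cpuset.cpus"
    fi
}
```

With the list set to 1-4 as above, 'add_cpu . 6' writes "1-4,6", the same
string used in the example.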

To remove all the CPUs::

  # /bin/echo "" > cpuset.cpus          -> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple::

  # /bin/echo 1 > cpuset.cpu_exclusive  -> set flag 'cpuset.cpu_exclusive'
  # /bin/echo 0 > cpuset.cpu_exclusive  -> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

::

  # /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another::

  # /bin/echo PID1 > tasks
  # /bin/echo PID2 > tasks
        ...
  # /bin/echo PIDn > tasks


3. Questions
============

Q:
   what's up with this '/bin/echo' ?

A:
   bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q:
   When I attach processes, only the first one on the line really gets
   attached !

A:
   We can only return one error code per call to write(). So you should also
   put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset