=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.
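
As an illustration of the command path described above, the following is a
minimal sketch of how userspace might lay out a request for one pid. The
helper name build_taskstats_cmd() and its parameters are hypothetical and do
not come from any kernel header. The sketch assumes the taskstats generic
netlink family id has already been resolved at runtime (through the generic
netlink controller, using the family name TASKSTATS_GENL_NAME) and is passed
in by the caller; it only fills a buffer and does not open a socket.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/genetlink.h>
    #include <linux/taskstats.h>

    /*
     * Hypothetical helper (not part of any kernel header): lay out
     * nlmsghdr + genlmsghdr + one netlink attribute in buf.
     * buf must be suitably aligned for struct nlmsghdr.
     * Returns the total message length, or 0 if buf is too small.
     */
    static size_t build_taskstats_cmd(void *buf, size_t buflen,
                                      uint16_t family_id, uint16_t attr_type,
                                      const void *payload, uint16_t payload_len,
                                      uint32_t seq)
    {
        size_t total = NLMSG_HDRLEN + GENL_HDRLEN +
                       NLA_ALIGN(NLA_HDRLEN + payload_len);
        struct nlmsghdr *nlh = buf;
        struct genlmsghdr *genl;
        struct nlattr *nla;

        if (total > buflen)
            return 0;
        memset(buf, 0, total);

        /* Netlink header: the message type is the genetlink family id. */
        nlh->nlmsg_len = total;
        nlh->nlmsg_type = family_id;    /* assumed resolved by the caller */
        nlh->nlmsg_flags = NLM_F_REQUEST;
        nlh->nlmsg_seq = seq;

        /* Generic netlink header: the taskstats command and version. */
        genl = (struct genlmsghdr *)((char *)buf + NLMSG_HDRLEN);
        genl->cmd = TASKSTATS_CMD_GET;
        genl->version = TASKSTATS_GENL_VERSION;

        /* One attribute, e.g. TASKSTATS_CMD_ATTR_PID with a u32 payload. */
        nla = (struct nlattr *)((char *)genl + GENL_HDRLEN);
        nla->nla_type = attr_type;
        nla->nla_len = NLA_HDRLEN + payload_len;
        memcpy((char *)nla + NLA_HDRLEN, payload, payload_len);

        return total;
    }

A caller would pass TASKSTATS_CMD_ATTR_PID or TASKSTATS_CMD_ATTR_TGID with a
u32 payload, then send the buffer on its NETLINK_GENERIC socket. The same
layout should also fit a register command, with
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK and the cpumask string as payload.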

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
   attribute payload. The cpumask is specified as an ascii string of
   comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
   the cpumask would be "1-3,5,7-8".
   If userspace forgets to deregister interest in cpus before closing the
   listening socket, the kernel cleans up its interest set over time. However,
   for the sake of efficiency, an explicit deregistration is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
      indicates that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by the kernel whenever a task exits. The payload consists
   of a series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be
      tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's
      process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited threads of
the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

If userspace still receives ENOBUFS errors, indicating overflow of the
receive buffers, despite these measures, it should take steps to handle the
loss of data.
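
The first and last points above can be sketched as follows. This is only an
illustration under stated assumptions: the helper names grow_rcvbuf() and
recv_exit_data() are hypothetical, the socket is assumed to be already open
and registered for exit data, and counting overflows is just one possible way
of handling the loss. Only the standard socket calls setsockopt(),
getsockopt() and recv() are used.

.. code-block:: c

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /*
     * Hypothetical helper: request a larger receive buffer on fd and return
     * the size the kernel actually granted, or -1 on error. The granted size
     * may differ from the request because the kernel applies its own limits.
     */
    static int grow_rcvbuf(int fd, int bytes)
    {
        int granted = 0;
        socklen_t len = sizeof(granted);

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
            return -1;
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
            return -1;
        return granted;
    }

    /*
     * Hypothetical helper: receive one message and count overflows.
     * Returns the number of bytes read, 0 if recv() failed with ENOBUFS
     * (some exit records were dropped; *lost is incremented), or -1 on any
     * other error.
     */
    static ssize_t recv_exit_data(int fd, void *buf, size_t buflen,
                                  unsigned long *lost)
    {
        ssize_t n = recv(fd, buf, buflen, 0);

        if (n < 0 && errno == ENOBUFS) {
            (*lost)++;
            return 0;
        }
        return n;
    }

A listener could call grow_rcvbuf() once after opening its socket and then
loop on recv_exit_data(), reporting the overflow count alongside the
statistics so that consumers know the data is incomplete.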